<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Deep Learning Based Approach to Detect Suspicious Weapons</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Prashant Varshney</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harsh Tyagi</string-name>
          <email>tyagih1699@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikhil Kr. Lohia</string-name>
          <email>nikhillohia6128@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhishek Kajla</string-name>
          <email>abhishekkajla5511@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Palak Girdhar</string-name>
          <email>palakgirdhar@bpitindia.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science &amp; Engineering Department, Bhagwan Parshuram Institute of Technology, Guru Gobind Singh Indraprastha University</institution>
          ,
          <addr-line>Sector 17, Rohini, New Delhi, 110089</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Over the past few decades, the world has witnessed many terrorist and criminal activities. Public surveillance systems have gained importance as a countermeasure. Various state governments have begun installing cameras in densely populated and strategically important cities to safeguard their citizens. Covering an entire city with a surveillance network of thousands of cameras requires hundreds of security personnel to monitor the video feeds in real time. To keep this task cost-effective and feasible, one security officer typically monitors about 6-8 cameras, which often leads to missed threats. A single lapse of concentration can cost many lives. This research paper determines an optimized, efficient, and fast way to detect commonly used weapons such as the AK47, hand revolver, pistol, combat knife, and grenade in a live video feed.</p>
      </abstract>
      <kwd-group>
        <kwd>Object Detection</kwd>
        <kwd>Neural Network</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>mAP Score</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Object Detection is a field of Artificial Intelligence, closely tied to Digital Image Processing and Computer Vision, that deals with detecting instances of objects such as cars, humans, or weapons sharing features with the trained object classes [1]. Object Detection methods are generally categorized as either ML-based or DL-based, depending upon the complexity of the object class. Machine Learning based approaches require features to be defined beforehand using methods like Haar Cascades or SIFT, and then typically apply a support vector machine (SVM) to classify the object. Deep Learning based approaches use Artificial Neural Networks to perform end-to-end object detection without hand-crafted features. Recent growth in the field of Artificial Intelligence has contributed greatly to solving major crises all over the world. This research paper focuses on Deep Learning object detection models such as RCNN (Region-based Convolutional Neural Network) [2], SSD (Single Shot Detector) [3], and YOLO (You Only Look Once) [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">4, 5, 6, 7</xref>
        ], which are used to detect the possession of suspicious weapons in video surveillance. The primary goal of this paper is to analyze the performance of these models and determine the most efficient and reliable model amongst them for surveillance purposes.
      </p>
      <p>
        The main inspiration for this research came from the attack on the Chhatrapati Shivaji Terminus railway station in Mumbai, where two terrorists entered the station with automatic assault rifles and opened indiscriminate fire, killing 58 civilians and injuring more than 100 [
        <xref ref-type="bibr" rid="ref5">8</xref>
        ]. Had an AI-based weapon detection technology been monitoring the city at that time, the Police Control Room would have known about the attackers beforehand. We decided to build such a system, which might add an extra layer of security at public places and help prevent such threatening activities.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The detection of threatening weapons in a surveillance system is a challenging task. Covering an entire city with a surveillance network of thousands of cameras requires hundreds of security personnel to monitor the video feeds in real time. To keep this task cost-effective and feasible, one security officer typically monitors about 6-8 cameras, which leads to threats going undetected at their initial stage and to delayed responses, causing casualties.</p>
      <p>
        According to research by Velastin et al. of Queen Mary University of London, carried out in 2006, operators in many instances fail to notice threatening objects in a video feed after about 20 minutes of monitoring [
        <xref ref-type="bibr" rid="ref6">9</xref>
        ]. Further research found that after 12 minutes of monitoring a video feed, an operator is likely to miss up to 45% of suspicious activities, and after 22 minutes, up to 95% [
        <xref ref-type="bibr" rid="ref7">10</xref>
        ]. Thus the most effective solution to this problem may be to remove humans from the equation as far as possible.
      </p>
      <p>
        In 2001, Paul Viola and Michael Jones proposed the first robust, efficient, real-time Machine Learning based object detection framework in their paper "Rapid Object Detection using a Boosted Cascade of Simple Features" [
        <xref ref-type="bibr" rid="ref8 ref9">11,12</xref>
        ]. This framework can be trained to detect a variety of objects by feeding the classifier many images that contain the target object (positive images) along with similar images that do not contain it (negative images). However, this approach cannot reliably identify complex objects appearing in different orientations and sizes.
      </p>
      <p>Deep Learning based object detection frameworks mainly fall into two types:
1. Region proposal based frameworks, which include models such as RCNN, FRCNN, and Faster RCNN.
2. Regression-based frameworks, which include models such as YOLO and SSD.</p>
      <p>Region proposal based algorithms use a sliding-window approach to extract features from the visual data. In 2014, Ross Girshick presented the RCNN model based on this approach, which obtained an mAP of 53.3% on the PASCAL VOC dataset, an improvement of about 30% over previous results. In this model, the whole image is processed with a Convolutional Neural Network to produce a feature map, and then a fixed-length feature vector is extracted from each region proposal with a Region of Interest (RoI) pooling layer.</p>
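      <p>As a toy illustration of the sliding-window idea mentioned above (function name and window parameters are hypothetical; real detectors such as RCNN use far more sophisticated proposal schemes like selective search), candidate windows over an image grid can be enumerated as follows:</p>
      <p>
```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Enumerate candidate (x, y, w, h) windows over an image grid.

    A minimal sketch of the sliding-window idea behind region
    proposal methods; each returned tuple is one candidate region.
    """
    boxes = []
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            boxes.append((x, y, win_w, win_h))
    return boxes

# Example: a 64x64 image with a 32x32 window and stride 16
# yields a 3x3 grid of 9 candidate regions.
windows = sliding_windows(64, 64, 32, 32, 16)
```
      </p>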
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Methodology</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. Flowchart</title>
    </sec>
    <sec id="sec-5">
      <title>3.2. Scraping images of commonly used weapons</title>
      <p>To build the DL-based object detection models, nearly 5000 images of commonly used weapons such as the AK47, hand pistol, revolver, shotgun, and combat knife were scraped in various sizes and orientations, and later pre-processed to build the dataset. These images were gathered from different sources available on the internet using automation scripts.</p>
    </sec>
    <sec id="sec-6">
      <title>3.3. Building labelled dataset</title>
      <p>After pre-processing and scaling, the scraped images were labelled using the "LabelImg" software. This software helps in labelling class objects: the user simply marks the object in the image with a selection tool, and the software creates a text (*.txt) file storing the exact 2D coordinates of the bounding box along with the class identifier. These files, combined with the images, form a complete labelled dataset that can be used to train any desired DL-based object detection model.</p>
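      <p>Assuming a YOLO-style label file (one line per object: class index followed by normalized box coordinates, one of the formats LabelImg can export; the paper does not specify which format was used), each line can be parsed as follows:</p>
      <p>
```python
def parse_label_line(line):
    """Parse one YOLO-style label line: 'class x_center y_center w h'.

    The four box values are normalized to [0, 1] relative to the
    image size. (Assumed format; hypothetical helper name.)
    """
    parts = line.split()
    cls = int(parts[0])
    x, y, w, h = map(float, parts[1:])
    return cls, (x, y, w, h)

# A knife (say, class 2) centered in the image, a quarter wide:
cls, box = parse_label_line("2 0.5 0.5 0.25 0.40")
```
      </p>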
    </sec>
    <sec id="sec-7">
      <title>3.4. Model training</title>
      <p>Different Deep Learning frameworks and models, namely RCNN, SSD, and YOLO, were then trained on the dataset built above. The dataset was randomly divided in an 80-20 ratio: 80 percent of the dataset was used to train each model, and the remaining 20 percent was reserved for testing. Google Colab was used to train these DL-based models because of its free and powerful GPUs.</p>
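      <p>The random 80-20 split described above can be sketched in a few lines of Python (the function name and the seed argument, added here for reproducibility, are illustrative):</p>
      <p>
```python
import random

def train_test_split(items, train_frac=0.8, seed=42):
    """Shuffle the dataset and split it into train and test parts.

    A minimal sketch of the random 80-20 split; seed fixes the
    shuffle so the split is reproducible.
    """
    items = list(items)
    rng = random.Random(seed)
    rng.shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

# Splitting the ~5000-image dataset: 4000 train, 1000 test.
train, test = train_test_split(range(5000))
```
      </p>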
    </sec>
    <sec id="sec-8">
      <title>4. Experimental Results</title>
      <p>After training on 80 percent of the dataset, the desired weights were obtained. The remaining 20 percent of the dataset was then tested against these weights, giving a complete and detailed analysis of the accuracy and precision of the models, which was recorded for further comparison in order to eventually determine the best detection model.</p>
      <p>
        In computer vision, Mean Average Precision (mAP) is used to evaluate object detection models [
        <xref ref-type="bibr" rid="ref10">13</xref>
        ]. It measures accuracy by counting the number of correct predictions the model makes. To find the mAP score of a model, we first have to find the Intersection over Union (IoU), precision, and recall.
      </p>
      <p>Object detection models generate predictions in terms of a class label and a bounding box. For every prediction, we measure the IoU as the ratio of the area of overlap between the predicted bounding box and the ground truth bounding box to the area of union of both bounding boxes. Mathematically,</p>
      <p>IoU = Area of Overlap / Area of Union (1)</p>
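      <p>Equation (1) can be computed directly in Python; the sketch below assumes boxes are given as (x1, y1, x2, y2) corner coordinates, a representation the paper does not specify:</p>
      <p>
```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned bounding boxes.

    Boxes are (x1, y1, x2, y2) corner coordinates. Implements
    equation (1): overlap area divided by union area.
    """
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two 10x10 boxes offset by 5: overlap 50, union 150, IoU 1/3.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # -> 0.3333...
```
      </p>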
      <p>Using this IoU value, we find the values of recall and precision for a given threshold. For example, if the IoU threshold is set to 0.7 and a prediction achieves an IoU of 0.8, which is greater than or equal to the threshold, the prediction is classified as a True Positive (TP); otherwise it is classified as a False Positive (FP).</p>
      <p>The precision of an object detection model is the ratio of the number of True Positives to the total number of True Positives and False Positives. Mathematically,</p>
      <p>Precision = TP / (TP + FP) (2)</p>
      <p>where TP (true positives) are predictions that are positive and correct, and FP (false positives) are predictions that are positive but incorrect.</p>
      <p>Recall measures the fraction of relevant objects that the detection model actually predicts. It is the ratio of the number of True Positives to the total number of True Positives and False Negatives. Mathematically,</p>
      <p>Recall = TP / (TP + FN) (3)</p>
      <p>where TP (true positives) are predictions that are positive and correct, and FN (false negatives) are objects that are present but that the model fails to predict.</p>
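      <p>Equations (2) and (3), together with the IoU threshold rule, can be sketched as follows (hypothetical helper; this simplified version ignores duplicate detections matched to the same ground truth, which a full mAP evaluation must handle):</p>
      <p>
```python
def precision_recall(detection_ious, num_ground_truths, iou_thresh=0.7):
    """Compute precision and recall per equations (2) and (3).

    detection_ious: best IoU with any ground-truth box, one value per
    detection. A detection is a TP if its IoU meets the threshold,
    otherwise an FP; unmatched ground truths are FNs.
    """
    tp = sum(1 for v in detection_ious if v >= iou_thresh)
    fp = len(detection_ious) - tp
    fn = num_ground_truths - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 3 detections with IoUs 0.8, 0.6, 0.9 against 4 ground truths:
# TP=2, FP=1, FN=2, so precision = 2/3 and recall = 1/2.
p, r = precision_recall([0.8, 0.6, 0.9], 4)
```
      </p>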
      <p>
        The Average Precision (AP) is the area under the precision vs recall curve [
        <xref ref-type="bibr" rid="ref11">14</xref>
        ]. The mAP (Mean Average Precision) is the average of the Average Precision (AP) over all object classes.
      </p>
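      <p>The area under the precision-recall curve can be approximated numerically; the trapezoidal sketch below is one of several AP conventions (PASCAL VOC, for instance, uses interpolated precision instead), and the function name is illustrative:</p>
      <p>
```python
def average_precision(points):
    """Approximate AP as the area under a precision-recall curve.

    points: (recall, precision) pairs sorted by increasing recall.
    Uses the trapezoidal rule, starting from recall 0 at precision 1.
    """
    ap = 0.0
    prev_r, prev_p = 0.0, 1.0
    for r, p in points:
        ap += (r - prev_r) * (p + prev_p) / 2.0
        prev_r, prev_p = r, p
    return ap

# A perfect detector keeps precision 1.0 at every recall level:
print(average_precision([(0.5, 1.0), (1.0, 1.0)]))  # -> 1.0
```
      </p>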
    </sec>
    <sec id="sec-9">
      <title>4.2. Analysis of Average Precision (AP) and GPU time to process one frame (in ms)</title>
    </sec>
    <sec id="sec-10">
      <title>5. Conclusion</title>
      <p>A comparative study of various DL-based object detectors that use Artificial Neural Networks to classify and localize visual data was conducted, tested on a custom dataset of commonly used weapons. Amongst all the DL-based object detection frameworks, SSD has the lowest mAP score of 20, but its computational time to detect objects was the fastest; it can therefore be a better choice when a fast object detector is needed in a trade-off against accuracy. YOLO and RCNN provided similar mAP scores of 33 and 34.2 respectively, giving better detection accuracy. The trained YOLO model, however, is considerably faster than the RCNN model, making it an efficient and reliable object detector.</p>
    </sec>
    <sec id="sec-11">
      <title>6. Future Scope</title>
      <p>There are several directions in which this work can be extended:
1. This research is limited to weapons such as the AK47, shotguns, 9mm hand pistols, combat knives, and hand grenades, which are among the weapons most commonly used by criminals. There is still a large number of dangerous and illegal weapons whose possession by any individual can cause a problem, so our target will be to add more of these labelled datasets to the training set to ensure maximum accuracy.
2. The project can be enhanced into a collision detection system, in which all sorts of vehicle accidents can be traced and reported to the nearby police station to prevent any further damage.
3. Another application can be the detection of any major amount of blood around an area, with a report sent to a nearby hospital to avoid a fatality.
4. Another possible application could be the detection of fire at any place, which upon detection can be reported directly to a fire department, ensuring minimum damage in that area.
5. One can also monitor traffic using this system: the cameras detect vehicles breaking a law and report them to a traffic control department, helping to resolve traffic issues.</p>
      <p>One limitation of our project is that there is currently no way to detect a weapon that a criminal has hidden in a pocket or a suitcase. We are considering ways to overcome this problem and build a better and safer environment for citizens.</p>
    </sec>
    <sec id="sec-12">
      <title>7. Acknowledgement</title>
      <p>The Graphics Processing Unit (GTX 1050 Ti) used in this research was provided by the Computer Science &amp; Engineering Department, Bhagwan Parshuram Institute of Technology, Delhi, India.</p>
    </sec>
    <sec id="sec-13">
      <title>8. References</title>
      <p>[1] Dasiopoulou, Stamatia, et al. "Knowledge-assisted semantic video object detection." IEEE Transactions on Circuits and Systems for Video Technology 15.10 (2005): 1210-1224.
[2] Girshick, Ross (2014). "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 580-587. arXiv:1311.2524. doi:10.1109/CVPR.2014.81. ISBN 978-1-4799-5118-5. S2CID 215827080.
[3] Liu, Wei (October 2016). "SSD: Single shot multibox detector." Computer Vision - ECCV 2016. European Conference on Computer Vision. Lecture Notes in Computer Science. 9905. pp. 21-37. arXiv:1512.02325. doi:10.1007/978-3-319-46448-0_2. ISBN 978-3-319-46447-3. S2CID 2141740.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <surname>Joseph</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>"You only look once: Unified, real-time object detection"</article-title>
          .
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. arXiv:1506.02640. Bibcode: 2015arXiv150602640R.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <surname>Joseph</surname>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>"YOLO9000: better, faster, stronger"</article-title>
          .
          <source>arXiv:1612.08242 [cs.CV].</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <surname>Joseph</surname>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>"Yolov3: An incremental improvement"</article-title>
          . arXiv:1804.02767 [cs.CV].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [7] Zhang,
          <string-name>
            <surname>Shifeng</surname>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>"Single-Shot Refinement Neural Network for Object Detection"</article-title>
          .
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          :
          <fpage>4203</fpage>
          -
          <lpage>4212</lpage>
          . arXiv:1711.06897. Bibcode:2017arXiv171106897Z.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [8]
          <article-title>"26/11 Mumbai Terror Attacks Aftermath: Security Audits Carried Out On 227 Non-Major Seaports Till Date"</article-title>
          .
          <source>NDTV</source>
          . Press Trust of India. 26 November 2017. Retrieved 7 December
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Velastin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Boghossian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Vicencio-Silva</surname>
          </string-name>
          ,
          <article-title>A motion-based image processing system for detecting potentially dangerous situations in underground railway stations</article-title>
          ,
          <source>Transportation ResearchPart C: Emerging Technologies</source>
          <volume>14</volume>
          (
          <issue>2</issue>
          ) (
          <year>2006</year>
          )
          <fpage>96</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ainsworth</surname>
          </string-name>
          , Buyer beware,
          <source>Security Oz</source>
          <volume>19</volume>
          (
          <year>2002</year>
          )
          <fpage>18</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [11]
          <article-title>Rapid object detection using a boosted cascade of simple features</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Viola</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
          </string-name>
          :
          <article-title>Robust Real-time Object Detection</article-title>
          ,
          <source>IJCV</source>
          <year>2001</year>
          (pages
          <issue>1, 3</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Hughes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>1968</year>
          ).
          <article-title>On the mean accuracy of statistical pattern recognizers</article-title>
          .
          <source>IEEE transactions on information theory</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ),
          <fpage>55</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Buckland</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Gey</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>1994</year>
          ).
          <article-title>The relationship between recall and precision</article-title>
          .
          <source>Journal of the American society for information science</source>
          ,
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>12</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Ahmad Salihi</given-names>
            <surname>Ben Musa</surname>
          </string-name>
          , Sanjay Kumar Singh, Prateek Agrawal,
          <article-title>"Suspicious Human Activity Recognition for Video Surveillance System"</article-title>
          ,
          <source>International Conference on Control, Instrumentation, Communication &amp; Computational Technologies (ICCICCT-2014)</source>
          , IEEE Xplore.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Nivid</given-names>
            <surname>Limbasiya</surname>
          </string-name>
          , Prateek Agrawal,
          <article-title>"Bidirectional Long Short Term Memory Based Spatio-Temporal in Community Question Answering"</article-title>
          , in Deep Learning Based Approaches for Sentiment Analysis, pp.
          <fpage>291</fpage>
          -
          <lpage>310</lpage>
          , Jan
          <year>2020</year>
          , Springer.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Prateek</given-names>
            <surname>Agrawal</surname>
          </string-name>
          , Deepak Chaudhary, Vishu Madaan, Anatoliy Zabrovskiy, Radu Prodan, Dragi Kimovski, Christian Timmerer,
          <article-title>"Automated Bank Cheque Verification Using Image Processing and Deep Learning Methods"</article-title>
          ,
          <source>Multimedia Tools and Applications (MTAP)</source>
          ,
          <volume>80</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          . https://doi.org/10.1007/s11042-020-09818-1.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Neha</given-names>
            <surname>Bhadwal</surname>
          </string-name>
          , Prateek Agrawal, Vishu Madaan, Awadhesh Shukla, Anuj Kakran,
          <article-title>"Smart Border Surveillance System using Wireless Sensor Network and Computer Vision"</article-title>
          ,
          <source>International Conference on Automation, Computational and Technology Management (ICACTM'19)</source>
          , pp.
          <fpage>183</fpage>
          -
          <lpage>190</lpage>
          , IEEE Xplore.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>