A Deep Learning Based Approach to Detect Suspicious Weapons

Prashant Varshney, Harsh Tyagi, Nikhil Kr. Lohia, Abhishek Kajla and Palak Girdhar

Computer Science & Engineering Department, Bhagwan Parshuram Institute of Technology, Guru Gobind Singh Indraprastha University, Sector 17, Rohini, New Delhi, 110089, India

ACI'21: Workshop on Advances in Computational Intelligence at ISIC 2021, February 25–27, 2021, Delhi, India
EMAIL: pv03158@gmail.com (P. Varshney); tyagih1699@gmail.com (H. Tyagi); nikhillohia6128@gmail.com (N. Lohia); abhishekkajla5511@gmail.com (A. Kajla); palakgirdhar@bpitindia.com (P. Girdhar)
ORCID: 0000-0003-0497-6214 (P. Varshney); 0000-0001-5919-811X (H. Tyagi); 0000-0002-7338-7619 (N. Lohia); 0000-0002-4042-6001 (P. Girdhar)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

Abstract
Over the past few decades, the world has witnessed many terrorist and criminal activities. Public surveillance systems have gained importance as a response to these activities. Various state governments have started to install cameras in their densely populated and important cities to safeguard their citizens. To cover a complete city under a surveillance network of thousands of cameras, hundreds of security personnel are needed to monitor the video feeds in real time. To make this task cost-effective and feasible, a single security officer monitors nearly 6–8 cameras, which usually leads to failures in detecting threats. One slip of concentration can endanger many lives. This research paper determines an optimized, efficient and faster way to detect commonly used weapons such as the AK47, hand revolver, pistol, combat knife and grenade in a live video feed.

Keywords
Object Detection, Neural Network, Deep Learning, Computer Vision, mAP Score

1. Introduction
Object Detection is a field of Artificial Intelligence associated with Digital Image Processing and Computer Vision. It deals with detecting instances of objects such as cars, humans and weapons whose features are similar to those of the trained object classes [1].

Object Detection methods are generally categorized as either Machine Learning (ML) based or Deep Learning (DL) based, depending upon the complexity of the object class. Machine Learning based approaches require features to be defined beforehand using methods such as Haar Cascade or SIFT, and then use the support vector machine (SVM) technique to detect the object class. Deep Learning based approaches use Artificial Neural Networks to perform end-to-end object detection without features being defined explicitly.

Recent growth in the field of Artificial Intelligence has contributed a lot to solving major crises all over the world. This research paper mainly focuses on different Deep Learning object detection models, namely RCNN (Region-based Convolutional Neural Network) [2], SSD (Single Shot Detector) [3] and YOLO (You Only Look Once) [4, 5, 6, 7], with which the possession of suspicious weapons is detected in video surveillance. The primary goal of this paper is to analyze the performance of these models and determine the most efficient and reliable model amongst them for surveillance purposes.

The main inspiration for this research paper came from the attack on the Chhatrapati Shivaji Terminus railway station in Mumbai, where a couple of terrorists entered the station with automatic assault rifles and started shooting indiscriminately, killing 58 civilians and causing more than 100 casualties [8]. If an AI-based weapon detection technology had been monitoring the city at that time, the Police Control Room would have known about them beforehand. We decided to build such a system, which might add an extra layer of security at public places and help prevent such threatening activities.

2. Related Work
The detection of threatening weapons in a surveillance system is a challenging task. To cover a complete city under a surveillance network of thousands of cameras, hundreds of security personnel are needed to monitor the video feeds in real time. To make this task cost-effective and feasible, a single security officer monitors nearly 6–8 cameras, which leads to failures in detecting threats at their initial stage and results in delayed responses that cause casualties. According to research by Velastin et al. of Queen Mary University of London, carried out in 2006, operators in many instances fail to notice the presence of threatening objects in a video feed after about 20 minutes of monitoring [9]. Further research found that after 12 minutes of monitoring a video feed, an operator is likely to miss up to 45% of suspicious activities, and after 22 minutes of monitoring, up to 95% [10]. Thus the most effective solution to this problem could be to eliminate humans from the equation as much as possible.

In 2001, Paul Viola and Michael Jones proposed the first robust, efficient and real-time Machine Learning based object detection framework in their paper "Rapid Object Detection using a Boosted Cascade of Simple Features" [11, 12]. This framework can be trained to detect a variety of objects from a large number of images that contain the object the classifier should detect (positive images) and similar images without that object (negative images). However, this approach cannot be used to identify complex objects in different orientations and sizes.

Deep Learning based object detection frameworks are mainly of two types:
1. Region Proposal based frameworks, which include models like RCNN, FRCNN and Faster RCNN.
2. Regression-based frameworks, which include models like YOLO and SSD.

Region Proposal-based algorithms use a sliding window approach to extract features from the visual data. In 2014, Ross Girshick presented the RCNN model based on this algorithm, which obtained an mAP of 53.3% on the PASCAL VOC dataset. Compared with the results previously achieved on that dataset, this was an improvement of 30%. In this model, the whole image is processed with a Convolutional Neural Network to produce a feature map, and then a fixed-length feature vector is extracted from each region proposal with a Region of Interest (RoI) pooling layer.

3. Proposed Methodology

3.1. Flowchart

Figure 1: Flowchart of the proposed weapon detection system

3.2. Scraping images of commonly used weapons
To build the DL-based object detection model, nearly 5000 images of commonly used weapons such as the AK47, hand pistol, revolver, shotgun and combat knife, in various sizes and orientations, were scraped and later pre-processed to build the dataset. These images were gathered from different sources on the internet using automation scripts.

3.3. Building labelled dataset
After pre-processing and scaling, the scraped images were labelled using the "LabelImg" software. This software helps in labelling class objects by manually marking the object in the image with a selection tool. The software then creates a text (*.txt) file in which the exact 2D coordinates of the bounding box are stored along with the class identifier. These files, combined with the images, give a complete labelled dataset that can be used to train any desired DL-based object detection model.

3.4. Model training
Different Deep Learning frameworks and models such as RCNN, SSD and YOLO were then trained using this dataset. The dataset was randomly divided in the ratio 80:20. The models were trained on 80 percent of the dataset, and the remaining 20 percent was used for testing. Google Colab was used to train these DL-based models because of its free and powerful GPUs.

4. Experimental Results
After training on 80 percent of the dataset, the desired weights were obtained. The remaining 20 percent of the dataset was tested against these weights, giving a detailed analysis of the accuracy and precision of the models, which was recorded for comparison and for eventually determining the best detection model.

4.1. Evaluation Metrics in an Object Detection Model
In computer vision, Mean Average Precision (mAP) is used to evaluate an object detection model [13]. It measures accuracy by calculating the number of correct predictions that the model made. To find the mAP score of a model, we first have to find the values of Intersection over Union (IoU), precision and recall.

Object detection models generate predictions in terms of a class label and a bounding box. For every prediction, we measure IoU as the ratio of the area of overlap between the predicted bounding box and the ground truth bounding box to the area of union of both bounding boxes. Mathematically,

$$IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}} \tag{1}$$

Recall and Precision are found from this IoU value for a given threshold. For example, if the IoU threshold is set to 0.7 and a prediction achieves an IoU of 0.8, which is greater than or equal to the threshold, the prediction is classified as a True Positive (TP); otherwise it is classified as a False Positive (FP).

Figure 2: Marking a predicted box as True Positive or False Positive based on IoU value

Precision of an object detection model is the ratio of the total number of True Positives to the total number of True Positives and False Positives together. Mathematically,

$$Precision = \frac{TP}{TP + FP} \tag{2}$$

where TP is the number of true positives (predicted as positive and correct) and FP is the number of false positives (predicted as positive but incorrect).

Recall measures the fraction of relevant objects that were predicted by the object detection model. It is the ratio of the total number of True Positives to the total number of True Positives and False Negatives together. Mathematically,

$$Recall = \frac{TP}{TP + FN} \tag{3}$$

where TP is the number of true positives (predicted as positive and correct) and FN is the number of false negatives (the model failed to predict an object that is present).

The Average Precision (AP) is the area under the Precision vs Recall curve [14]. Mean Average Precision (mAP) is the mean of the AP values over the object classes.

4.2. Analysis of Average Precision (AP) and GPU time to process one frame (in ms)

Table 1
Comparison of mAP Score and GPU Time of different object detection models

Model    mAP Score    GPU time (in ms)
RCNN     33.325       874
SSD      20.58        86.8
YOLO     33.1         394

Figure 3: Graph of GPU time vs Average Precision for RCNN Model

Figure 4: Graph of GPU time vs Average Precision for SSD Model

Figure 5: Graph of GPU time vs Average Precision for YOLO Model

5. Conclusion
A comparative study of various DL-based object detectors that use Artificial Neural Networks to classify and localize visual data was conducted. The detectors were tested on a custom dataset of commonly used weapons.
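
The comparison in Table 1 rests on the IoU, precision and recall definitions of Section 4.1 and on the GPU time per frame. The following Python sketch is an illustration only, not the evaluation code used in this study. The corner-based box layout `(x_min, y_min, x_max, y_max)`, the function names and the sample boxes are assumptions introduced here, and each prediction is assumed to be already paired with one ground-truth box. It classifies predictions as TP or FP at the 0.7 threshold from the example in Section 4.1, derives precision and recall from those counts, and converts the GPU times of Table 1 into frames per second.

```python
from typing import Dict, List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(box_a: Box, box_b: Box) -> float:
    """Area of overlap divided by area of union of two axis-aligned boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero so disjoint boxes overlap by 0.
    overlap = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0


def precision_recall(
    iou_values: List[float], num_ground_truth: int, threshold: float = 0.7
) -> Tuple[float, float]:
    """Precision and recall from per-prediction IoU values.

    Assumption: every prediction is paired with a distinct ground-truth
    object, so FN = num_ground_truth - TP.
    """
    tp = sum(1 for value in iou_values if value >= threshold)
    fp = len(iou_values) - tp
    fn = num_ground_truth - tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def frames_per_second(gpu_time_ms: float) -> float:
    """Frames processed per second for a given per-frame GPU time in ms."""
    return 1000.0 / gpu_time_ms


# Values copied from Table 1: model -> (mAP score, GPU time in ms).
TABLE_1: Dict[str, Tuple[float, float]] = {
    "RCNN": (33.325, 874.0),
    "SSD": (20.58, 86.8),
    "YOLO": (33.1, 394.0),
}

if __name__ == "__main__":
    # Made-up (predicted, ground-truth) pairs, for illustration only.
    pairs = [
        ((0, 0, 10, 10), (0, 0, 10, 8)),    # strong overlap
        ((0, 0, 10, 10), (5, 5, 15, 15)),   # weak overlap
    ]
    ious = [iou(pred, truth) for pred, truth in pairs]
    print("IoU values:", ious)
    print("precision, recall:", precision_recall(ious, num_ground_truth=3))
    for model, (map_score, time_ms) in TABLE_1.items():
        print(model, map_score, round(frames_per_second(time_ms), 2), "fps")
```

Under these assumptions, a per-frame time of 86.8 ms corresponds to roughly 11.5 frames per second, while 874 ms corresponds to roughly 1.1 frames per second.
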
SSD has the lowest mAP score, 20, of all the DL-based object detection frameworks, but its computational time to detect an object was the shortest. It can therefore be a better choice when a fast object detector is needed at the cost of accuracy. YOLO and RCNN provided similar mAP scores of 33 and 34.2 respectively, which give better detection accuracy. However, the trained YOLO model is considerably faster than the RCNN model, making it an efficient and reliable object detector.

Figure 6: Assault rifle AK47 detected in a captured video

6. Future Scope
1. This research paper is limited to weapons like the AK47, shotguns, 9mm hand pistols, combat knives and hand grenades, which are some of the weapons most commonly used by criminals. But there is still a large number of dangerous and illegal weapons whose possession by any individual can cause a problem. So our target will be to add more and more of these labelled datasets to our training set to ensure maximum accuracy.
2. We can extend this project to a collision detection system in which all sorts of vehicle accidents can be traced and reported to the nearby police station to prevent any further damage.
3. Another application can be to use the system to detect large amounts of blood in an area and alert a nearby hospital to avoid any fatality.
4. Another possible application could be the detection of fire at any place, which upon detection can be reported directly to a fire department, ensuring that there is minimum damage around that area.
5. The system can also be used to monitor traffic: the cameras would detect vehicles breaking any law and report them to a traffic control department, helping it to resolve traffic issues.
6. One of the limitations of our project is that there is currently no solution for detecting a weapon that a criminal has hidden in a pocket or a suitcase. We are thinking of a way to overcome this problem and build a better and safer environment for citizens.

7. Acknowledgement
The Graphics Processing Unit (GTX 1050 Ti) used in this research was provided by the Computer Science & Engineering Department, Bhagwan Parshuram Institute of Technology, Delhi, India.

8. References
[1] Dasiopoulou, Stamatia, et al. "Knowledge-assisted semantic video object detection." IEEE Transactions on Circuits and Systems for Video Technology 15.10 (2005): 1210–1224.
[2] Girshick, Ross (2014). "Rich feature hierarchies for accurate object detection and semantic segmentation". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE: 580–587. arXiv:1311.2524. doi:10.1109/CVPR.2014.81. ISBN 978-1-4799-5118-5. S2CID 215827080.
[3] Liu, Wei (October 2016). "SSD: Single shot multibox detector". Computer Vision – ECCV 2016. European Conference on Computer Vision. Lecture Notes in Computer Science. 9905. pp. 21–37. arXiv:1512.02325. doi:10.1007/978-3-319-46448-0_2. ISBN 978-3-319-46447-3. S2CID 2141740.
[4] Redmon, Joseph (2016). "You only look once: Unified, real-time object detection". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. arXiv:1506.02640. Bibcode:2015arXiv150602640R.
[5] Redmon, Joseph (2017). "YOLO9000: better, faster, stronger". arXiv:1612.08242 [cs.CV].
[6] Redmon, Joseph (2018). "Yolov3: An incremental improvement". arXiv:1804.02767 [cs.CV].
[7] Zhang, Shifeng (2018). "Single-Shot Refinement Neural Network for Object Detection". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 4203–4212. arXiv:1711.06897. Bibcode:2017arXiv171106897Z.
[8] "26/11 Mumbai Terror Attacks Aftermath: Security Audits Carried Out On 227 Non-Major Seaports Till Date". NDTV. Press Trust of India. 26 November 2017. Retrieved 7 December 2017.
[9] S. A. Velastin, B. A. Boghossian, M. A. Vicencio-Silva, A motion-based image processing system for detecting potentially dangerous situations in underground railway stations, Transportation Research Part C: Emerging Technologies 14 (2) (2006) 96–113.
[10] T. Ainsworth, Buyer beware, Security Oz 19 (2002) 18–26.
[11] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, 2001.
[12] Viola, Jones: Robust Real-time Object Detection, IJCV 2001 (pages 1,3).
[13] Hughes, G. (1968). On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory, 14(1), 55-63.
[14] Buckland, M., & Gey, F. (1994). The relationship between recall and precision. Journal of the American Society for Information Science, 45(1), 12-19.
[15] Ahmad Salihi Ben Musa, Sanjay Kumar Singh, Prateek Agrawal, "Suspicious Human Activity Recognition for Video Surveillance System", International Conference on Control, Instrumentation, Communication & Computational Technologies ICCICCT-2014, IEEEXplore.
[16] Nivid Limbasiya, Prateek Agrawal, "Bidirectional Long Short Term Memory Based Spatio-Temporal in Community Question Answering", A book on Deep learning based approaches for sentiment analysis, pp. 291-310, Jan 2020, Springer.
[17] Prateek Agrawal, Deepak Chaudhary, Vishu Madaan, Anatoliy Zabrovskiy, Radu Prodan, Dragi Kimovski, Christian Timmerer, "Automated Bank Cheque Verification Using Image Processing and Deep Learning Methods", Multimedia Tools and Applications (MTAP), 80(1), pp. 1-32. https://doi.org/10.1007/s11042-020-09818-1.
[18] Neha Bhadwal, Prateek Agrawal, Vishu Madaan, Awadhesh Shukla, Anuj Kakran, "Smart Border Surveillance System using Wireless Sensor Network and Computer Vision", International Conference on Automation, Computational and Technology Management (ICACTM'19), pp. 183-190, IEEEXplore.