<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Ital-IA</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An integrated intelligent surveillance system for Industrial areas</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesco Camastra</string-name>
          <email>francesco.camastra@uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Ciaramella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Casolaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0009-0003-0680-4501</contrib-id>
          <string-name>Pasquale De Trino</string-name>
          <email>pasquale.detrino001@studenti.uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-4883-0164</contrib-id>
          <string-name>Alessio Ferone</string-name>
          <email>alessio.ferone@uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Hauber</string-name>
          <email>giovanni.hauber@studenti.uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gennaro Iannuzzo</string-name>
          <email>gennaro.iannuzzo001@studenti.uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Mariano Scarrica</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Junior Spoleto</string-name>
          <email>antoniojunior.spoleto001@studenti.uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonino Staiano</string-name>
          <email>antonino.staiano@uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Concetta Vitale</string-name>
          <email>mariaconcetta.vitale001@studenti.uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Plate Detection, Face Detection, Fall Detection, Parking Detection</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Science and Technology, Parthenope University of Naples, Centro Direzionale Isola C4</institution>
          ,
          <addr-line>Naples, 80143</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>4</volume>
      <fpage>29</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>This paper presents the design and implementation phases of a software prototype developed by the Parthenope University of Naples for the SE4I project (Smart Energy Efficiency &amp; Environment for Industry), funded by the "Progetti di ricerca industriale e lo Sviluppo sperimentale" programme (PNR 2015-2020). The prototype leverages advanced computer vision techniques based on deep learning architectures to address industrial security and monitoring needs. Specifically, the prototype tackles three key functionalities: (1) personnel and vehicle identification: the system recognizes authorized personnel and vehicle license plates within video streams captured in restricted industrial areas; (2) anomaly detection: the software can detect various anomalies in video feeds, including falls of personnel in monitored zones and unattended objects left in unauthorized areas; (3) smart parking management: the prototype identifies vacant parking spaces within camera-monitored zones, enabling efficient parking management. These functionalities are integrated into the software prototype, and its performance has been thoroughly evaluated.</p>
      </abstract>
      <kwd-group>
        <kwd>Plate Detection</kwd>
        <kwd>Face Detection</kwd>
        <kwd>Fall Detection</kwd>
        <kwd>Parking Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The SE4I project addresses the security and monitoring needs of a designated industrial area by implementing a real-time video monitoring system. This system uses strategically placed smart poles equipped with RGB cameras to capture video streams. The project focuses on three key functionalities (see Fig. 1): (a) authorized access control: the system will recognize individuals and vehicle license plates, ensuring that only authorized personnel and vehicles can enter the area, likely through controlled access points; (b) anomaly detection: the system will detect anomalous events in the video feeds, such as falls of personnel in the monitored zones; (c) smart parking management: due to the size of the area, automated and intelligent parking lot monitoring is crucial, and the system will inform users about free parking spaces as they approach the designated parking areas. In the context of performing learning tasks from video streams in surveillance and security applications, the state of the art is represented by computer vision techniques based on the use of deep learning
        [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. In the subsequent sections, we discuss the proposed solutions for each of the aforementioned tasks.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Plate Detection</title>
      <p>The objective is to recognize vehicles and their license plates from a surveillance video feed and retrieve the associated alphanumeric sequence. This sequence is subsequently utilized for additional vehicle recognition within the system.</p>
      <sec id="sec-2-1">
        <title>2.1. Challenges</title>
        <p>Automatic license plate recognition faces several hurdles, specifically: (a) variable lighting: extreme brightness, low light, and shadows can significantly reduce plate visibility, and the system addresses this with techniques such as adaptive thresholding and contrast enhancement; (b) car position: vehicles approach cameras at various angles and distances, so sophisticated algorithms are required for accurate plate localization and perspective correction to account for these variations; (c) occlusions: objects such as bumpers, dirt, or even other vehicles can partially or fully obscure the plate, and robust object detection and diverse training data are crucial to overcome these occlusions; (d) font diversity: license plate formats and fonts differ significantly across countries and even regions, so training models on a wide variety of datasets is essential for generalization across different plate styles.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Methods</title>
        <p>
          The proposed approach consists of three key steps: vehicle detection, license plate (LP) detection, and optical character recognition (OCR), as shown in Figure 2. In the first step, the system detects vehicles in the scene using a dedicated module. To balance computation time and performance, we chose YoloV4 [4]; for the classification problem, we treated the network as a closed system, consolidating the outputs specifically related to vehicles such as cars, buses, and motorcycles, while ignoring outputs related to other classes. Within each detected vehicle region, the Warped Planar Object Detection Network (WPOD-NET)
          [<xref ref-type="bibr" rid="ref3">3</xref>] is used to search for license plates. WPOD-NET performs affine transformations to rectify the LP area to resemble a frontal view; its design, responsible for warping the license plate into a rectangular shape, was influenced by insights from YOLO, SSD [5], and Spatial Transformer Networks (STN) [6]. These detections are then passed to an OCR network for accurate character recognition and extraction. In our OCR module, we used Tesseract [7], an optical character recognition engine fine-tuned on our license plate character dataset. Tesseract's advantage over a simple CNN lies in its recurrent neural network (RNN) architecture [8], which takes into account the sequential nature of the characters on a license plate; this allows for accurate recognition, as the RNN captures the contextual dependencies between the characters. Tesseract's extensive training on diverse datasets makes it robust to the different font styles, sizes, and noise levels commonly found in license plate images.
        </p>
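        <p>As an illustration of this three-step flow, a minimal sketch in Python is given below; the detect_vehicles and detect_plate_corners callables are hypothetical placeholders for the YoloV4 and WPOD-NET modules, and only the OCR step relies on an actual library (the pytesseract bindings), so the snippet should be read as a simplified outline rather than the deployed implementation.</p>
        <preformat>
# Sketch of the vehicle -> license plate -> OCR chain described above.
# detect_vehicles() and detect_plate_corners() are hypothetical stand-ins
# for the YoloV4 and WPOD-NET modules.
import cv2
import numpy as np
import pytesseract

def rectify_plate(frame, corners, out_w=240, out_h=80):
    """Warp the quadrilateral plate region to a frontal, rectangular view."""
    src = np.float32(corners)                                  # 4 corner points (x, y)
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    M = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(frame, M, (out_w, out_h))

def read_plates(frame, detect_vehicles, detect_plate_corners):
    """Return the plate strings found in one video frame."""
    plates = []
    for (x, y, w, h) in detect_vehicles(frame):                # cars, buses, motorcycles
        roi = frame[y:y + h, x:x + w]
        corners = detect_plate_corners(roi)                    # plate quadrilateral, if any
        if corners is None:
            continue
        gray = cv2.cvtColor(rectify_plate(roi, corners), cv2.COLOR_BGR2GRAY)
        # Single text line with a restricted alphabet, as typical for plates.
        config = "--psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
        plates.append(pytesseract.image_to_string(gray, config=config).strip())
    return plates
        </preformat>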
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Execution</title>
        <p>The modules are designed for real-time execution in an embedded system environment, given the strict time constraints imposed by vehicle identification. Fig. 3 illustrates the time required for the execution steps.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Face Recognition</title>
      <p>Our research team has developed a framework for secure
access control in the industrial area as a smart city
environment. The framework leverages surveillance cameras
positioned at entry points to industrial areas. It aims to
match detected faces with the license plates of
associated vehicles. This ensures that the driver corresponds
to the registered vehicle owner, improving access control
security.</p>
      <sec id="sec-3-1">
        <title>3.1. Challenges</title>
        <p>The framework faces different issues, particularly with on-board processing. In addition, reflective surfaces and occlusions caused by sunlight on vehicle windshields can hinder facial recognition.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Methods</title>
        <p>The face recognition process is divided into two main steps, namely face localization and face classification. In the localization step, faces are accurately localized within an image by the Multi-task Cascaded Convolutional Networks (MTCNN) algorithm [9]. Its unique cascade structure consists of three stages: the Proposal Network (P-Net), the Refinement Network (R-Net), and the Output Network (O-Net). By simultaneously performing multiple tasks, such as face detection, bounding box regression, and facial landmark localization, MTCNN ensures thorough and accurate face identification. In particular, it excels at detecting faces at different scales and orientations while maintaining impressive computational efficiency, making it ideal for real-time applications.</p>
        <p>For face classification, an extremely effective approach combines two different algorithms. The first performs face alignment with an ensemble of regression trees [10]: using the ensemble, the algorithm predicts the positions of facial landmarks directly from image data, bypassing traditional optimization methods. The second uses FaceNet [11], which efficiently maps a face into a continuous embedding space, i.e., it converts the face into a 128-feature embedding vector. This vector is then matched to a face in the database using a one-shot approach (see Fig. 4 for an example of a qualitative result on a test image).</p>
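        <p>A minimal sketch of the one-shot matching step is given below; it assumes the 128-dimensional FaceNet embeddings have already been computed, and the gallery layout and the distance threshold value are illustrative assumptions rather than the parameters of the deployed system.</p>
        <preformat>
# One-shot matching of a probe face embedding against a gallery of enrolled
# identities (one 128-d FaceNet embedding per authorized person).
# The 0.9 distance threshold is an illustrative value only.
import numpy as np

def match_face(probe_emb, gallery, threshold=0.9):
    """gallery: dict mapping person id -> 128-d embedding (np.ndarray)."""
    # Nearest enrolled identity in the embedding space (Euclidean distance).
    best_id = min(gallery, key=lambda pid: np.linalg.norm(probe_emb - gallery[pid]))
    best_dist = np.linalg.norm(probe_emb - gallery[best_id])
    # Accept the match only if the closest gallery face is near enough.
    return best_id if threshold > best_dist else None
        </preformat>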
      </sec>
    </sec>
    <sec id="sec-3a">
      <title>4. Anomaly Detection</title>
      <sec id="sec-3a-1">
        <title>4.1. Challenges</title>
        <p>Anomaly recognition in video streams presents a significant challenge: identifying rare and short-lived events that deviate from the norm. These anomalies often occur for just a few seconds, making them difficult for humans to detect and nearly impossible to capture in a single, universal model. The vast number of possible anomaly types, locations, and contexts makes defining a comprehensive model impractical; it would require an enormous amount of data and manual effort. A more effective approach is to train models that can differentiate between normal and abnormal activity, regardless of the specific anomaly type. This approach leverages the fact that normal behavior typically occurs far more frequently than anomalies.</p>
      </sec>
      <sec id="sec-3a-2">
        <title>4.2. Methods</title>
        <p>Following autoencoder-based approaches to video anomaly detection [12, 13], the module learns a representation of normal video data, so that anomalous events deviate significantly from the learned representation. The trained model compares the reconstructed video frames with the original input. For normal events, the reconstructed frames closely resemble the originals, with minimal differences in pixel values. However, when anomalies occur, the network's reconstruction becomes less accurate, which is reflected in blurry or distorted frames compared to the originals. By analyzing this reconstruction error (the difference between original and reconstructed frames), one can identify anomalies in real time. Figure 5 shows an example of anomaly detection in a parking lot.</p>
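        <p>A minimal sketch of the reconstruction-error criterion is given below; the model is any trained autoencoder mapping frames to reconstructed frames, and the threshold value is an illustrative assumption rather than the one used in the project.</p>
        <preformat>
# Per-frame anomaly score as the reconstruction error of a trained autoencoder.
import numpy as np

def anomaly_scores(frames, model):
    """frames: array of shape (N, H, W, C), normalized to [0, 1]."""
    reconstructed = model(frames)
    # Mean squared error per frame: large errors correspond to blurry or
    # distorted reconstructions, i.e. events not learned as "normal".
    err = (frames - reconstructed) ** 2
    return err.reshape(err.shape[0], -1).mean(axis=1)

def flag_anomalies(frames, model, threshold=0.01):
    return anomaly_scores(frames, model) > threshold   # boolean mask of anomalous frames
        </preformat>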
      </sec>
      <sec id="sec-3a-3">
        <title>4.3. Fall Detection</title>
        <p>Falls, especially outdoors where help might be delayed, can lead to serious injuries. Traditional fall detection systems often rely on wearable sensors, which can be inconvenient or impractical. The SE4I project addresses this challenge with a camera-based fall detection system using an LSTM autoencoder. This system leverages anomaly detection techniques within a computer vision framework: it essentially learns what "normal" movement looks like and identifies deviations from this norm as potential falls. This approach offers several advantages, specifically no need for wearable sensors, camera-based detection that works with the existing surveillance infrastructure, and real-time alerts enabling faster response times.</p>
        <sec id="sec-3a-3-1">
          <title>4.3.1. Challenges</title>
          <p>Traditional fall detection systems typically rely on wearable sensors or specialized depth cameras. These methods can be intrusive to users and costly to deploy on a large scale. On the other hand, relying solely on human observation through video footage is an option; however, this approach is labor-intensive and requires continuous monitoring.</p>
        </sec>
        <sec id="sec-3a-3-2">
          <title>4.3.2. Methods</title>
          <p>Our method addresses the previous issues by enabling fall detection with a simpler setup, a standard RGB camera, eliminating the need for specialized equipment, and with AI-powered detection that uses a single AI module running on a GPU to analyze the video stream for instances of falls. This approach eliminates the need for wearable devices and reduces the reliance on human intervention.</p>
          <p>Detection is to be performed in pedestrian areas and parks, so a dataset was created to fit this particular environment and to train the model on data representing the final context. The training dataset is illustrative of all the normal poses that people take while walking in places like pedestrian areas and parks. The scenes were therefore captured with a fixed camera about 3 meters above the ground, pointing across an open space and covering walking, standing, and running events in all directions, with or without obstacles/occlusions. The idea of training a model with only "normal events" is important because, in nature, abnormal events (falls) are very rare and therefore expensive to acquire.</p>
          <p>The data preprocessing pipeline uses the OpenPose framework [14] to extract skeleton keypoints from each frame. From each skeleton, irrelevant keypoints are removed as they are considered noisy, and skeletons with significant missing keypoints are filtered out. Keypoint coordinates are then normalized using min/max normalization and discretized into coarser bins to provide numerical stability for the training phase. Finally, the data are shaped into time windows using sequences of 75 skeletal frames; such windows form the basic unit on which the AI model operates. Since the video stream is supposed to be captured at 25 FPS, working with 75-frame windows means analyzing human behavior over 3-second actions. An overlap of 25 frames between consecutive windows is also included to maintain continuity between the windows themselves.</p>
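          <p>The windowing step can be sketched as follows (assuming the keypoints have already been filtered, normalized, and discretized); note that a 25-frame overlap between 75-frame windows corresponds to advancing the window start by 50 frames.</p>
          <preformat>
# Shape per-frame skeleton keypoints into overlapping time windows:
# 75 frames per window (3 s at 25 FPS), 25 shared frames between
# consecutive windows, i.e. a stride of 50 frames.
import numpy as np

def make_windows(keypoints, window=75, overlap=25):
    """keypoints: array of shape (num_frames, num_features)."""
    stride = window - overlap
    windows, start = [], 0
    while len(keypoints) >= start + window:
        windows.append(keypoints[start:start + window])
        start += stride
    return np.stack(windows) if windows else np.empty((0, window, keypoints.shape[1]))
          </preformat>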
          <p>The model is based on an LSTM autoencoder [15, 16]. The execution time when running on a consumer GPU allows for real-time performance; Figure 6 reports the execution times of the pipeline components. Once the model has learned normal human behavior patterns, it can be used to reconstruct time windows. Reconstruction and input data are then compared: if the reconstruction error exceeds a certain threshold, and thus deviates significantly from normal data, the input is flagged as a fall event. Overall, the results highlight the effectiveness of using learned temporal skeletal patterns for robust anomaly detection in the context of outdoor fall detection.</p>
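          <p>A minimal PyTorch sketch of such an LSTM autoencoder is given below; the hidden size, the single-layer design, and the feature dimension are illustrative assumptions and not the configuration used in the project.</p>
          <preformat>
# Minimal LSTM autoencoder over skeleton-keypoint windows (PyTorch sketch).
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features, hidden_size=64):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.output = nn.Linear(hidden_size, n_features)

    def forward(self, x):                    # x: (batch, 75, n_features)
        _, (h, _) = self.encoder(x)          # summary of the whole window
        # Repeat the window summary at every time step and decode it back.
        z = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)
        decoded, _ = self.decoder(z)
        return self.output(decoded)          # reconstructed window

# Training minimizes nn.MSELoss() between windows and their reconstructions on
# "normal" data only; at inference, a window whose reconstruction error exceeds
# the chosen threshold is flagged as a fall.
model = LSTMAutoencoder(n_features=26)       # e.g. 13 keypoints x 2 coordinates
          </preformat>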
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Parking Detection</title>
      <p>Parking detection for SE4I requires the development of an automatic system that searches for free parking spaces in one of the parking areas within the industrial area and provides information to drivers who have requested a parking space. The Parking Guidance and Information (PGI) system [17] has been adopted as a solution for the parking detection task using a monitoring system. The proposed PGI system consists of two main parts. The former is based on a deep learning instance segmentation model that detects all available free spaces in a parking lot. The latter is a client-server architecture that automatically guides drivers to the closest parking lot with the highest number of available spaces.</p>
      <sec id="sec-4-1">
        <title>5.1. Challenges</title>
        <p>Parking lot detection systems using video surveillance face several difficulties: (a) the impact of weather, e.g., low visibility caused by fog, rain, and snow can significantly decrease the accuracy of these systems, and harsh weather conditions can obscure parking lot boundaries in the video feed; (b) diverse parking lot data, i.e., training robust parking detection models requires a large dataset with a wide variety of scenarios, including variations in parking space layouts, weather conditions, camera angles, obstructions, parking lot types (e.g., open-air, multi-story), and lighting conditions (day/night); (c) real-time processing, that is, for practical applications the system needs to operate in real time, which necessitates developing a lightweight parking detection model that can run efficiently on the available hardware.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Methods</title>
        <p>This work conceived a model for parking lot detection using an instance segmentation approach. Yolact++ [18], an extension of Yolact [19], was trained with successful results on a novel dataset appropriately designed for this task. The dataset consists of 1395 images and 23600 manually annotated parking lots, and it was built using a web-scraping approach. The images, taken from public-access cameras, were selected to represent a variety of conditions (weather and lighting) and features (different camera angles, occlusions, shadows, presence of people or animals, camera heights, satellite imagery in 2D and 3D, different types of lines and colors, and different backgrounds).</p>
        <p>Parking lot detection and car detection are performed simultaneously to classify parking lots as occupied or free on the basis of the IoU between the parking lot masks and the car masks detected by the Yolact++ module. For IoU values greater than a given threshold, the system classifies a parking lot as busy; otherwise, it is classified as free.</p>
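        <p>A minimal sketch of this classification rule is given below; the masks are assumed to be boolean arrays produced by the instance segmentation module, and the threshold value is illustrative rather than the one used in the project.</p>
        <preformat>
# Classify each detected parking space as busy or free from the IoU between
# its mask and the detected car masks (boolean arrays over the image grid).
import numpy as np

def mask_iou(mask_a, mask_b):
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def classify_lots(lot_masks, car_masks, iou_threshold=0.3):
    status = []
    for lot in lot_masks:
        best_iou = max((mask_iou(lot, car) for car in car_masks), default=0.0)
        status.append("busy" if best_iou > iou_threshold else "free")
    return status
        </preformat>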
        <p>The Yolact++ architecture is based on the RetinaNet architecture [20], using pre-trained ResNet-101 stages. In addition, Yolact++ introduces three improvements over the base model: a Fast Mask Re-Scoring Network stage, Deformable Convolutions with Intervals, and an Optimized Prediction Head. The selection of the Yolact++ architecture for the parking lot detection problem was motivated by the runtime requirements and the accuracy achieved by this instance segmentation model.</p>
        <p>A client-server system called PGI has been developed. The clients include the drivers, the administrators, and the machine learning systems. Drivers can search for parking lots, while administrators can add, remove, and monitor parking lots. System operations are performed on the server side, which is built using PHP and MySQL for database storage. Clients connect to the server through a server interface using a Java Android app. The app provides various functionalities, such as guiding drivers to the nearest parking lot with available spaces using GPS, and monitoring areas using the Google StreetView API. The system presents favorable results, with low loss values and an acceptable mAP for both the box and the mask, determined using a 0.5 IoU threshold (see Table 1 and Fig. 7).</p>
        <p>Table 1 reports the training metrics (box localization loss, class confidence loss, mask loss, semantic segmentation loss, and total loss) together with mAP@0.50 for boxes and masks. Figure 7 shows visual results on test images: masks are applied to the lots, each with a different color to better distinguish the individual instances, and the associated probability score is printed on each mask.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Integration and Infrastructure</title>
      <p>The five intelligent modules of the SE4I project (plate detection, face detection, anomaly detection, fall detection, and parking detection) are part of a larger system powered by a peer-to-peer network of NVIDIA Jetson Xavier devices mounted on multifunctional light poles. This setup ensures efficient and real-time processing of the data collected by the surveillance cameras, as the computation is performed in the field and each device shares data and JSON output with devices on other poles using a ZMQ publisher/subscriber pattern. A dedicated module manages the surveillance camera stream and the associated metadata, such as brightness, frame rate, and contrast. All modules are containerized as Docker solutions, allowing for flexible portability, easy installation, resilience, and scalable performance. The whole system is based on the Python and C++ programming languages, and the PyTorch, OpenCV, OpenPose, and ONVIF libraries are used. This infrastructure guarantees the real-time requirements and the privacy of the video-monitored areas.</p>
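      <p>A minimal sketch of the publish/subscribe exchange between pole devices is given below (using the pyzmq bindings); the port number, the topic name, and the peer addresses are illustrative assumptions rather than the project's configuration.</p>
      <preformat>
# Minimal ZeroMQ publisher/subscriber sketch for sharing JSON results
# between pole devices; port, topic, and peer addresses are placeholders.
import json
import zmq

def make_publisher(port=5556):
    sock = zmq.Context.instance().socket(zmq.PUB)
    sock.bind("tcp://*:%d" % port)
    return sock

def publish_result(sock, topic, payload):
    # Multipart message: a topic frame for SUB-side filtering, then the JSON body.
    sock.send_multipart([topic.encode(), json.dumps(payload).encode()])

def make_subscriber(peer_addresses, topic=""):
    sock = zmq.Context.instance().socket(zmq.SUB)
    for addr in peer_addresses:               # e.g. "tcp://pole-2:5556"
        sock.connect(addr)
    sock.setsockopt(zmq.SUBSCRIBE, topic.encode())
    return sock
      </preformat>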
    </sec>
  </body>
  <back>
    <ack>
      <title>Acknowledgments</title>
      <p>This research was conducted as part of the Smart Energy Efficiency &amp; Environment for Industry (SE4I) project, CUP 6618000230005, funded by "Progetti di ricerca industriale e lo Sviluppo sperimentale nelle 12 aree di specializzazione individuate nel PNR 2015-2020, di cui al D.D. del 13 luglio 2017 n. 1735".</p>
    </ack>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maratea</surname>
          </string-name>
          ,
          <article-title>Adaptive quick reduct for feature drift detection</article-title>
          ,
          <source>Algorithms</source>
          <volume>14</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Maratea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferone</surname>
          </string-name>
          ,
          <article-title>Deep neural networks and explainable machine learning</article-title>
          ,
          <source>in: WILF</source>
          <year>2018</year>
          , volume
          <volume>11291</volume>
          LNAI,
          <year>2019</year>
          , p.
          <fpage>253</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Montazzolli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <article-title>License plate detection and recognition in unconstrained scenarios</article-title>
          ,
          <source>in: ECCV 2018</source>
          , Springer Intl. Pub.,
          <year>2018</year>
          , pp.
          <fpage>593</fpage>
          -
          <lpage>609</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Bochkovskiy, C. Wang, H. M. Liao, Yolov4: Optimal speed and accuracy of object detection, CoRR abs/2004.10934 (2020). arXiv:2004.10934.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, SSD: Single shot multibox detector, in: ECCV 2016, Springer Intl. Pub., 2016, pp. 21–37.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu, Spatial transformer networks, CoRR abs/1506.02025 (2015).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] R. Smith, An overview of the Tesseract OCR engine, in: ICDAR 2007, volume 2, 2007, pp. 629–633.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded convolutional networks, IEEE Signal Processing Letters 23 (2016) 1499–1503.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A unified embedding for face recognition and clustering, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, L. S. Davis, Learning temporal regularity in video sequences, 2016. arXiv:1604.04574.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y. S. Chong, Y. H. Tay, Abnormal event detection in videos using spatiotemporal autoencoder, 2017. arXiv:1701.01546.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, Y. Sheikh, OpenPose: Realtime multi-person 2D pose estimation using part affinity fields, 2019. doi:10.1109/TPAMI.2019.2929257. arXiv:1812.08008.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. A. Kramer, Nonlinear principal component analysis using autoassociative neural networks, AIChE Journal 37 (1991) 233–243.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] D. Acharya, W. Yan, K. Khoshelham, Real-time image-based parking occupancy detection using deep learning, Research@Locate 4 (2018) 33–40.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] C. Zhou, Yolact++: Better Real-Time Instance Segmentation, University of California, Davis, 2020.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] D. Bolya, C. Zhou, F. Xiao, Y. J. Lee, Yolact: Real-time instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9157–9166.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] T. Lin, P. Goyal, R. B. Girshick, K. He, P. Dollár, Focal loss for dense object detection, CoRR abs/1708.02002 (2017). URL: http://arxiv.org/abs/1708.02002. arXiv:1708.02002.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>