=Paper=
{{Paper
|id=Vol-3762/526
|storemode=property
|title=An integrated intelligent surveillance system for Industrial areas
|pdfUrl=https://ceur-ws.org/Vol-3762/526.pdf
|volume=Vol-3762
|authors=Francesco Camastra,Angelo Ciaramella,Angelo Casolaro,Pasquale De Trino,Alessio Ferone,Giovanni Hauber,Gennaro Iannuzzo,Vincenzo Mariano Scarrica,Antonio Junior Spoleto,Antonino Staiano,Maria Concetta Vitale
|dblpUrl=https://dblp.org/rec/conf/ital-ia/CamastraCCTFHIS24
}}
==An integrated intelligent surveillance system for Industrial areas==
Francesco Camastra, Angelo Ciaramella, Angelo Casolaro, Pasquale De Trino, Alessio Ferone,
Giovanni Hauber, Gennaro Iannuzzo, Vincenzo Mariano Scarrica, Antonio Junior Spoleto,
Antonino Staiano∗ and Maria Concetta Vitale
Department of Science and Technology, Parthenope University of Naples, Centro Direzionale Isola C4, Naples, 80143, Italy
Abstract
This paper presents the design and implementation phases of a software prototype developed by Parthenope University of Naples for the SE4I project (Smart Energy Efficiency & Environment for Industry), funded by the "Progetti di ricerca industriale e lo Sviluppo sperimentale" programme (PNR 2015-2020). The prototype leverages advanced computer vision techniques based on deep learning architectures to address industrial security and monitoring needs. Specifically, the prototype provides three key functionalities: (1) personnel and vehicle identification: the system recognizes authorized personnel and vehicle license plates within video streams captured in restricted industrial areas; (2) anomaly detection: the software detects various anomalies in video feeds, including falls of personnel in monitored zones and unattended objects left in unauthorized areas; (3) smart parking management: the prototype identifies vacant parking spaces within camera-monitored zones, enabling efficient parking management. These functionalities are integrated into the software prototype, and its performance has been thoroughly evaluated.
Keywords
Plate Detection, Face Detection, Fall Detection, Parking Detection
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author.
francesco.camastra@uniparthenope.it (F. Camastra); angelo.ciaramella@uniparthenope.it (A. Ciaramella); angelo.casolaro001@studenti.uniparthenope.it (A. Casolaro); pasquale.detrino001@studenti.uniparthenope.it (P. De Trino); alessio.ferone@uniparthenope.it (A. Ferone); giovanni.hauber@studenti.uniparthenope.it (G. Hauber); gennaro.iannuzzo001@studenti.uniparthenope.it (G. Iannuzzo); vincenzomariano.scarrica001@studenti.uniparthenope.it (V. M. Scarrica); antoniojunior.spoleto001@studenti.uniparthenope.it (A. J. Spoleto); antonino.staiano@uniparthenope.it (A. Staiano); mariaconcetta.vitale001@studenti.uniparthenope.it (M. C. Vitale)
ORCID: 0000-0003-4439-7583 (F. Camastra); 0000-0001-5592-7995 (A. Ciaramella); 0000-0002-7577-6765 (A. Casolaro); 0009-0003-0680-4501 (P. De Trino); 0000-0002-4883-0164 (A. Ferone); 0009-0007-0137-3182 (G. Hauber); 0009-0003-5962-8302 (G. Iannuzzo); 0009-0008-4640-2693 (V. M. Scarrica); 0009-0007-4037-7821 (A. J. Spoleto); 0000-0002-4708-5860 (A. Staiano); 0000-0002-5538-9952 (M. C. Vitale)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

The SE4I project aims to improve safety within a designated industrial area by implementing a real-time video monitoring system. This system uses strategically placed smart poles equipped with RGB cameras to capture video streams. The project focuses on three key functionalities (see Fig. 1): (a) authorized access control: the system will recognize individuals and vehicle license plates, ensuring that only authorized personnel and vehicles can enter the area, likely through controlled access points with barriers. Upon arrival, an employee's car triggers the system. The camera mounted on the smart pole captures the RGB video stream of the scene, using AI to identify the license plate and the driver's face. Access is granted only after successful recognition: the combined recognition of license plate and driver verifies that both the vehicle and the driver are authorized; if recognition fails, access is denied; (b) anomaly detection: this use case focuses on detecting abnormal behavior or events in the video streams captured by the pole-mounted cameras. These anomalies range from environmental violations, such as illegal dumping of waste, to personnel safety issues, e.g., persons' falls and unattended objects left in restricted areas. An RGB video stream from a smart pole camera continuously feeds the scene, which may include both facility personnel and outsiders, to a hardware component equipped with AI modules. The intelligent module analyzes the video to identify unusual elements. Upon detection, an alert specifying the type and location of the event is sent to a central control station. This allows immediate assistance to personnel experiencing medical emergencies or accidents and swift intervention in case of suspicious activity; (c) smart parking management: this use case addresses the detection and management of parking availability within the vast industrial area. Due to its size, automated and intelligent parking lot monitoring is crucial. This system will inform users about free parking spaces as they approach designated parking areas. In the context of performing learning tasks from video streams in surveillance and security applications, the state of the art is represented by computer vision techniques based on deep learning [1, 2]. In the subsequent sections, we discuss the proposed solutions for each of the aforementioned tasks.

Figure 1: Project tasks: (a) people/vehicles recognition; (b) fall detection; (c) anomaly detection; (d) parking lot handling.

2. Plate Detection

The objective is to recognize vehicles and their license plates from a surveillance video feed and retrieve the associated alphanumeric sequence. This sequence is subsequently utilized for additional vehicle recognition within the system.

2.1. Challenges

Automatic license plate recognition faces several hurdles, specifically: (a) variable lighting: extreme brightness, low light, and shadows can significantly reduce plate visibility. The system addresses this with techniques like adaptive thresholding and contrast enhancement; (b) car position: vehicles approach cameras at various angles and distances. Sophisticated algorithms are required for accurate plate localization and perspective correction to account for these variations; (c) occlusions: objects like bumpers, dirt, or even other vehicles can partially or fully obscure the plate. Robust object detection and diverse training data are crucial to overcome these occlusions; (d) font diversity: license plate formats and fonts differ significantly across countries and even regions. Training models on a wide variety of datasets is essential for generalization across different plate styles.

2.2. Methods

The proposed approach consists of three key steps: vehicle detection, license plate (LP) detection, and optical character recognition (OCR), as shown in Figure 2. In the first step, the system detects vehicles in the scene using a dedicated module. To balance computation time and performance, we chose YoloV4 [4]. For the classification problem, we treated the network as a closed system, consolidating the outputs specifically related to vehicles such as cars, buses, and motorcycles, while ignoring outputs related to other classes. Within each detected vehicle region, the Warped Planar Object Detection Network (WPOD-NET) [3] is used to search for license plates. WPOD-NET performs affine transformations to rectify the LP area to resemble a frontal view; its design, responsible for warping the license plate into a rectangular shape, was influenced by insights from YOLO, SSD [5], and Spatial Transformer Networks (STN) [6]. These detections are then passed to an OCR network for accurate character recognition and extraction. Finally, in our OCR module, we used Tesseract [7], an optical character recognition engine fine-tuned on our license plate character dataset. Tesseract's advantage over a simple CNN lies in its recurrent neural network (RNN) architecture [8], which takes into account the sequential nature of the characters on a license plate. This allows for accurate recognition, as the RNN captures the contextual dependencies between characters. Tesseract's extensive training on diverse datasets makes it robust, handling different font styles, sizes, and noise levels commonly found in license plate images.

Figure 2: The proposed pipeline at work.

2.3. Execution

The modules are designed for real-time execution in an embedded system environment, given the strict time constraints imposed by vehicle identification. Fig. 3 illustrates the time required for the execution steps.

Figure 3: Execution time of the pipeline.
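The orchestration of the three stages can be sketched as follows. This is a minimal skeleton, not the project code: the three callables stand in for YoloV4, WPOD-NET, and Tesseract, and the class names only illustrate the closed-system filtering described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # x, y, w, h in pixel coordinates

# "Closed system": only these detector classes are treated as vehicles.
VEHICLE_CLASSES = {"car", "bus", "motorcycle"}

@dataclass
class PlateReading:
    vehicle_box: Box
    plate_box: Box
    text: str

def read_plates(frame,
                detect_objects: Callable[[object], List[Tuple[str, Box]]],
                detect_plate: Callable[[object, Box], Optional[Box]],
                ocr: Callable[[object, Box], str]) -> List[PlateReading]:
    """Vehicle detection -> LP detection -> OCR, one reading per vehicle."""
    readings = []
    for label, box in detect_objects(frame):
        if label not in VEHICLE_CLASSES:   # ignore non-vehicle classes
            continue
        plate = detect_plate(frame, box)   # WPOD-NET-style localization
        if plate is None:
            continue
        readings.append(PlateReading(box, plate, ocr(frame, plate)))
    return readings
```

Injecting the detectors as callables keeps each stage independently replaceable, mirroring the modular pipeline of Figure 2.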
3. Face Recognition
Our research team has developed a framework for secure
access control in the industrial area as a smart city envi-
ronment. The framework leverages surveillance cameras
positioned at entry points to industrial areas. It aims to
match detected faces with the license plates of associ-
ated vehicles. This ensures that the driver corresponds
to the registered vehicle owner, improving access control
security.
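The owner-verification step can be sketched as a nearest-neighbor search in a face-embedding space (a minimal illustration under assumed details: 128-dimensional embeddings, Euclidean distance, and an invented acceptance threshold; the paper does not specify the exact matching rule):

```python
from typing import Dict, Optional

import numpy as np

# Illustrative threshold: in practice it would be tuned on validation data.
MATCH_THRESHOLD = 0.9

def match_driver(probe: np.ndarray,
                 gallery: Dict[str, np.ndarray]) -> Optional[str]:
    """Return the identity of the closest enrolled embedding, or None
    if even the closest one is farther than the acceptance threshold."""
    best_id, best_dist = None, float("inf")
    for identity, emb in gallery.items():
        dist = float(np.linalg.norm(probe - emb))  # Euclidean distance
        if dist < best_dist:
            best_id, best_dist = identity, dist
    return best_id if best_dist <= MATCH_THRESHOLD else None
```

Access would then be granted only when the matched identity agrees with the registered owner of the recognized license plate.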
3.1. Challenges

The framework faces different issues, particularly with on-board processing. In addition, reflective surfaces and occlusions caused by sunlight on vehicle windshields can hinder facial recognition.

3.2. Methods

The face recognition process is divided into two main steps, namely, face localization and face classification. In the localization step, faces are accurately localized within an image by the Multi-task Cascaded Convolutional Networks (MTCNN) algorithm [9]. Its unique cascade structure consists of three stages: Proposal Network (P-Net), Refinement Network (R-Net), and Output Network (O-Net). By simultaneously performing multiple tasks such as face detection, bounding box regression, and facial landmark localization, MTCNN ensures a thorough and accurate face identification. In particular, it excels at detecting faces at different scales and orientations while maintaining impressive computational efficiency, making it ideal for real-time applications.
For face classification, an extremely effective approach combines two different algorithms. The first performs face alignment with an ensemble of regression trees [10]: using the ensemble, the algorithm predicts the positions of facial landmarks directly from image data, bypassing traditional optimization methods. The second uses FaceNet [11], which efficiently maps a face into a continuous embedding space, i.e., converts it into a 128-feature embedding vector. This vector is then matched to a face in the database using a one-shot approach (see Fig. 4 for an example of a qualitative result on a test image).

Figure 4: Visual results on a customized test.

4. Anomaly Detection

Anomaly detection involves the identification of unusual items, events, or observations that are significantly different from the norm or expected behavior, or that indicate unusual conditions. The activities conducted in the SE4I project focused on identifying waste dumping where access is prohibited, and on fall detection.

4.1. Challenges

Anomaly recognition in video streams presents a significant challenge: identifying rare and short-lived events that deviate from the norm. These anomalies often occur for just a few seconds, making them difficult for humans to detect and nearly impossible to capture in a single, universal model. The vast number of possible anomaly types, locations, and contexts makes defining a comprehensive model impractical: it would require an enormous amount of data and manual effort. A more effective approach is to train models that can differentiate between normal and abnormal activity, regardless of the specific anomaly type. This approach leverages the fact that normal behavior typically occurs far more frequently than anomalies.

4.2. Methods

We have used a reconstruction-based method, in which a model is trained to learn the normal patterns of the data so that it can reconstruct them when new frames are presented. The model is designed to extract the spatiotemporal structure of the video stream to accurately learn the pattern of normality for a scene without anomalies [12] (without the abnormal situation highlighted in red in Figure 5). During testing, the model provides information about the most anomalous areas of the video by comparing input frames with reconstructed frames. A relatively low score indicates a normal scene, while a high score indicates the presence of anomalies. The goal is to create an end-to-end model capable of learning the spatio-temporal patterns of the analyzed data and predicting when an event is anomalous compared to the learned normality. The model used for this unsupervised analysis is called CLSTM-AE, a Convolutional/Transposed-Convolutional Long Short-Term Memory Autoencoder; the architecture is a special type of autoencoder [13]. This approach enables real-time anomaly detection by identifying events that deviate significantly from the learned representation of normal video data.
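The reconstruction-error scoring described above can be sketched in a few lines (a minimal NumPy illustration: the real system reconstructs frames with the CLSTM-AE, and the threshold value is application-dependent):

```python
import numpy as np

def anomaly_scores(frames: np.ndarray, recon: np.ndarray) -> np.ndarray:
    """Per-frame reconstruction error: mean squared error over all pixels
    of each frame in a (num_frames, H, W) stack."""
    diff = (frames.astype(np.float64) - recon.astype(np.float64)) ** 2
    return diff.reshape(diff.shape[0], -1).mean(axis=1)

def flag_anomalies(frames: np.ndarray, recon: np.ndarray,
                   threshold: float) -> np.ndarray:
    """Boolean mask: True where the reconstruction error exceeds threshold."""
    return anomaly_scores(frames, recon) > threshold
```

Frames that the autoencoder reconstructs poorly (high score) are exactly the ones flagged as anomalous.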
The trained model compares the reconstructed video frames with the original input. For normal events, the reconstructed frames closely resemble the originals, with minimal differences in pixel values. However, when anomalies occur, the network's reconstruction becomes less accurate, which is reflected in blurry or distorted frames compared to the originals. By analyzing this reconstruction error (the difference between original and reconstructed frames), one can identify anomalies in real-time.

Figure 5: Anomaly detection in a parking lot.

4.3. Fall Detection

Falls, especially outdoors where help might be delayed, can lead to serious injuries. Traditional fall detection systems often rely on wearable sensors, which can be inconvenient or impractical. The SE4I project addresses this challenge with a camera-based fall detection system using an LSTM autoencoder. This system leverages anomaly detection techniques within a computer vision framework: it essentially learns what "normal" movement looks like and identifies deviations from this norm as potential falls. This approach offers several advantages, specifically: no need for wearable sensors, camera-based detection working with the existing surveillance infrastructure, and real-time alerts enabling faster response times.

4.3.1. Challenges

Traditional fall detection systems typically rely on wearable sensors or specialized depth cameras. These methods can be intrusive to users and costly to deploy on a large scale. On the other hand, relying solely on human observation through video footage is an option; however, this approach is labor-intensive and requires continuous monitoring.

4.3.2. Methods

Our method addresses the previous issues by enabling fall detection with a simpler setup: a standard RGB camera, eliminating the need for specialized equipment, and AI-powered detection that uses a single AI module running on a GPU to analyze the video stream for instances of falls. This approach eliminates the need for wearable devices and reduces the reliance on human intervention. Detection is to be performed in pedestrian areas and parks, so a dataset was created to fit this particular environment and to train the model on data representing the final context. The training dataset is illustrative of all the normal poses that people take while walking in places like pedestrian areas and parks. The scenes were therefore captured with a fixed camera about 3 meters above the ground, pointing across an open space and covering walking, standing poses, and running events in all directions, with and without obstacles/occlusions. The idea of training a model with only "normal events" is important because, in nature, abnormal events (falls) are very rare and therefore expensive to acquire.
The data preprocessing pipeline uses the OpenPose framework [14] to extract skeleton keypoints from each frame. From each skeleton, irrelevant keypoints are removed, as they are considered noisy, and skeletons with significant missing keypoints are filtered out. Keypoint coordinates are then normalized using min/max normalization and discretized into coarser bins to provide numerical stability for the training phase. Finally, the data are shaped into time windows using sequences of 75 skeletal frames. Such windows form the basic unit on which the AI model operates. Since the video stream is captured at 25 FPS, working with 75-frame windows means analyzing human behavior over 3-second actions. An overlap of 25 frames between consecutive windows is also included to maintain continuity between the windows themselves. The model is based on an LSTM autoencoder [15, 16]. The execution time when running on a consumer GPU allows for real-time performance (see Fig. 6). Once the model has learned normal human behavior patterns, it can be used to reconstruct time windows. Reconstruction and input data are then compared; if the reconstruction error exceeds a certain threshold, deviating significantly from normal data, the input is flagged as a fall event. Overall, the results highlight the effectiveness of using learned temporal skeletal patterns for robust anomaly detection in the context of outdoor fall detection.

Figure 6: Pipeline components execution times.
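The sliding-window construction described above (75-frame windows with a 25-frame overlap, i.e., a stride of 50 frames at 25 FPS) can be sketched as follows; the per-frame keypoint array shape is illustrative:

```python
import numpy as np

WINDOW = 75                 # 3 s of skeletons at 25 FPS
OVERLAP = 25                # frames shared between consecutive windows
STRIDE = WINDOW - OVERLAP   # 50-frame step between window starts

def make_windows(skeletons: np.ndarray) -> np.ndarray:
    """Slice a (num_frames, num_keypoints, 2) skeleton sequence into
    overlapping 75-frame windows, the unit the autoencoder operates on."""
    windows = [skeletons[s:s + WINDOW]
               for s in range(0, len(skeletons) - WINDOW + 1, STRIDE)]
    if not windows:  # sequence shorter than one window
        return np.empty((0, WINDOW) + skeletons.shape[1:])
    return np.stack(windows)
```

For example, a 175-frame sequence yields windows starting at frames 0, 50, and 100, so consecutive windows share their last and first 25 frames.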
5. Parking Detection

Parking detection for SE4I requires the development of an automatic system that searches for free parking spaces in one of the parking areas within the industrial site and provides information to drivers who have requested a parking space. The Parking Guide and Information (PGI) system [17] has been adopted as a solution for the parking detection task using a monitoring system. The proposed PGI system consists of two main parts. The former is based on a deep learning instance segmentation model that detects all available free spaces in a parking lot. The latter is a client-server architecture that automatically guides drivers to the closest parking lot with the highest number of available spaces.

5.1. Challenges

Parking lot detection systems using video surveillance face several difficulties: (a) the impact of weather, e.g., low visibility caused by fog, rain, and snow can significantly decrease the accuracy of these systems, and harsh weather conditions can obscure parking lot boundaries in the video feed; (b) diverse parking lot data, i.e., training robust parking detection models requires a large dataset with a wide variety of scenarios, including variations in parking space layouts, weather conditions, camera angles, obstructions, parking lot types (e.g., open-air, multi-story), and lighting conditions (day/night); (c) real-time processing, that is, for practical applications the system needs to operate in real-time, which necessitates developing a light parking detection model that can run efficiently on available hardware.

5.2. Methods

This work conceived a model for parking lot detection using an instance segmentation approach. Yolact++ [18], which is an extension of Yolact [19], was trained with successful results on a novel dataset appropriately designed for this task. The dataset consists of 1395 images and 23600 manually annotated parking lots, and it was built by using a web-scraping approach. The images, taken from public access cameras, were selected to represent a variety of conditions (weather and lighting) and features (different camera angles, occlusions, shadows, presence of people or animals, camera heights, satellite imagery in 2D and 3D, different types of lines and colors), and different backgrounds. Parking lot detection and car detection are performed simultaneously to classify occupied and free parking lots on the basis of the IoU between the parking lot and car masks detected by the Yolact++ module: for IoU values greater than an IoU threshold, the system classifies parking lots as busy, and as free otherwise. The Yolact++ architecture is based on the RetinaNet architecture [20], using pre-trained ResNet-101 stages. In addition, Yolact++ introduces three improvements over the base model: a Fast Mask Re-Scoring Network stage, Deformable Convolutions with Intervals, and an Optimized Prediction Head. The selection of the Yolact++ architecture for the parking lot detection problem was motivated by the runtime requirements and the accuracy achieved by this instance segmentation model.
A client-server system called PGI has been developed. The clients include the drivers, administrators, and machine learning systems. Drivers can search for parking lots, while administrators can add, remove, and monitor parking lots. System operations are performed on the server side, which is built using PHP and MySQL for database storage. Clients connect to the server through a server interface using a Java Android app. The app provides various functionalities, such as guiding drivers to the nearest parking lot with available spaces using GPS, and monitoring areas using the Google StreetView API. The system presents favorable results, with low loss values and acceptable mAP for both the box and the mask, determined using a 0.5 IoU threshold (see Table 1 and Fig. 7).

Table 1: Results on Yolact++ after fine-tuning and testing on the custom dataset.

  Metric                     | Average Value
  ---------------------------|--------------
  Box Localization Loss      | 2.027
  Class Confidence Loss      | 1.604
  Mask Loss                  | 3.185
  Semantic Segmentation Loss | 0.125
  I Loss                     | 0.116
  Total Loss                 | 7.058
  mAP@0.50 Box               | 80.5
  mAP@0.50 Mask              | 76.62

6. Integration and Infrastructure

The five intelligent modules of the SE4I project, plate detection, face detection, anomaly detection, fall detection, and parking detection, are part of a larger system powered by a peer-to-peer network of NVIDIA Jetson Xavier devices mounted on multifunctional light poles. This setup ensures efficient and real-time processing of the data collected by the surveillance cameras, as the computation is performed in the field and each device shares data and JSON output with devices on other poles using a ZMQ publisher/subscriber pattern.
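The JSON payload exchanged between poles can be sketched as follows. The schema is invented for illustration (the paper does not specify the actual message format); in the running system these bytes would travel over the ZMQ publisher/subscriber sockets.

```python
import json
import time

def encode_event(module: str, event_type: str,
                 pole_id: int, location: str) -> bytes:
    """Build the JSON message a pole would publish on its ZMQ PUB socket.
    All field names here are hypothetical."""
    return json.dumps({
        "module": module,      # e.g., "fall_detection"
        "event": event_type,   # e.g., "fall"
        "pole": pole_id,       # identifier of the emitting smart pole
        "location": location,  # human-readable position of the event
        "timestamp": time.time(),
    }).encode("utf-8")

def decode_event(payload: bytes) -> dict:
    """Decode a message received on a ZMQ SUB socket of another pole."""
    return json.loads(payload.decode("utf-8"))
```

A language-neutral JSON envelope like this lets the Python and C++ modules on different poles consume each other's output without sharing code.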
Therefore, a dedicated module manages the surveillance camera stream and associated metadata, such as brightness, frame rate, contrast, etc. All modules are containerized as Docker solutions, allowing for flexible portability, easy installation, resilience, and scalable performance. The whole system is based on the Python and C++ programming languages, and the PyTorch, OpenCV, OpenPose, and Onvif libraries are used. This infrastructure guarantees the real-time requirements and the privacy of the video-monitored areas.

Figure 7: Visual results on test images. Masks are applied to the lots, each with a different color to better distinguish each instance. The associated probability score is printed on each mask.

Acknowledgments

This research was conducted as part of the Smart Energy Efficiency & Environment for Industry (SE4I) project, CUP I66G18000230005, funded by "Progetti di ricerca industriale e lo Sviluppo sperimentale nelle 12 aree di specializzazione individuate nel PNR 2015-2020, di cui al D.D. del 13 luglio 2017 n. 1735".

References

[1] A. Ferone, A. Maratea, Adaptive quick reduct for feature drift detection, Algorithms 14 (2021).
[2] A. Maratea, A. Ferone, Deep neural networks and explainable machine learning, in: WILF 2018, volume 11291 LNAI, 2019, pp. 253–256.
[3] S. Montazzolli, C. Jung, License plate detection and recognition in unconstrained scenarios, in: ECCV 2018, Springer Intl. Pub., 2018, pp. 593–609.
[4] A. Bochkovskiy, C. Wang, H. M. Liao, Yolov4: Optimal speed and accuracy of object detection, CoRR abs/2004.10934 (2020). arXiv:2004.10934.
[5] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, Ssd: Single shot multibox detector, in: ECCV 2016, Springer Intl. Pub., 2016, pp. 21–37.
[6] M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu, Spatial transformer networks, CoRR abs/1506.02025 (2015).
[7] R. Smith, An overview of the Tesseract OCR engine, in: ICDAR 2007, volume 2, 2007, pp. 629–633.
[8] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[9] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded convolutional networks, IEEE Signal Processing Letters 23 (2016) 1499–1503.
[10] V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874.
[11] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clustering, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.
[12] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, L. S. Davis, Learning temporal regularity in video sequences, 2016. arXiv:1604.04574.
[13] Y. S. Chong, Y. H. Tay, Abnormal event detection in videos using spatiotemporal autoencoder, 2017. arXiv:1701.01546.
[14] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, Y. Sheikh, Openpose: Realtime multi-person 2d pose estimation using part affinity fields, 2019. doi:10.1109/TPAMI.2019.2929257. arXiv:1812.08008.
[15] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997).
[16] M. A. Kramer, Nonlinear principal component analysis using autoassociative neural networks, AIChE Journal 37 (1991) 233–243.
[17] D. Acharya, W. Yan, K. Khoshelham, Real-time image-based parking occupancy detection using deep learning, Research@Locate 4 (2018) 33–40.
[18] C. Zhou, Yolact++: Better Real-Time Instance Segmentation, University of California, Davis, 2020.
[19] D. Bolya, C. Zhou, F. Xiao, Y. J. Lee, Yolact: Real-time instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9157–9166.
[20] T. Lin, P. Goyal, R. B. Girshick, K. He, P. Dollár, Focal loss for dense object detection, CoRR abs/1708.02002 (2017). URL: http://arxiv.org/abs/1708.02002. arXiv:1708.02002.