<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Ital-IA</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An integrated intelligent surveillance system for Industrial areas</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesco Camastra</string-name>
          <email>francesco.camastra@uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Ciaramella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Casolaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0009-0003-0680-4501</contrib-id>
          <string-name>Pasquale De Trino</string-name>
          <email>pasquale.detrino001@studenti.uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-4883-0164</contrib-id>
          <string-name>Alessio Ferone</string-name>
          <email>alessio.ferone@uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Hauber</string-name>
          <email>giovanni.hauber@studenti.uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gennaro Iannuzzo</string-name>
          <email>gennaro.iannuzzo001@studenti.uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Mariano Scarrica</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Junior Spoleto</string-name>
          <email>antoniojunior.spoleto001@studenti.uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonino Staiano</string-name>
          <email>antonino.staiano@uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Concetta Vitale</string-name>
          <email>mariaconcetta.vitale001@studenti.uniparthenope.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Plate Detection, Face Detection, Fall Detection, Parking Detection</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Science and Technology, Parthenope University of Naples, Centro Direzionale Isola C4</institution>
          ,
          <addr-line>Naples, 80143</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>4</volume>
      <fpage>29</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>This paper presents the design and implementation phases of a software prototype developed by the Parthenope University of Naples for the SE4I project (Smart Energy Efficiency &amp; Environment for Industry), funded by the "Progetti di ricerca industriale e lo Sviluppo sperimentale" programme (PNR 2015-2020). The prototype leverages advanced computer vision techniques based on deep learning architectures to address industrial security and monitoring needs. Specifically, the prototype tackles three key functionalities: (1) personnel and vehicle identification: the system recognizes authorized personnel and vehicle license plates within video streams captured in restricted industrial areas; (2) anomaly detection: the software can detect various anomalies in video feeds, including falls of personnel in monitored zones and unattended objects left in unauthorized areas; (3) smart parking management: the prototype identifies vacant parking spaces within camera-monitored zones, enabling efficient parking management. These functionalities are integrated into the software prototype, and its performance has been thoroughly evaluated.</p>
      </abstract>
      <kwd-group>
        <kwd>Plate Detection</kwd>
        <kwd>Face Detection</kwd>
        <kwd>Fall Detection</kwd>
        <kwd>Parking Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The SE4I project addresses the security and monitoring needs of a designated industrial area by implementing a real-time video monitoring system. This system uses strategically placed smart poles equipped with RGB cameras to capture video streams. The project focuses on three key functionalities (see Fig. 1): (a) authorized access control: the system will recognize individuals and vehicle license plates, ensuring that only authorized personnel and vehicles can enter the area, likely through controlled access points; (b) anomaly detection: the system will detect anomalous events in the video feeds, such as falls of personnel in the monitored zones; (c) smart parking management: due to the size of the area, automated and intelligent parking lot monitoring is crucial, and the system will inform users about free parking spaces as they approach the designated parking areas. In the context of performing learning tasks from video streams in surveillance and security applications, the state of the art is represented by computer vision techniques based on the use of deep learning
        [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. In the subsequent sections, we discuss the proposed solutions for each of the aforementioned tasks.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Plate Detection</title>
      <p>The objective is to recognize vehicles and their license plates from a surveillance video feed and retrieve the associated alphanumeric sequence. This sequence is subsequently utilized for additional vehicle recognition within the system.</p>
      <sec id="sec-2-1">
        <title>2.1. Challenges</title>
        <p>Automatic license plate recognition faces several hurdles, specifically: (a) variable lighting: extreme brightness, low light, and shadows can significantly reduce plate visibility, and the system addresses this with techniques such as adaptive thresholding and contrast enhancement; (b) car position: vehicles approach cameras at various angles and distances, so sophisticated algorithms are required for accurate plate localization and perspective correction to account for these variations; (c) occlusions: objects such as bumpers, dirt, or even other vehicles can partially or fully obscure the plate, and robust object detection and diverse training data are crucial to overcome these occlusions; (d) font diversity: license plate formats and fonts differ significantly across countries and even regions, so training models on a wide variety of datasets is essential for generalization across different plate styles.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Methods</title>
        <p>
          The proposed approach consists of three key steps: vehicle detection, license plate (LP) detection, and optical character recognition (OCR), as shown in Figure 2. In the first step, the system detects vehicles in the scene using a dedicated module. To balance computation time and performance, we chose YoloV4 [4]; for the classification problem, we treated the network as a closed system, consolidating the outputs specifically related to vehicles such as cars, buses, and motorcycles, while ignoring outputs related to other classes. Within each detected vehicle region, the Warped Planar Object Detection Network (WPOD-NET)
          [<xref ref-type="bibr" rid="ref3">3</xref>] is used to search for license plates. WPOD-NET performs affine transformations to rectify the LP area to resemble a frontal view; its design, responsible for warping the license plate into a rectangular shape, was influenced by insights from YOLO, SSD [5], and Spatial Transformer Networks (STN) [6]. These detections are then passed to an OCR network for accurate character recognition and extraction. In our OCR module, we used Tesseract [7], an optical character recognition engine fine-tuned on our license plate character dataset. Tesseract's advantage over a simple CNN lies in its recurrent neural network (RNN) architecture [8], which takes into account the sequential nature of the characters on a license plate; this allows for accurate recognition, as the RNN captures the contextual dependencies between the characters. Tesseract's extensive training on diverse datasets makes it robust to the different font styles, sizes, and noise levels commonly found in license plate images.
        </p>
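        <p>As an illustration of this three-step flow, a minimal sketch in Python is given below; the detect_vehicles and detect_plate_corners callables are hypothetical placeholders for the YoloV4 and WPOD-NET modules, and only the OCR step relies on an actual library (the pytesseract bindings), so the snippet should be read as a simplified outline rather than the deployed implementation.</p>
        <preformat>
# Sketch of the vehicle -> license plate -> OCR chain described above.
# detect_vehicles() and detect_plate_corners() are hypothetical stand-ins
# for the YoloV4 and WPOD-NET modules.
import cv2
import numpy as np
import pytesseract

def rectify_plate(frame, corners, out_w=240, out_h=80):
    """Warp the quadrilateral plate region to a frontal, rectangular view."""
    src = np.float32(corners)                                  # 4 corner points (x, y)
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    M = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(frame, M, (out_w, out_h))

def read_plates(frame, detect_vehicles, detect_plate_corners):
    """Return the plate strings found in one video frame."""
    plates = []
    for (x, y, w, h) in detect_vehicles(frame):                # cars, buses, motorcycles
        roi = frame[y:y + h, x:x + w]
        corners = detect_plate_corners(roi)                    # plate quadrilateral, if any
        if corners is None:
            continue
        gray = cv2.cvtColor(rectify_plate(roi, corners), cv2.COLOR_BGR2GRAY)
        # Single text line with a restricted alphabet, as typical for plates.
        config = "--psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
        plates.append(pytesseract.image_to_string(gray, config=config).strip())
    return plates
        </preformat>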
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Execution</title>
        <p>The modules are designed for real-time execution in an embedded system environment, given the strict time constraints imposed by vehicle identification. Fig. 3 illustrates the time required for the execution steps.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Face Recognition</title>
      <p>Our research team has developed a framework for secure
access control in the industrial area as a smart city
environment. The framework leverages surveillance cameras
positioned at entry points to industrial areas. It aims to
match detected faces with the license plates of
associated vehicles. This ensures that the driver corresponds
to the registered vehicle owner, improving access control
security.</p>
      <sec id="sec-3-1">
        <title>3.1. Challenges</title>
        <p>The framework faces different issues, particularly with on-board processing. In addition, reflective surfaces and occlusions caused by sunlight on vehicle windshields can hinder facial recognition.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Methods</title>
        <p>The face recognition process is divided into two main steps, namely face localization and face classification. In the localization step, faces are accurately localized within an image by the Multi-task Cascaded Convolutional Networks (MTCNN) algorithm [9]. Its unique cascade structure consists of three stages: the Proposal Network (P-Net), the Refinement Network (R-Net), and the Output Network (O-Net). By simultaneously performing multiple tasks, such as face detection, bounding box regression, and facial landmark localization, MTCNN ensures thorough and accurate face identification. In particular, it excels at detecting faces at different scales and orientations while maintaining impressive computational efficiency, making it ideal for real-time applications.</p>
        <p>For face classification, an extremely effective approach combines two different algorithms. The first performs face alignment with an ensemble of regression trees [10]: using the ensemble, the algorithm predicts the positions of facial landmarks directly from image data, bypassing traditional optimization methods. The second uses FaceNet [11], which efficiently maps a face into a continuous embedding space, i.e., it converts the face into a 128-feature embedding vector. This vector is then matched to a face in the database using a one-shot approach (see Fig. 4 for an example of a qualitative result on a test image).</p>
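        <p>A minimal sketch of the one-shot matching step is given below; it assumes the 128-dimensional FaceNet embeddings have already been computed, and the gallery layout and the distance threshold value are illustrative assumptions rather than the parameters of the deployed system.</p>
        <preformat>
# One-shot matching of a probe face embedding against a gallery of enrolled
# identities (one 128-d FaceNet embedding per authorized person).
# The 0.9 distance threshold is an illustrative value only.
import numpy as np

def match_face(probe_emb, gallery, threshold=0.9):
    """gallery: dict mapping person id -> 128-d embedding (np.ndarray)."""
    # Nearest enrolled identity in the embedding space (Euclidean distance).
    best_id = min(gallery, key=lambda pid: np.linalg.norm(probe_emb - gallery[pid]))
    best_dist = np.linalg.norm(probe_emb - gallery[best_id])
    # Accept the match only if the closest gallery face is near enough.
    return best_id if threshold > best_dist else None
        </preformat>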
      </sec>
    </sec>
    <sec id="sec-3a">
      <title>4. Anomaly Detection</title>
      <sec id="sec-3a-1">
        <title>4.1. Challenges</title>
        <p>Anomaly recognition in video streams presents a significant challenge: identifying rare and short-lived events that deviate from the norm. These anomalies often occur for just a few seconds, making them difficult for humans to detect and nearly impossible to capture in a single, universal model. The vast number of possible anomaly types, locations, and contexts makes defining a comprehensive model impractical; it would require an enormous amount of data and manual effort. A more effective approach is to train models that can differentiate between normal and abnormal activity, regardless of the specific anomaly type. This approach leverages the fact that normal behavior typically occurs far more frequently than anomalies.</p>
      </sec>
      <sec id="sec-3a-2">
        <title>4.2. Methods</title>
        <p>Following autoencoder-based approaches to video anomaly detection [12, 13], the module learns a representation of normal video data, so that anomalous events deviate significantly from the learned representation. The trained model compares the reconstructed video frames with the original input. For normal events, the reconstructed frames closely resemble the originals, with minimal differences in pixel values. However, when anomalies occur, the network's reconstruction becomes less accurate, which is reflected in blurry or distorted frames compared to the originals. By analyzing this reconstruction error (the difference between original and reconstructed frames), one can identify anomalies in real time. Figure 5 shows an example of anomaly detection in a parking lot.</p>
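        <p>A minimal sketch of the reconstruction-error criterion is given below; the model is any trained autoencoder mapping frames to reconstructed frames, and the threshold value is an illustrative assumption rather than the one used in the project.</p>
        <preformat>
# Per-frame anomaly score as the reconstruction error of a trained autoencoder.
import numpy as np

def anomaly_scores(frames, model):
    """frames: array of shape (N, H, W, C), normalized to [0, 1]."""
    reconstructed = model(frames)
    # Mean squared error per frame: large errors correspond to blurry or
    # distorted reconstructions, i.e. events not learned as "normal".
    err = (frames - reconstructed) ** 2
    return err.reshape(err.shape[0], -1).mean(axis=1)

def flag_anomalies(frames, model, threshold=0.01):
    return anomaly_scores(frames, model) > threshold   # boolean mask of anomalous frames
        </preformat>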
      </sec>
      <sec id="sec-3a-3">
        <title>4.3. Fall Detection</title>
        <p>Falls, especially outdoors where help might be delayed, can lead to serious injuries. Traditional fall detection systems often rely on wearable sensors, which can be inconvenient or impractical. The SE4I project addresses this challenge with a camera-based fall detection system using an LSTM autoencoder. This system leverages anomaly detection techniques within a computer vision framework: it essentially learns what "normal" movement looks like and identifies deviations from this norm as potential falls. This approach offers several advantages, specifically no need for wearable sensors, camera-based detection that works with the existing surveillance infrastructure, and real-time alerts enabling faster response times.</p>
        <sec id="sec-3a-3-1">
          <title>4.3.1. Challenges</title>
          <p>Traditional fall detection systems typically rely on wearable sensors or specialized depth cameras. These methods can be intrusive to users and costly to deploy on a large scale. On the other hand, relying solely on human observation through video footage is an option; however, this approach is labor-intensive and requires continuous monitoring.</p>
        </sec>
        <sec id="sec-3a-3-2">
          <title>4.3.2. Methods</title>
          <p>Our method addresses the previous issues by enabling fall detection with a simpler setup, a standard RGB camera, eliminating the need for specialized equipment, and with AI-powered detection that uses a single AI module running on a GPU to analyze the video stream for instances of falls. This approach eliminates the need for wearable devices and reduces the reliance on human intervention.</p>
          <p>Detection is to be performed in pedestrian areas and parks, so a dataset was created to fit this particular environment and to train the model on data representing the final context. The training dataset is illustrative of all the normal poses that people take while walking in places like pedestrian areas and parks. The scenes were therefore captured with a fixed camera about 3 meters above the ground, pointing across an open space and covering walking, standing, and running events in all directions, with or without obstacles/occlusions. The idea of training a model with only "normal events" is important because, in nature, abnormal events (falls) are very rare and therefore expensive to acquire.</p>
          <p>The data preprocessing pipeline uses the OpenPose framework [14] to extract skeleton keypoints from each frame. From each skeleton, irrelevant keypoints are removed as they are considered noisy, and skeletons with significant missing keypoints are filtered out. Keypoint coordinates are then normalized using min/max normalization and discretized into coarser bins to provide numerical stability for the training phase. Finally, the data are shaped into time windows using sequences of 75 skeletal frames; such windows form the basic unit on which the AI model operates. Since the video stream is supposed to be captured at 25 FPS, working with 75-frame windows means analyzing human behavior over 3-second actions. An overlap of 25 frames between consecutive windows is also included to maintain continuity between the windows themselves.</p>
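          <p>The windowing step can be sketched as follows (assuming the keypoints have already been filtered, normalized, and discretized); note that a 25-frame overlap between 75-frame windows corresponds to advancing the window start by 50 frames.</p>
          <preformat>
# Shape per-frame skeleton keypoints into overlapping time windows:
# 75 frames per window (3 s at 25 FPS), 25 shared frames between
# consecutive windows, i.e. a stride of 50 frames.
import numpy as np

def make_windows(keypoints, window=75, overlap=25):
    """keypoints: array of shape (num_frames, num_features)."""
    stride = window - overlap
    windows, start = [], 0
    while len(keypoints) >= start + window:
        windows.append(keypoints[start:start + window])
        start += stride
    return np.stack(windows) if windows else np.empty((0, window, keypoints.shape[1]))
          </preformat>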
          <p>The model is based on an LSTM autoencoder [15, 16]. The execution time when running on a consumer GPU allows for real-time performance; Figure 6 reports the execution times of the pipeline components. Once the model has learned normal human behavior patterns, it can be used to reconstruct time windows. Reconstruction and input data are then compared: if the reconstruction error exceeds a certain threshold, and thus deviates significantly from normal data, the input is flagged as a fall event. Overall, the results highlight the effectiveness of using learned temporal skeletal patterns for robust anomaly detection in the context of outdoor fall detection.</p>
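          <p>A minimal PyTorch sketch of such an LSTM autoencoder is given below; the hidden size, the single-layer design, and the feature dimension are illustrative assumptions and not the configuration used in the project.</p>
          <preformat>
# Minimal LSTM autoencoder over skeleton-keypoint windows (PyTorch sketch).
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features, hidden_size=64):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.output = nn.Linear(hidden_size, n_features)

    def forward(self, x):                    # x: (batch, 75, n_features)
        _, (h, _) = self.encoder(x)          # summary of the whole window
        # Repeat the window summary at every time step and decode it back.
        z = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)
        decoded, _ = self.decoder(z)
        return self.output(decoded)          # reconstructed window

# Training minimizes nn.MSELoss() between windows and their reconstructions on
# "normal" data only; at inference, a window whose reconstruction error exceeds
# the chosen threshold is flagged as a fall.
model = LSTMAutoencoder(n_features=26)       # e.g. 13 keypoints x 2 coordinates
          </preformat>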
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Parking Detection</title>
      <p>Parking detection for SE4I requires the development of an automatic system that searches for free parking spaces in one of the parking areas within the industrial area and provides information to drivers who have requested a parking space. The Parking Guidance and Information (PGI) system [17] has been adopted as a solution for the parking detection task using a monitoring system. The proposed PGI system consists of two main parts. The former is based on a deep learning instance segmentation model that detects all available free spaces in a parking lot. The latter is a client-server architecture that automatically guides drivers to the closest parking lot with the highest number of available spaces.</p>
      <sec id="sec-4-1">
        <title>5.1. Challenges</title>
        <p>Parking lot detection systems using video surveillance face several difficulties: (a) the impact of weather, e.g., low visibility caused by fog, rain, and snow can significantly decrease the accuracy of these systems, and harsh weather conditions can obscure parking lot boundaries in the video feed; (b) diverse parking lot data, i.e., training robust parking detection models requires a large dataset with a wide variety of scenarios, including variations in parking space layouts, weather conditions, camera angles, obstructions, parking lot types (e.g., open-air, multi-story), and lighting conditions (day/night); (c) real-time processing, that is, for practical applications the system needs to operate in real time, which necessitates developing a lightweight parking detection model that can run efficiently on the available hardware.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Methods</title>
        <p>This work conceived a model for parking lot detection using an instance segmentation approach. Yolact++ [18], an extension of Yolact [19], was trained with successful results on a novel dataset appropriately designed for this task. The dataset consists of 1395 images and 23600 manually annotated parking lots, and it was built using a web-scraping approach. The images, taken from public-access cameras, were selected to represent a variety of conditions (weather and lighting) and features (different camera angles, occlusions, shadows, presence of people or animals, camera heights, satellite imagery in 2D and 3D, different types of lines and colors, and different backgrounds).</p>
        <p>Parking lot detection and car detection are performed simultaneously to classify parking lots as occupied or free on the basis of the IoU between the parking lot masks and the car masks detected by the Yolact++ module. For IoU values greater than a given threshold, the system classifies a parking lot as busy; otherwise, it is classified as free.</p>
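        <p>A minimal sketch of this classification rule is given below; the masks are assumed to be boolean arrays produced by the instance segmentation module, and the threshold value is illustrative rather than the one used in the project.</p>
        <preformat>
# Classify each detected parking space as busy or free from the IoU between
# its mask and the detected car masks (boolean arrays over the image grid).
import numpy as np

def mask_iou(mask_a, mask_b):
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def classify_lots(lot_masks, car_masks, iou_threshold=0.3):
    status = []
    for lot in lot_masks:
        best_iou = max((mask_iou(lot, car) for car in car_masks), default=0.0)
        status.append("busy" if best_iou > iou_threshold else "free")
    return status
        </preformat>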
        <p>The Yolact++ architecture is based on the RetinaNet architecture [20], using pre-trained ResNet-101 stages. In addition, Yolact++ introduces three improvements over the base model: a Fast Mask Re-Scoring Network stage, Deformable Convolutions with Intervals, and an Optimized Prediction Head. The selection of the Yolact++ architecture for the parking lot detection problem was motivated by the runtime requirements and the accuracy achieved by this instance segmentation model.</p>
        <p>A client-server system called PGI has been developed. The clients include the drivers, the administrators, and the machine learning systems. Drivers can search for parking lots, while administrators can add, remove, and monitor parking lots. System operations are performed on the server side, which is built using PHP and MySQL for database storage. Clients connect to the server through a server interface using a Java Android app. The app provides various functionalities, such as guiding drivers to the nearest parking lot with available spaces using GPS, and monitoring areas using the Google StreetView API. The system presents favorable results, with low loss values and an acceptable mAP for both the box and the mask, determined using a 0.5 IoU threshold (see Table 1 and Fig. 7).</p>
        <p>Table 1 reports the training metrics (box localization loss, class confidence loss, mask loss, semantic segmentation loss, and total loss) together with mAP@0.50 for boxes and masks. Figure 7 shows visual results on test images: masks are applied to the lots, each with a different color to better distinguish the individual instances, and the associated probability score is printed on each mask.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Integration and Infrastructure</title>
      <p>The five intelligent modules of the SE4I project (plate detection, face detection, anomaly detection, fall detection, and parking detection) are part of a larger system powered by a peer-to-peer network of NVIDIA Jetson Xavier devices mounted on multifunctional light poles. This setup ensures efficient and real-time processing of the data collected by the surveillance cameras, as the computation is performed in the field and each device shares data and JSON output with devices on other poles using a ZMQ publisher/subscriber pattern. A dedicated module manages the surveillance camera stream and the associated metadata, such as brightness, frame rate, and contrast. All modules are containerized as Docker solutions, allowing for flexible portability, easy installation, resilience, and scalable performance. The whole system is based on the Python and C++ programming languages, and the PyTorch, OpenCV, OpenPose, and ONVIF libraries are used. This infrastructure guarantees the real-time requirements and the privacy of the video-monitored areas.</p>
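      <p>A minimal sketch of the publish/subscribe exchange between pole devices is given below (using the pyzmq bindings); the port number, the topic name, and the peer addresses are illustrative assumptions rather than the project's configuration.</p>
      <preformat>
# Minimal ZeroMQ publisher/subscriber sketch for sharing JSON results
# between pole devices; port, topic, and peer addresses are placeholders.
import json
import zmq

def make_publisher(port=5556):
    sock = zmq.Context.instance().socket(zmq.PUB)
    sock.bind("tcp://*:%d" % port)
    return sock

def publish_result(sock, topic, payload):
    # Multipart message: a topic frame for SUB-side filtering, then the JSON body.
    sock.send_multipart([topic.encode(), json.dumps(payload).encode()])

def make_subscriber(peer_addresses, topic=""):
    sock = zmq.Context.instance().socket(zmq.SUB)
    for addr in peer_addresses:               # e.g. "tcp://pole-2:5556"
        sock.connect(addr)
    sock.setsockopt(zmq.SUBSCRIBE, topic.encode())
    return sock
      </preformat>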
    </sec>
  </body>
  <back>
    <ack>
      <title>Acknowledgments</title>
      <p>This research was conducted as part of the Smart Energy Efficiency &amp; Environment for Industry (SE4I) project, CUP 6618000230005, funded by "Progetti di ricerca industriale e lo Sviluppo sperimentale nelle 12 aree di specializzazione individuate nel PNR 2015-2020, di cui al D.D. del 13 luglio 2017 n. 1735".</p>
    </ack>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maratea</surname>
          </string-name>
          ,
          <article-title>Adaptive quick reduct for feature drift detection</article-title>
          ,
          <source>Algorithms</source>
          <volume>14</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Maratea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferone</surname>
          </string-name>
          ,
          <article-title>Deep neural networks and explainable machine learning</article-title>
          ,
          <source>in: WILF</source>
          <year>2018</year>
          , volume
          <volume>11291</volume>
          LNAI,
          <year>2019</year>
          , p.
          <fpage>253</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Montazzolli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <article-title>License plate detection and recognition in unconstrained scenarios</article-title>
          ,
          <source>in: ECCV 2018</source>
          , Springer Intl. Pub.,
          <year>2018</year>
          , pp.
          <fpage>593</fpage>
          -
          <lpage>609</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Bochkovskiy, C. Wang, H. M. Liao, Yolov4: Optimal speed and accuracy of object detection, CoRR abs/2004.10934 (2020). arXiv:2004.10934.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, SSD: Single shot multibox detector, in: ECCV 2016, Springer Intl. Pub., 2016, pp. 21–37.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu, Spatial transformer networks, CoRR abs/1506.02025 (2015).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] R. Smith, An overview of the Tesseract OCR engine, in: ICDAR 2007, volume 2, 2007, pp. 629–633.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded convolutional networks, IEEE Signal Processing Letters 23 (2016) 1499–1503.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A unified embedding for face recognition and clustering, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, L. S. Davis, Learning temporal regularity in video sequences, 2016. arXiv:1604.04574.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y. S. Chong, Y. H. Tay, Abnormal event detection in videos using spatiotemporal autoencoder, 2017. arXiv:1701.01546.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, Y. Sheikh, OpenPose: Realtime multi-person 2D pose estimation using part affinity fields, 2019. doi:10.1109/TPAMI.2019.2929257. arXiv:1812.08008.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. A. Kramer, Nonlinear principal component analysis using autoassociative neural networks, AIChE Journal 37 (1991) 233–243.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] D. Acharya, W. Yan, K. Khoshelham, Real-time image-based parking occupancy detection using deep learning, Research@Locate 4 (2018) 33–40.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] C. Zhou, Yolact++: Better Real-Time Instance Segmentation, University of California, Davis, 2020.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] D. Bolya, C. Zhou, F. Xiao, Y. J. Lee, Yolact: Real-time instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9157–9166.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] T. Lin, P. Goyal, R. B. Girshick, K. He, P. Dollár, Focal loss for dense object detection, CoRR abs/1708.02002 (2017). URL: http://arxiv.org/abs/1708.02002. arXiv:1708.02002.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>