<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Increased frame rate for Crowd Counting in Enclosed Spaces using GANs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adriano Puglisi</string-name>
          <email>puglisi@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Fiani</string-name>
          <email>fiani@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio De Magistris</string-name>
          <email>demagistris@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sapienza University of Rome</institution>
          ,
          <addr-line>via Ariosto 25, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>39</fpage>
      <lpage>45</lpage>
      <abstract>
<p>An efficient computer system for regulating and monitoring the density of people in confined areas is very helpful. It becomes imperative to implement a solution that takes into account the processing power and pre-installed hardware available in these places. Computer vision, in particular the use of regular CCTV cameras augmented by neural networks, solves the problem of precisely counting individuals in enclosed spaces. We describe a control system specifically designed for this goal, maximizing the capabilities of current infrastructure and enhancing neural networks to achieve higher frame rates.</p>
      </abstract>
      <kwd-group>
<kwd>Computer vision</kwd>
        <kwd>Tracking</kwd>
        <kwd>YOLO</kwd>
        <kwd>SORT</kwd>
        <kwd>Generative Adversarial Network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>
        In many enclosed spaces, crowd capacity management is a common challenge due to strict occupancy limits. These limits are critical for safety and regulatory compliance. To address this issue, we propose to leverage CCTV cameras to count people within a confined area more accurately. Using advanced video analytics, our system aims to provide real-time monitoring, helping companies and institutions maintain optimal audience density and ensure a safe environment [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. The main solutions proposed in recent years for indoor human tracking use depth cameras to acquire the position; however, this technology is in some cases expensive or simply not available. Modern computer vision algorithms allow the development of systems that use a simple two-dimensional camera to calculate the depth, and therefore the position, of objects or people in space [<xref ref-type="bibr" rid="ref3">3</xref>]. Since such cameras usually have poor FPS values to save storage space, we combine them with a neural network based on the GAN framework to increase their frame rate. The interpolation of frames through neural networks is an important and complex problem: the datasets used are often very large and the networks very deep. Even though these networks achieve remarkable results, they have a very high computational cost and can often be trained only on expensive or unavailable hardware. For this reason, we chose to bias the neural network using a task-specific dataset, containing only walking pedestrians, to obtain faster convergence of our network. In the last few years, the GAN framework [<xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>] brought a small revolution in the neural networks field. Such a framework can be adapted to a series of different tasks; in particular, it is broadly used for the super-resolution of signals such as images, videos, and audio and, generally speaking, for recreating or reconstructing lost parts of signals. Given the potential of this framework, we decided to implement a GAN for increasing the frame rate of CCTV footage. The whole project tries to exploit techniques that save hardware resources, allowing it to be used in as many environments as possible, including those with medium-low computing power. Security in closed spaces and the tracking of people have an ever greater impact on the management of common spaces and crowded places, and the use of advanced IT systems can allow greater, more effective, and more efficient control. Maintaining a sensible trade-off between the necessary hardware resources and the results obtained was an important point in developing our work.
      </p>
    </sec>
    <sec id="sec-rw">
      <title>2. Related Works</title>
      <sec id="sec-rw-1">
        <title>2.1. Human tracking</title>
        <p>The problem of human tracking and positioning is a well-known subject in computer vision. It can be useful in different situations, such as crowd control, monitoring public areas, security, and so on [<xref ref-type="bibr" rid="ref7 ref8 ref9 ref10 ref11">7, 8, 9, 10, 11</xref>]. We mainly focus on its usage in indoor environments.</p>
        <p>Some research [<xref ref-type="bibr" rid="ref12">12</xref>] uses top-view depth cameras, subtracting the average image, consisting of the floor and the furniture, segmenting the moving objects, and trying to match them with a top-view model of a person; the projection distortion is then corrected, obtaining the position on the plane. Similarly, in [<xref ref-type="bibr" rid="ref13">13</xref>] fisheye top-view cameras segment the moving objects from the static background using an adaptive GMM and correct the projective distortion to find the position. Even though those approaches can be effective, we want to use cameras that are usually positioned on the wall instead of the ceiling. Other papers [<xref ref-type="bibr" rid="ref14">14</xref>] use 3D cameras to obtain an ortho-image to find objects in a scene; while this approach could be extended to our needs, it requires more sophisticated cameras with depth vision, which CCTV cameras are not equipped with.</p>
      </sec>
      <sec id="sec-1-1">
        <title>2.2. Frame-rate increase</title>
<p>The computer vision community has given significant attention to the necessity of increasing the frame rate and, consequently, to video frame interpolation. Many uses for this task exist, including the creation of slow motion and frame recovery for video streaming and gaming. High-frame-rate videos are visually more pleasing to watch because they may avoid typical glitches like temporal jittering and motion blurriness. Several techniques have been used to overcome the issue of getting intermediate frames from a limited collection, including frame interpolation and, more recently, DNNs. In frame interpolation techniques, intermediate frames are generated between the present frames using interpolation, as in the method proposed by Choi et al. [<xref ref-type="bibr" rid="ref15">15</xref>], based on Bilateral Motion Estimation and Adaptive Overlapped Block Motion Compensation. Also, a wide variety of DNN methods were proposed; recently, Flow-Agnostic Video Representations for Fast Frame Interpolation (FLAVR) [<xref ref-type="bibr" rid="ref16">16</xref>] addressed the problem using an autoencoder based on 3D space-time convolutions, enabling end-to-end learning and inference. With no extra inputs needed in the form of depth maps or optical flow, this technique effectively learns to reason about non-linear movements, complicated occlusions, and temporal abstractions, leading to enhanced performance. Depth-Aware Video Frame Interpolation [17] is another notable DNN technique: it synthesizes intermediate flows that preferentially sample items closer to the viewer by introducing a depth-aware flow projection layer. To synthesize the output frame, this approach uses the optical flow and local interpolation kernels to warp input frames, depth maps, and contextual features; hierarchical features are utilized to extract contextual information from nearby pixels.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Proposed method</title>
      <p>In this section, we describe the methods and the algorithms used to analyze the images, detect and track people inside the scene, localize them on the plane, and increase the frame rate.</p>
      <sec id="sec-2-1">
        <title>3.3. Spatial Localization</title>
      </sec>
      <sec id="sec-2-2">
        <title>3.1. Detection</title>
        <p>YOLO [18] is the neural network framework we used for
detecting persons in the scene, it is extremely popular and</p>
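        <p>As an illustration of this detection step, the following minimal sketch queries a pre-trained YOLOv8 model for person detections through the ultralytics Python package; the weight file name and the confidence threshold are illustrative assumptions, not the exact values used in our experiments.</p>
        <preformat>
# Minimal person-detection sketch (assumes the ultralytics package).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # illustrative: any of the tested v8 models

def detect_people(frame, conf=0.4):
    """Return [x1, y1, x2, y2, score] boxes for the 'person' class."""
    boxes = []
    for result in model(frame, imgsz=416, conf=conf, verbose=False):
        for box in result.boxes:
            if int(box.cls) == 0:  # COCO class 0 is 'person'
                x1, y1, x2, y2 = box.xyxy[0].tolist()
                boxes.append([x1, y1, x2, y2, float(box.conf)])
    return boxes
        </preformat>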
        <sec id="sec-2-2-1">
          <title>3.3.1. Camera Model</title>
          <p>The finite projective camera, denoted as  , is
characterized by its intrinsic and extrinsic parameters, given
by:
 = [ | −  ˜] = [| − ˜]
Here,  describes the orientation of the camera and ˜
is the world position of the camera center.  is the
calibration matrix and since the resolution is the same in
both the x and y directions, the calibration matrix can be
defined as:
to train them at the same time, improving their
performances to obtain a good model that generates the missing
frames.</p>
        </sec>
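        <p>To make the association step concrete, here is a simplified sketch of the IoU-plus-Hungarian matching described above, using scipy.optimize.linear_sum_assignment; it deliberately omits the Kalman prediction and update machinery, so it is an illustration rather than the SORT implementation used in the project.</p>
        <preformat>
# Simplified IoU + Hungarian association sketch (not the full SORT code).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_min=0.3):
    """Match predicted track boxes to detections; return matches and leftovers."""
    if len(tracks) == 0 or len(detections) == 0:
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.zeros((len(tracks), len(detections)))
    for t, trk in enumerate(tracks):
        for d, det in enumerate(detections):
            cost[t, d] = -iou(trk, det)  # Hungarian minimizes cost
    rows, cols = linear_sum_assignment(cost)
    matches = [(t, d) for t, d in zip(rows, cols) if -cost[t, d] >= iou_min]
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_t = [t for t in range(len(tracks)) if t not in matched_t]
    unmatched_d = [d for d in range(len(detections)) if d not in matched_d]
    return matches, unmatched_t, unmatched_d
        </preformat>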
        <sec id="sec-2-2-2">
          <title>3.4.1. Network architecture</title>
          <p>
            The generator takes as input two pictures of size ( ×
 × 3). To minimize its dimensions, the encoder
employs two-dimensional convolutional layers with a stride
with  being the focal length andcan be obtained using of two using a UNet [24]. LeakyReLU is the activation
the formula: function, and its slope is 0.2. On the other hand, the
 decoder uses the LeakyReLU activation function with a
 = 2 * 2(   ) slope of 0.2 and consists of several 2-dimensional
con2 volutional layers with a stride of 2. The ℎ activation
Where   is the field of view. Typically, obtain- function is used in the final output layer to make sure that
ing these parameters requires camera calibration using the outputs are inside the [
            <xref ref-type="bibr" rid="ref1">− 1, 1</xref>
            ] range. The same input
methods like Zhang’s method [22]. However, in a simu- as the generator, concatenated with the produced output
lator environment, all parameters can be derived from  or the genuine frame , is fed into the
discrimithe properties of the involved objects. nator, which is built like a CNN. Table 1 summarizes the
architecture.
          </p>
        </sec>
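          <p>In a simulator, K can therefore be assembled directly. The following minimal sketch, assuming a known horizontal field of view and image size (the values below are illustrative), builds the calibration matrix with NumPy following the formula above.</p>
          <preformat>
# Build the calibration matrix K from the field of view (a sketch;
# in a real deployment f would come from calibration, e.g. Zhang [22]).
import numpy as np

def calibration_matrix(width, height, fov_rad):
    """K for a pinhole camera with equal focal length on both axes."""
    f = width / (2.0 * np.tan(fov_rad / 2.0))  # f = w / (2 tan(FOV/2))
    return np.array([[f, 0.0, width / 2.0],
                     [0.0, f, height / 2.0],
                     [0.0, 0.0, 1.0]])

K = calibration_matrix(1280, 720, np.deg2rad(90.0))  # illustrative values
          </preformat>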
        <sec id="sec-2-2-3">
          <title>3.3.2. Inverting Projective Transformation</title>
          <p>Summarizing, the 3 × 4 camera matrix 
transforms image coordinates (, , 1) to scene coordinates
(, , , 1) . To obtain the scene coordinates from
image coordinates, we aim to invert  , considering that
perspective projection is not injective. Assuming knowledge
of the distance from the ground (height of the person),
we utilize the pseudo-inverse  + of  . Two points on
the back-projected ray are identified: the camera center
 and the point  +. The ray is expressed as:
( ) =  + +</p>
        </sec>
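          <p>A sketch of this inversion with NumPy: given P, a pixel (u, v), and the known world Z coordinate of the detected point, the ray X(μ) = (M⁻¹(μx − p_4), 1)ᵀ is intersected with the plane Z = height. The function and variable names are illustrative.</p>
          <preformat>
# Back-project a pixel onto the plane Z = height (illustrative sketch).
import numpy as np

def ground_position(P, u, v, height):
    """Scene (X, Y) of an image point whose world Z coordinate is known."""
    M, p4 = P[:, :3], P[:, 3]
    M_inv = np.linalg.inv(M)
    x = np.array([u, v, 1.0])
    a = M_inv @ x   # the ray is X(mu) = mu * a - b
    b = M_inv @ p4
    # The Z component of M_inv(mu * x - p4) must equal the detected height:
    mu = (height + b[2]) / a[2]
    X = mu * a - b
    return X[0], X[1]  # X and Y coordinates in the scene
          </preformat>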
        <sec id="sec-2-2-4">
          <title>3.4.2. Loss function</title>
          <p>Within our generative adversarial network (GAN), an
adversarial discriminator D seeks to maximize the
objective function, while the generator G strives to decrease
it, resulting in a zero-sum game. The definition of the
objective function is:</p>
          <p>ℒ (, ) =
= E,[(, )] + E,[(1 − (, (, )))]</p>
          <p>The optimal generator denoted as * is determined by:</p>
          <p>For a finite camera with  = [ |4], the camera * = arg min max ℒ (, )
center is ˜ = −  − 14. Back-projection of an image  
point  intersects the plane at infinity at the point  = We enhance the GAN objective function by adding the L1
(( − 1) , 0) , providing a second point on the ray. loss function, which is a conventional loss. The
generaThe line is represented as: tor’s job is now to provide nearly optimum outputs using
this conventional loss function, in addition to tricking
( ) = ︂(  − 1(1 − 4))︂ tdhuetyd.iTschreimLi1nlaotsosr,, dweinthooteudt achsaℒn1giinsgdethfineeddiassc:riminator’s
Solving for  , considering the  coordinate as the
detected height, allows computation of the  and 
coordinates in the scene.</p>
        </sec>
      </sec>
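          <p>Below is a compressed Keras sketch of this encoder-decoder pattern (strided convolutions, LeakyReLU with slope 0.2, UNet-style skips, tanh output). The exact depth and filter counts of Table 1 are not reproduced here, so the numbers are placeholders.</p>
          <preformat>
# Keras sketch of the generator (placeholder depth and filter counts).
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(h=256, w=256):
    inp = layers.Input(shape=(h, w, 6))  # frames i-1 and i+1, stacked on channels
    # Encoder: strided 2D convolutions with LeakyReLU(0.2).
    d1 = layers.LeakyReLU(0.2)(layers.Conv2D(64, 4, strides=2, padding="same")(inp))
    d2 = layers.LeakyReLU(0.2)(layers.Conv2D(128, 4, strides=2, padding="same")(d1))
    d3 = layers.LeakyReLU(0.2)(layers.Conv2D(256, 4, strides=2, padding="same")(d2))
    # Decoder: strided transposed convolutions with UNet-style skip connections.
    u1 = layers.LeakyReLU(0.2)(layers.Conv2DTranspose(128, 4, strides=2, padding="same")(d3))
    u1 = layers.Concatenate()([u1, d2])
    u2 = layers.LeakyReLU(0.2)(layers.Conv2DTranspose(64, 4, strides=2, padding="same")(u1))
    u2 = layers.Concatenate()([u2, d1])
    # Final layer: tanh keeps outputs inside the [-1, 1] range.
    out = layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="tanh")(u2)
    return tf.keras.Model(inp, out)
          </preformat>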
      <sec id="sec-2-3">
        <title>3.4. Enhancing Frame Rate</title>
        <p>We decided to implement a GAN solution for our
framework, based on the Image2Image work [23]. The
framework is composed of two models: a generator and a
discriminator; the generator takes as input the frames  and
+1 and tries to infer the missing frame , while the
discriminator takes the same input concatenated either
with the real missing frame  or with the generated
one, to classify them as generated or real. The goal is
ℒ1() = E,,[|| − (, )||]
And now our final objective function is:
* = arg min max ℒ (, ) +  ℒ1()
 
Here  serves as a weighting parameter for the ℒ1 loss.</p>
      </sec>
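          <p>The objective above can be written compactly in TensorFlow; this is a sketch using binary cross-entropy for the adversarial term. The λ value of 100 is the weighting commonly used in Image2Image-style setups and is an assumption here, not a value stated in our experiments.</p>
          <preformat>
# Sketch of the GAN + L1 objective (the lambda value is an assumption).
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 100.0  # weight of the L1 term; illustrative value

def generator_loss(disc_fake, fake_frame, real_frame):
    adv = bce(tf.ones_like(disc_fake), disc_fake)         # fool D
    l1 = tf.reduce_mean(tf.abs(real_frame - fake_frame))  # L_L1(G)
    return adv + LAMBDA * l1

def discriminator_loss(disc_real, disc_fake):
    real = bce(tf.ones_like(disc_real), disc_real)   # log D(x, y)
    fake = bce(tf.zeros_like(disc_fake), disc_fake)  # log(1 - D(x, G(x, z)))
    return real + fake
          </preformat>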
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Implementation</title>
      <p>In this section we will describe the implementation
details of our work, starting with the setup and the
preparation of the simulator, the training phase of the neural
network, and the whole system architecture.</p>
      <p>Table 1 (GAN network architecture) reports, for each layer, the activation, the number of filters, the stride, and whether batch normalization is applied: the generator stacks strided convolutional layers with LeakyReLU activations and a final Tanh output, while the discriminator stacks convolutional layers with LeakyReLU activations followed by a fully connected layer.</p>
        <sec id="sec-3-3-1">
          <title>4.1. Language and Libraries</title>
          <p>The whole project was developed using Python v3.8.10. For the detection and tracking part, the following libraries were used:
• OpenCV v4.5.2, compiled from source to enable the CUDA and cuDNN backends, obtaining faster results with YOLO
• NumPy v1.21.4</p>
          <p>For the neural network creation, training, and testing we used:
• TensorFlow v2
• Keras for the creation of the layers
• OpenCV for the pre-processing of the dataset and the data augmentation
• Matplotlib to visualize our results</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>4.2. Net training and testing</title>
          <p>For training our network, we used the EPFL [25] dataset,
which includes multiple scenes of moving pedestrians.</p>
          <p>The training data were extracted by taking 3 frames at a time and adding noise to increase the number of available samples. Each triplet was then saved to a file, with the first and last frames as input to the generator and the middle frame as reference. The dataset was divided into training, validation, and testing sets. The GAN network was trained using the early stopping technique, thus preventing the network from overfitting the data. The loss graph is shown in Figure 1 for the Generator and in Figure 2 for the Discriminator.</p>
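          <p>A minimal sketch of this triplet extraction with OpenCV follows; the additive Gaussian noise level and the in-memory handling are illustrative assumptions.</p>
          <preformat>
# Extract (first, middle, last) frame triplets from a video (sketch).
import cv2
import numpy as np

def extract_triplets(video_path, noise_sigma=5.0):
    cap = cv2.VideoCapture(video_path)
    frames, triplets = [], []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    for i in range(len(frames) - 2):
        triplet = [frames[i], frames[i + 1], frames[i + 2]]
        # Additive Gaussian noise as a simple augmentation.
        noisy = [np.clip(f + np.random.normal(0, noise_sigma, f.shape),
                         0, 255).astype(np.uint8) for f in triplet]
        triplets.append(noisy)
    return triplets
          </preformat>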
          <p>To study the results of our neural network, we computed the SSIM and PSNR values, which are used to measure the similarity between two images and are defined as
SSIM(x, y) = ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))
where μ_x is the average of x; μ_y the average of y; σ_x² the variance of x; σ_y² the variance of y; σ_xy the covariance of x and y; c_1 = (k_1 L)² and c_2 = (k_2 L)² two variables that stabilize the division with a weak denominator; L the dynamic range of the pixel values (typically 2^(#bits per pixel) − 1); and k_1 = 0.01, k_2 = 0.03 by default.
PSNR = 20 · log_10(MAX_I / √MSE)
where MAX_I is the maximum possible pixel value of the image and the mean squared error (MSE) is defined as
MSE = (1 / (M N)) Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} ‖I(i, j) − K(i, j)‖²
Let I represent the original image and K denote the generated image, both of dimensions M × N. The results of our network, in comparison with other methodologies (EpicFlow [26], BeyondMSE [27], and MCNet+RES [28]), are presented in Table 2 (results compared with state-of-the-art networks).</p>
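          <p>Both metrics translate directly into NumPy. The sketch below computes PSNR for 8-bit images and a single-window (global) SSIM following the formula above; windowed SSIM implementations are also available pre-packaged, e.g. in scikit-image.</p>
          <preformat>
# PSNR and a global SSIM between original I and generated K (sketch).
import numpy as np

def psnr(I, K, max_val=255.0):
    mse = np.mean((I.astype(np.float64) - K.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 20.0 * np.log10(max_val / np.sqrt(mse))

def ssim_global(x, y, k1=0.01, k2=0.03, L=255.0):
    """Single-window SSIM following the formula above (no sliding window)."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
          </preformat>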
        </sec>
        <sec id="sec-3-4">
          <title>4.3. System architecture</title>
          <p>This system can also be used with multiple cameras. In that case, each camera receives an image and processes it using YOLO and SORT to extract the bounding-box positions. Each frame is passed to the detection thread and can be stored, to be processed later by the neural network. The points centered in the top part of the bounding boxes generated by the detection threads are passed to the camera models, to obtain the positions of the persons on the plane. Those positions are then merged by searching, for each camera, for the nearest neighbor; in case of a mismatch between the number of people in the clusters, the larger one is chosen. Once a match is found, for each person a dot is drawn on the map at the average position of the matched points. The whole system architecture is represented in Figure 3.</p>
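          <p>As an illustration of the merging step, the sketch below fuses the ground positions from two cameras by nearest-neighbor matching and averages matched points; the distance threshold and the function names are hypothetical, and the generalization to more cameras is left out.</p>
          <preformat>
# Merge per-camera ground positions (illustrative nearest-neighbor fusion).
import numpy as np

def merge_positions(cam_a, cam_b, max_dist=0.5):
    """Average mutually close points from two cameras; keep the rest.

    cam_a, cam_b: lists of (X, Y) positions on the ground plane. If the
    counts disagree, unmatched points of the larger set are kept, as
    described above.
    """
    merged, used_b = [], set()
    for p in cam_a:
        dists = [np.hypot(p[0] - q[0], p[1] - q[1]) for q in cam_b]
        if dists:
            j = int(np.argmin(dists))
            if max_dist >= dists[j] and j not in used_b:
                q = cam_b[j]
                merged.append(((p[0] + q[0]) / 2, (p[1] + q[1]) / 2))
                used_b.add(j)
                continue
        merged.append(p)  # no match: keep the single-camera estimate
    merged.extend(q for j, q in enumerate(cam_b) if j not in used_b)
    return merged
          </preformat>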
        </sec>
      </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>In this section, we show the results obtained.</p>
      <sec id="sec-4-1">
        <title>5.1. Frame Rate and Crowd Counting</title>
        <p>As we can see in Figure 4, the first and the last frames are the input, while the middle one was generated by the Generator of the GAN network.</p>
        <p>After a series of comprehensive tests, our system performed smoothly, properly identifying and counting people in enclosed spaces. With the addition of computer vision algorithms and the advances made possible by our improved neural network, accurate people counting and identification are ensured. The outcomes demonstrate the system's capacity to monitor and control crowd density in confined areas efficiently. For a visual depiction, Figure 5 shows how our model could be used in a real-case scenario using only one camera.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>In summary, our methodology offers a dependable and precise means of detecting and counting human beings in enclosed spaces. By combining GAN-based networks with the effectiveness of lightweight YOLO models, our system not only ensures robustness but also demonstrates the flexibility to operate on systems with limited computational resources. This approach strengthens security protocols and expedites operational workflows, in addition to offering a financially sensible way to implement occupancy restrictions in a variety of scenarios. It is a useful tool in circumstances where precisely counting people is necessary to avoid crowding, making the environment safer and more comfortable for everyone. Our method, which makes use of state-of-the-art AI technologies, is a significant step toward improving space management and guaranteeing adherence to safety rules, ultimately raising the general standard of public areas and facilities.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Dat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vincelli</surname>
          </string-name>
          ,
          <article-title>Supporting impaired people with a following robotic assistant by means of end-to-end visual target navigation and reinforcement learning approaches</article-title>
          , volume
          <volume>3118</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bianco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wajda</surname>
          </string-name>
          ,
          <article-title>Psychoeducative social robots for an healthier lifestyle using artificial intelligence: a case-study</article-title>
          , volume
          <volume>3118</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Magistris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caprari</surname>
          </string-name>
          , G. Castro,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Vision-based holistic scene understanding for context-aware humanrobot interaction 13196 LNAI (</article-title>
          <year>2022</year>
          )
          <fpage>310</fpage>
          -
          <lpage>325</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          -08421-8_
          <fpage>21</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pouget-Abadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warde-Farley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ozair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Generative adversarial nets</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>2672</fpage>
          -
          <lpage>2680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pepe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tedeschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Human attention assessment using a machine learning approach with gan-based data augmentation technique trained using a custom dataset</article-title>
          ,
          <source>OBM Neurobiology 6</source>
          (
          <year>2022</year>
          ). doi:10.
          <string-name>
            <surname>FLAVR</surname>
          </string-name>
          <article-title>: flow-agnostic video representations for 21926/obm</article-title>
          .neurobiol.
          <volume>2204139</volume>
          . fast frame interpolation, CoRR abs/
          <year>2012</year>
          .08512
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ciancarelli</surname>
          </string-name>
          , G. De Magistris,
          <string-name>
            <surname>S. Cognetta</surname>
          </string-name>
          , (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2012</year>
          .08512.
          <string-name>
            <surname>D. Appetito</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Nardi</surname>
          </string-name>
          , A gan ap- arXiv:
          <year>2012</year>
          .08512.
          <article-title>proach for anomaly detection in spacecraft teleme-</article-title>
          [17]
          <string-name>
            <given-names>W.</given-names>
            <surname>Bao</surname>
          </string-name>
          , W.-S. Lai,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gao</surname>
          </string-name>
          , M.-H. tries 531 LNNS (
          <year>2023</year>
          )
          <fpage>393</fpage>
          -
          <lpage>402</lpage>
          . doi:
          <volume>10</volume>
          .1007/ Yang, Depth-aware
          <source>video frame interpolation</source>
          ,
          <year>2019</year>
          .
          <fpage>978</fpage>
          -3-
          <fpage>031</fpage>
          -18050-7_
          <fpage>38</fpage>
          . URL: http://arxiv.org/abs/
          <year>1904</year>
          .00830.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          , G. Galati,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , Address- [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sai</surname>
          </string-name>
          <string-name>
            <surname>Ram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Venkata</surname>
          </string-name>
          ,
          <article-title>A ing vehicle sharing through behavioral analysis: A review on yolov8 and its advancements, in: Insolution to user clustering using recency-frequency-</article-title>
          ternational
          <source>Conference on Data Intelligence and monetary and vehicle relocation based on neigh- Cognitive Informatics</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>529</fpage>
          -
          <lpage>545</lpage>
          . borhood splits,
          <source>Information (Switzerland) 13</source>
          (
          <year>2022</year>
          ). [19]
          <string-name>
            <surname>K. D. Team</surname>
          </string-name>
          , Crowdhuman dataset, https: doi:10.3390/info13110511. //universe.roboflow.com/keio-dba-team/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Marcotrigiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Stingi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fregnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mag-</surname>
          </string-name>
          crowdhuman-nur7g,
          <year>2022</year>
          . arelli, P. Pasquale,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Orsi</surname>
          </string-name>
          , M. T. Mon- [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bewley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Upcroft</surname>
          </string-name>
          , Simtagna,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>An integrated control ple online and realtime tracking, 2016 IEEE Interplan in primary schools: Results of a field investi- national Conference on Image Processing (ICIP) gation on nutritional and hygienic features in the (</article-title>
          <year>2016</year>
          ). URL: http://dx.doi.org/10.1109/ICIP.
          <year>2016</year>
          .
          <article-title>apulia region (southern italy)</article-title>
          ,
          <source>Nutrients</source>
          <volume>13</volume>
          (
          <year>2021</year>
          ). 7533003. doi:
          <volume>10</volume>
          .1109/icip.
          <year>2016</year>
          .
          <volume>7533003</volume>
          . doi:
          <volume>10</volume>
          .3390/nu13093006. [21]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Kalman</surname>
          </string-name>
          , A New Approach to Linear Filtering
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alfarano</surname>
          </string-name>
          , G. De Magistris,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mongelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          , and Prediction Problems, Journal of Basic EngineerJ. Starczewski,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , A novel convmixer trans- ing
          <volume>82</volume>
          (
          <year>1960</year>
          )
          <fpage>35</fpage>
          -
          <lpage>45</lpage>
          . URL: https://doi.org/10.1115/1. former based architecture for violent behavior de- 3662552. doi:
          <volume>10</volume>
          .1115/1.3662552. tection 14126 LNAI (
          <year>2023</year>
          )
          <fpage>3</fpage>
          -
          <lpage>16</lpage>
          . doi:
          <volume>10</volume>
          .1007/ [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>A flexible new technique for camera 978-3</article-title>
          -
          <fpage>031</fpage>
          -42508-
          <issue>0</issue>
          _1. calibration, IEEE Transactions on Pattern Analy-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gabryel</surname>
          </string-name>
          , R. K. Now- sis
          <source>and Machine Intelligence</source>
          <volume>22</volume>
          (
          <year>2000</year>
          )
          <fpage>1330</fpage>
          -
          <lpage>1334</lpage>
          . icki, C. Napoli, E. Tramontana, Can we pro- doi:10.1109/34.888718.
          <article-title>cess 2d images using artificial bee colony?</article-title>
          , vol- [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>Isola</surname>
          </string-name>
          , J.-Y. Zhu,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Efros</surname>
          </string-name>
          , Image-toume
          <volume>9119</volume>
          ,
          <year>2015</year>
          , pp.
          <fpage>660</fpage>
          -
          <lpage>671</lpage>
          . doi:
          <volume>10</volume>
          .1007/ image translation with
          <source>conditional adversarial net978-3-319-19324-3</source>
          _
          <fpage>59</fpage>
          . works,
          <source>in: Proceedings of the IEEE Conference on</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>A comprehensive solution for Computer Vision and Pattern Recognition (CVPR), psychological treatment and therapeutic path plan- 2017. ning based on knowledge base and expertise shar-</article-title>
          [24]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          , U-net: Convoing, volume
          <volume>2472</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>47</lpage>
          .
          <article-title>lutional networks for biomedical image segmen-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>T.-E. Tseng</surname>
            , A.-S. Liu,
            <given-names>P.-H.</given-names>
          </string-name>
          <string-name>
            <surname>Hsiao</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-M. Huang</surname>
          </string-name>
          , L.- tation,
          <source>CoRR abs/1505</source>
          .04597 (
          <year>2015</year>
          ).
          <article-title>URL: http: C. Fu, Real-time people detection and tracking for //arxiv</article-title>
          .org/abs/1505.04597. arXiv:
          <volume>1505</volume>
          .04597.
          <article-title>indoor surveillance using multiple top-view depth</article-title>
          [25]
          <string-name>
            <given-names>F.</given-names>
            <surname>Fleuret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berclaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lengagne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fua</surname>
          </string-name>
          , Multicameras, in: 2014 IEEE/RSJ International Confer
          <article-title>- camera people tracking with a probabilistic occuence on Intelligent Robots and Systems,</article-title>
          <year>2014</year>
          , pp.
          <source>pancy map</source>
          ,
          <source>IEEE Transactions on Pattern Anal4077-4082</source>
          . doi:
          <volume>10</volume>
          .1109/IROS.
          <year>2014</year>
          .
          <volume>6943136</volume>
          .
          <source>ysis and Machine Intelligence</source>
          <volume>30</volume>
          (
          <year>2008</year>
          )
          <fpage>267</fpage>
          -
          <lpage>282</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Al Machot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mahr</surname>
          </string-name>
          , C. Bobda, doi:10.1109/TPAMI.
          <year>2007</year>
          .
          <volume>1174</volume>
          .
          <article-title>Camera-based system for tracking and</article-title>
          position es- [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Revaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Weinzaepfel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Harchaoui</surname>
          </string-name>
          , timation of humans,
          <year>2010</year>
          , pp.
          <fpage>62</fpage>
          -
          <lpage>67</lpage>
          . doi:
          <volume>10</volume>
          .1109/ C. Schmid, Epicflow: Edge-preserving
          <year>interDASIP</year>
          .
          <year>2010</year>
          .
          <volume>5706247</volume>
          .
          <article-title>polation of correspondences for optical flow</article-title>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>M.-A. Mittet</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Landes</surname>
          </string-name>
          , P. Grussenmeyer, arXiv:
          <fpage>1501</fpage>
          .02565.
          <article-title>Localization using rgb-d cameras orthoim-</article-title>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Couprie</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <article-title>LeCun, Deep multi-scale ages</article-title>
          ,
          <source>volume XL-5</source>
          ,
          <year>2014</year>
          . doi:
          <volume>10</volume>
          .5194/ video prediction beyond mean square error,
          <year>2016</year>
          . isprsarchives-XL-5
          <string-name>
            <surname>-</surname>
          </string-name>
          425-
          <year>2014</year>
          . arXiv:
          <volume>1511</volume>
          .
          <fpage>05440</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>B.-D. Choi</surname>
            , J.-W. Han, C.-S. Kim,
            <given-names>S.-J.</given-names>
          </string-name>
          <string-name>
            <surname>Ko</surname>
            , Motion- [28]
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Decompensated frame interpolation using bilateral composing motion and content for natural video motion estimation and adaptive overlapped block sequence prediction</article-title>
          ,
          <year>2018</year>
          . arXiv:
          <volume>1706</volume>
          .08033. motion compensation,
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>17</volume>
          (
          <year>2007</year>
          )
          <fpage>407</fpage>
          -
          <lpage>416</lpage>
          . doi:
          <volume>10</volume>
          .1109/TCSVT.
          <year>2007</year>
          .
          <volume>893835</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kalluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chandraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>