<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Increased frame rate for Crowd Counting in Enclosed Spaces using GANs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adriano Puglisi</string-name>
          <email>puglisi@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Fiani</string-name>
          <email>fiani@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio De Magistris</string-name>
          <email>demagistris@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sapienza University of Rome</institution>
          ,
          <addr-line>via Ariosto 25, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>39</fpage>
      <lpage>45</lpage>
      <abstract>
<p>An efficient computer system for regulating and monitoring the density of people in confined areas is very helpful. It becomes imperative to implement a solution that takes into account the processing power and pre-installed hardware available in these places. Computer vision, in particular the use of regular CCTV cameras augmented by neural networks, solves the problem of precisely counting individuals in enclosed spaces. We describe a control system specifically designed for this goal, maximizing the capabilities of current infrastructure and enhancing neural networks to achieve higher frame rates.</p>
      </abstract>
      <kwd-group>
<kwd>Computer vision</kwd>
        <kwd>Tracking</kwd>
        <kwd>YOLO</kwd>
        <kwd>SORT</kwd>
        <kwd>Generative Adversarial Network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>
        In many enclosed spaces, crowd capacity management is a common challenge due to strict occupancy limits. These limits are critical for safety and regulatory compliance. To address this issue, we propose to leverage CCTV cameras to count people within a confined area more accurately. Using advanced video analytics, our system aims to provide real-time monitoring, helping companies and institutions maintain optimal audience density and ensure a safe environment [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. The main solutions proposed in recent years for indoor human tracking use depth cameras to acquire the position; however, this technology is in some cases expensive or simply not available. Modern computer vision algorithms allow the development of systems that use a simple two-dimensional camera to calculate the depth, and therefore the position, of objects or people in space [<xref ref-type="bibr" rid="ref3">3</xref>]. Since such cameras usually have poor FPS values to save storage space, we combine them with a neural network based on the GAN framework to increase their frame rate. The interpolation of frames through neural networks is an important and complex problem: the datasets used are often very large and the networks very deep. Even though these networks achieve remarkable results, they have a very high computational cost and can often be trained only on expensive or unavailable hardware. For this reason, we chose to bias the neural network using a task-specific dataset, containing only walking pedestrians, to obtain faster convergence of our network. In the last few years, the GAN framework [<xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>] brought a small revolution in the neural networks field. Such a framework can be adapted to a series of different tasks; in particular, it is broadly used for the super-resolution of signals such as images, videos, and audio and, generally speaking, for recreating or reconstructing lost parts of signals. Given the potential of this framework, we decided to implement a GAN for increasing the frame rate of CCTV footage. The whole project tries to exploit techniques that save hardware resources, allowing it to be used in as many environments as possible, including those with medium-low computing power. Security in closed spaces and the tracking of people have an ever greater impact on the management of common spaces and crowded places, and the use of advanced IT systems can allow greater, more effective, and more efficient control. Maintaining a sensible trade-off between the necessary hardware resources and the results obtained was an important point in developing our work.
      </p>
    </sec>
    <sec id="sec-rw">
      <title>2. Related Works</title>
      <sec id="sec-rw-1">
        <title>2.1. Human tracking</title>
        <p>The problem of human tracking and positioning is a well-known subject in computer vision. It can be useful in different situations, such as crowd control, monitoring public areas, security, and so on [<xref ref-type="bibr" rid="ref7 ref8 ref9 ref10 ref11">7, 8, 9, 10, 11</xref>]. We mainly focus on its usage in indoor environments.</p>
        <p>Some research [<xref ref-type="bibr" rid="ref12">12</xref>] uses top-view depth cameras, subtracting the average image, consisting of the floor and the furniture, segmenting the moving objects, and trying to match them with a top-view model of a person; the projection distortion is then corrected, obtaining the position on the plane. Similarly, in [<xref ref-type="bibr" rid="ref13">13</xref>] fisheye top-view cameras segment the moving objects from the static background using an adaptive GMM and correct the projective distortion to find the position. Even though those approaches can be effective, we want to use cameras that are usually positioned on the wall instead of the ceiling. Other papers [<xref ref-type="bibr" rid="ref14">14</xref>] use 3D cameras to obtain an ortho-image to find objects in a scene; while this approach could be extended to our needs, it requires more sophisticated cameras with depth vision, which CCTV cameras are not equipped with.</p>
      </sec>
      <sec id="sec-1-1">
        <title>2.2. Frame-rate increase</title>
<p>The computer vision community has given significant attention to the necessity of increasing the frame rate and, consequently, to video frame interpolation. Many uses for this task exist, including the creation of slow motion and frame recovery for video streaming and gaming. High-frame-rate videos are visually more pleasing to watch because they may avoid typical glitches like temporal jittering and motion blurriness. Several techniques have been used to overcome the issue of getting intermediate frames from a limited collection, including frame interpolation and, more recently, DNNs. In frame interpolation techniques, intermediate frames are generated between the present frames using interpolation, as in the method proposed by Choi et al. [<xref ref-type="bibr" rid="ref15">15</xref>], based on Bilateral Motion Estimation and Adaptive Overlapped Block Motion Compensation. Also, a wide variety of DNN methods were proposed; recently, Flow-Agnostic Video Representations for Fast Frame Interpolation (FLAVR) [<xref ref-type="bibr" rid="ref16">16</xref>] addressed the problem using an autoencoder based on 3D space-time convolutions, enabling end-to-end learning and inference. With no extra inputs needed in the form of depth maps or optical flow, this technique effectively learns to reason about non-linear movements, complicated occlusions, and temporal abstractions, leading to enhanced performance. Depth-Aware Video Frame Interpolation [17] is another notable DNN technique: it synthesizes intermediate flows that preferentially sample items closer to the viewer by introducing a depth-aware flow projection layer. To synthesize the output frame, this approach uses the optical flow and local interpolation kernels to warp input frames, depth maps, and contextual features; hierarchical features are utilized to extract contextual information from nearby pixels.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Proposed method</title>
      <p>In this section, we describe the methods and the algorithms used to analyze the images, detect and track people inside the scene, localize them on the plane, and increase the frame rate.</p>
      <sec id="sec-2-1">
        <title>3.3. Spatial Localization</title>
      </sec>
      <sec id="sec-2-2">
        <title>3.1. Detection</title>
        <p>YOLO [18] is the neural network framework we used for
detecting persons in the scene, it is extremely popular and</p>
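        <p>As an illustration of this detection step, the following minimal sketch queries a pre-trained YOLOv8 model for person detections through the ultralytics Python package; the weight file name and the confidence threshold are illustrative assumptions, not the exact values used in our experiments.</p>
        <preformat>
# Minimal person-detection sketch (assumes the ultralytics package).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # illustrative: any of the tested v8 models

def detect_people(frame, conf=0.4):
    """Return [x1, y1, x2, y2, score] boxes for the 'person' class."""
    boxes = []
    for result in model(frame, imgsz=416, conf=conf, verbose=False):
        for box in result.boxes:
            if int(box.cls) == 0:  # COCO class 0 is 'person'
                x1, y1, x2, y2 = box.xyxy[0].tolist()
                boxes.append([x1, y1, x2, y2, float(box.conf)])
    return boxes
        </preformat>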
        <sec id="sec-2-2-1">
          <title>3.3.1. Camera Model</title>
          <p>The finite projective camera, denoted as  , is
characterized by its intrinsic and extrinsic parameters, given
by:
 = [ | −  ˜] = [| − ˜]
Here,  describes the orientation of the camera and ˜
is the world position of the camera center.  is the
calibration matrix and since the resolution is the same in
both the x and y directions, the calibration matrix can be
defined as:
to train them at the same time, improving their
performances to obtain a good model that generates the missing
frames.</p>
        </sec>
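        <p>To make the association step concrete, here is a simplified sketch of the IoU-plus-Hungarian matching described above, using scipy.optimize.linear_sum_assignment; it deliberately omits the Kalman prediction and update machinery, so it is an illustration rather than the SORT implementation used in the project.</p>
        <preformat>
# Simplified IoU + Hungarian association sketch (not the full SORT code).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_min=0.3):
    """Match predicted track boxes to detections; return matches and leftovers."""
    if len(tracks) == 0 or len(detections) == 0:
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.zeros((len(tracks), len(detections)))
    for t, trk in enumerate(tracks):
        for d, det in enumerate(detections):
            cost[t, d] = -iou(trk, det)  # Hungarian minimizes cost
    rows, cols = linear_sum_assignment(cost)
    matches = [(t, d) for t, d in zip(rows, cols) if -cost[t, d] >= iou_min]
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_t = [t for t in range(len(tracks)) if t not in matched_t]
    unmatched_d = [d for d in range(len(detections)) if d not in matched_d]
    return matches, unmatched_t, unmatched_d
        </preformat>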
        <sec id="sec-2-2-2">
          <title>3.4.1. Network architecture</title>
          <p>
            The generator takes as input two pictures of size ( ×
 × 3). To minimize its dimensions, the encoder
employs two-dimensional convolutional layers with a stride
with  being the focal length andcan be obtained using of two using a UNet [24]. LeakyReLU is the activation
the formula: function, and its slope is 0.2. On the other hand, the
 decoder uses the LeakyReLU activation function with a
 = 2 * 2(   ) slope of 0.2 and consists of several 2-dimensional
con2 volutional layers with a stride of 2. The ℎ activation
Where   is the field of view. Typically, obtain- function is used in the final output layer to make sure that
ing these parameters requires camera calibration using the outputs are inside the [
            <xref ref-type="bibr" rid="ref1">− 1, 1</xref>
            ] range. The same input
methods like Zhang’s method [22]. However, in a simu- as the generator, concatenated with the produced output
lator environment, all parameters can be derived from  or the genuine frame , is fed into the
discrimithe properties of the involved objects. nator, which is built like a CNN. Table 1 summarizes the
architecture.
          </p>
        </sec>
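          <p>In a simulator, K can therefore be assembled directly. The following minimal sketch, assuming a known horizontal field of view and image size (the values below are illustrative), builds the calibration matrix with NumPy following the formula above.</p>
          <preformat>
# Build the calibration matrix K from the field of view (a sketch;
# in a real deployment f would come from calibration, e.g. Zhang [22]).
import numpy as np

def calibration_matrix(width, height, fov_rad):
    """K for a pinhole camera with equal focal length on both axes."""
    f = width / (2.0 * np.tan(fov_rad / 2.0))  # f = w / (2 tan(FOV/2))
    return np.array([[f, 0.0, width / 2.0],
                     [0.0, f, height / 2.0],
                     [0.0, 0.0, 1.0]])

K = calibration_matrix(1280, 720, np.deg2rad(90.0))  # illustrative values
          </preformat>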
        <sec id="sec-2-2-3">
          <title>3.3.2. Inverting Projective Transformation</title>
          <p>Summarizing, the 3 × 4 camera matrix 
transforms image coordinates (, , 1) to scene coordinates
(, , , 1) . To obtain the scene coordinates from
image coordinates, we aim to invert  , considering that
perspective projection is not injective. Assuming knowledge
of the distance from the ground (height of the person),
we utilize the pseudo-inverse  + of  . Two points on
the back-projected ray are identified: the camera center
 and the point  +. The ray is expressed as:
( ) =  + +</p>
        </sec>
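          <p>A sketch of this inversion with NumPy: given P, a pixel (u, v), and the known world Z coordinate of the detected point, the ray X(μ) = (M⁻¹(μx − p_4), 1)ᵀ is intersected with the plane Z = height. The function and variable names are illustrative.</p>
          <preformat>
# Back-project a pixel onto the plane Z = height (illustrative sketch).
import numpy as np

def ground_position(P, u, v, height):
    """Scene (X, Y) of an image point whose world Z coordinate is known."""
    M, p4 = P[:, :3], P[:, 3]
    M_inv = np.linalg.inv(M)
    x = np.array([u, v, 1.0])
    a = M_inv @ x   # the ray is X(mu) = mu * a - b
    b = M_inv @ p4
    # The Z component of M_inv(mu * x - p4) must equal the detected height:
    mu = (height + b[2]) / a[2]
    X = mu * a - b
    return X[0], X[1]  # X and Y coordinates in the scene
          </preformat>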
        <sec id="sec-2-2-4">
          <title>3.4.2. Loss function</title>
          <p>Within our generative adversarial network (GAN), an
adversarial discriminator D seeks to maximize the
objective function, while the generator G strives to decrease
it, resulting in a zero-sum game. The definition of the
objective function is:</p>
          <p>ℒ (, ) =
= E,[(, )] + E,[(1 − (, (, )))]</p>
          <p>The optimal generator denoted as * is determined by:</p>
          <p>For a finite camera with  = [ |4], the camera * = arg min max ℒ (, )
center is ˜ = −  − 14. Back-projection of an image  
point  intersects the plane at infinity at the point  = We enhance the GAN objective function by adding the L1
(( − 1) , 0) , providing a second point on the ray. loss function, which is a conventional loss. The
generaThe line is represented as: tor’s job is now to provide nearly optimum outputs using
this conventional loss function, in addition to tricking
( ) = ︂(  − 1(1 − 4))︂ tdhuetyd.iTschreimLi1nlaotsosr,, dweinthooteudt achsaℒn1giinsgdethfineeddiassc:riminator’s
Solving for  , considering the  coordinate as the
detected height, allows computation of the  and 
coordinates in the scene.</p>
        </sec>
      </sec>
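          <p>Below is a compressed Keras sketch of this encoder-decoder pattern (strided convolutions, LeakyReLU with slope 0.2, UNet-style skips, tanh output). The exact depth and filter counts of Table 1 are not reproduced here, so the numbers are placeholders.</p>
          <preformat>
# Keras sketch of the generator (placeholder depth and filter counts).
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(h=256, w=256):
    inp = layers.Input(shape=(h, w, 6))  # frames i-1 and i+1, stacked on channels
    # Encoder: strided 2D convolutions with LeakyReLU(0.2).
    d1 = layers.LeakyReLU(0.2)(layers.Conv2D(64, 4, strides=2, padding="same")(inp))
    d2 = layers.LeakyReLU(0.2)(layers.Conv2D(128, 4, strides=2, padding="same")(d1))
    d3 = layers.LeakyReLU(0.2)(layers.Conv2D(256, 4, strides=2, padding="same")(d2))
    # Decoder: strided transposed convolutions with UNet-style skip connections.
    u1 = layers.LeakyReLU(0.2)(layers.Conv2DTranspose(128, 4, strides=2, padding="same")(d3))
    u1 = layers.Concatenate()([u1, d2])
    u2 = layers.LeakyReLU(0.2)(layers.Conv2DTranspose(64, 4, strides=2, padding="same")(u1))
    u2 = layers.Concatenate()([u2, d1])
    # Final layer: tanh keeps outputs inside the [-1, 1] range.
    out = layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="tanh")(u2)
    return tf.keras.Model(inp, out)
          </preformat>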
      <sec id="sec-2-3">
        <title>3.4. Enhancing Frame Rate</title>
        <p>We decided to implement a GAN solution for our
framework, based on the Image2Image work [23]. The
framework is composed of two models: a generator and a
discriminator; the generator takes as input the frames  and
+1 and tries to infer the missing frame , while the
discriminator takes the same input concatenated either
with the real missing frame  or with the generated
one, to classify them as generated or real. The goal is
ℒ1() = E,,[|| − (, )||]
And now our final objective function is:
* = arg min max ℒ (, ) +  ℒ1()
 
Here  serves as a weighting parameter for the ℒ1 loss.</p>
      </sec>
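          <p>The objective above can be written compactly in TensorFlow; this is a sketch using binary cross-entropy for the adversarial term. The λ value of 100 is the weighting commonly used in Image2Image-style setups and is an assumption here, not a value stated in our experiments.</p>
          <preformat>
# Sketch of the GAN + L1 objective (the lambda value is an assumption).
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 100.0  # weight of the L1 term; illustrative value

def generator_loss(disc_fake, fake_frame, real_frame):
    adv = bce(tf.ones_like(disc_fake), disc_fake)         # fool D
    l1 = tf.reduce_mean(tf.abs(real_frame - fake_frame))  # L_L1(G)
    return adv + LAMBDA * l1

def discriminator_loss(disc_real, disc_fake):
    real = bce(tf.ones_like(disc_real), disc_real)   # log D(x, y)
    fake = bce(tf.zeros_like(disc_fake), disc_fake)  # log(1 - D(x, G(x, z)))
    return real + fake
          </preformat>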
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Implementation</title>
      <p>In this section we will describe the implementation
details of our work, starting with the setup and the
preparation of the simulator, the training phase of the neural
network, and the whole system architecture.</p>
      <p>Table 1 (GAN network architecture) reports, for each layer, the activation, the number of filters, the stride, and whether batch normalization is applied: the generator stacks strided convolutional layers with LeakyReLU activations and a final Tanh output, while the discriminator stacks convolutional layers with LeakyReLU activations followed by a fully connected layer.</p>
        <sec id="sec-3-3-1">
          <title>4.1. Language and Libraries</title>
          <p>The whole project was developed using Python v3.8.10. For the detection and tracking part, the following libraries were used:
• OpenCV v4.5.2, compiled from source to enable the CUDA and cuDNN backends, obtaining faster results with YOLO
• NumPy v1.21.4</p>
          <p>For the neural network creation, training, and testing we used:
• TensorFlow v2
• Keras for the creation of the layers
• OpenCV for the pre-processing of the dataset and the data augmentation
• Matplotlib to visualize our results</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>4.2. Net training and testing</title>
          <p>For training our network, we used the EPFL [25] dataset,
which includes multiple scenes of moving pedestrians.</p>
          <p>The training data were extracted by taking 3 frames at a time and adding noise to increase the number of available samples. Each triplet was then saved to a file, with the first and last frames as input to the generator and the middle frame as reference. The dataset was divided into training, validation, and testing sets. The GAN network was trained using the early stopping technique, thus preventing the network from overfitting the data. The loss graph is shown in Figure 1 for the Generator and in Figure 2 for the Discriminator.</p>
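          <p>A minimal sketch of this triplet extraction with OpenCV follows; the additive Gaussian noise level and the in-memory handling are illustrative assumptions.</p>
          <preformat>
# Extract (first, middle, last) frame triplets from a video (sketch).
import cv2
import numpy as np

def extract_triplets(video_path, noise_sigma=5.0):
    cap = cv2.VideoCapture(video_path)
    frames, triplets = [], []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    for i in range(len(frames) - 2):
        triplet = [frames[i], frames[i + 1], frames[i + 2]]
        # Additive Gaussian noise as a simple augmentation.
        noisy = [np.clip(f + np.random.normal(0, noise_sigma, f.shape),
                         0, 255).astype(np.uint8) for f in triplet]
        triplets.append(noisy)
    return triplets
          </preformat>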
          <p>To study the results of our neural network, we computed the SSIM and PSNR values, which are used to measure the similarity between two images and are defined as
SSIM(x, y) = ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))
where μ_x is the average of x; μ_y the average of y; σ_x² the variance of x; σ_y² the variance of y; σ_xy the covariance of x and y; c_1 = (k_1 L)² and c_2 = (k_2 L)² two variables that stabilize the division with a weak denominator; L the dynamic range of the pixel values (typically 2^(#bits per pixel) − 1); and k_1 = 0.01, k_2 = 0.03 by default.
PSNR = 20 · log_10(MAX_I / √MSE)
where MAX_I is the maximum possible pixel value of the image and the mean squared error (MSE) is defined as
MSE = (1 / (M N)) Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} ‖I(i, j) − K(i, j)‖²
Let I represent the original image and K denote the generated image, both of dimensions M × N. The results of our network, in comparison with other methodologies (EpicFlow [26], BeyondMSE [27], and MCNet+RES [28]), are presented in Table 2 (results compared with state-of-the-art networks).</p>
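          <p>Both metrics translate directly into NumPy. The sketch below computes PSNR for 8-bit images and a single-window (global) SSIM following the formula above; windowed SSIM implementations are also available pre-packaged, e.g. in scikit-image.</p>
          <preformat>
# PSNR and a global SSIM between original I and generated K (sketch).
import numpy as np

def psnr(I, K, max_val=255.0):
    mse = np.mean((I.astype(np.float64) - K.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 20.0 * np.log10(max_val / np.sqrt(mse))

def ssim_global(x, y, k1=0.01, k2=0.03, L=255.0):
    """Single-window SSIM following the formula above (no sliding window)."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
          </preformat>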
        </sec>
        <sec id="sec-3-4">
          <title>4.3. System architecture</title>
          <p>This system can also be used with multiple cameras. In that case, each camera receives an image and processes it using YOLO and SORT to extract the bounding-box positions. Each frame is passed to the detection thread and can be stored, to be processed later by the neural network. The points centered in the top part of the bounding boxes generated by the detection threads are passed to the camera models, to obtain the positions of the persons on the plane. Those positions are then merged by searching, for each camera, for the nearest neighbor; in case of a mismatch between the number of people in the clusters, the larger one is chosen. Once a match is found, for each person a dot is drawn on the map at the average position of the matched points. The whole system architecture is represented in Figure 3.</p>
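          <p>As an illustration of the merging step, the sketch below fuses the ground positions from two cameras by nearest-neighbor matching and averages matched points; the distance threshold and the function names are hypothetical, and the generalization to more cameras is left out.</p>
          <preformat>
# Merge per-camera ground positions (illustrative nearest-neighbor fusion).
import numpy as np

def merge_positions(cam_a, cam_b, max_dist=0.5):
    """Average mutually close points from two cameras; keep the rest.

    cam_a, cam_b: lists of (X, Y) positions on the ground plane. If the
    counts disagree, unmatched points of the larger set are kept, as
    described above.
    """
    merged, used_b = [], set()
    for p in cam_a:
        dists = [np.hypot(p[0] - q[0], p[1] - q[1]) for q in cam_b]
        if dists:
            j = int(np.argmin(dists))
            if max_dist >= dists[j] and j not in used_b:
                q = cam_b[j]
                merged.append(((p[0] + q[0]) / 2, (p[1] + q[1]) / 2))
                used_b.add(j)
                continue
        merged.append(p)  # no match: keep the single-camera estimate
    merged.extend(q for j, q in enumerate(cam_b) if j not in used_b)
    return merged
          </preformat>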
        </sec>
      </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>In this section, we show the results obtained.</p>
      <sec id="sec-4-1">
        <title>5.1. Frame Rate and Crowd Counting</title>
        <p>As we can see in Figure 4, the first and the last frames are the input, while the middle one was generated by the Generator of the GAN network.</p>
        <p>After a series of comprehensive tests, our system performed smoothly, properly identifying and counting people in enclosed spaces. With the addition of computer vision algorithms and the advances made possible by our improved neural network, accurate people counting and identification are ensured. The outcomes demonstrate the system's capacity to monitor and control crowd density in confined areas efficiently. For a visual depiction, Figure 5 shows how our model could be used in a real-case scenario using only one camera.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>In summary, our methodology offers a dependable and precise means of detecting and counting human beings in enclosed spaces. By combining GAN-based networks with the effectiveness of lightweight YOLO models, our system not only ensures robustness but also demonstrates the flexibility to operate on systems with limited computational resources. This approach strengthens security protocols and expedites operational workflows, in addition to offering a financially sensible way to implement occupancy restrictions in a variety of scenarios. It is a useful tool in circumstances where precisely counting people is necessary to avoid crowding, making the environment safer and more comfortable for everyone. Our method, which makes use of state-of-the-art AI technologies, is a significant step toward improving space management and guaranteeing adherence to safety rules, ultimately raising the general standard of public areas and facilities.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Dat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vincelli</surname>
          </string-name>
          ,
          <article-title>Supporting impaired people with a following robotic assistant by means of end-to-end visual target navigation and reinforcement learning approaches</article-title>
          , volume
          <volume>3118</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bianco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wajda</surname>
          </string-name>
          ,
          <article-title>Psychoeducative social robots for an healthier lifestyle using artificial intelligence: a case-study</article-title>
          , volume
          <volume>3118</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Magistris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caprari</surname>
          </string-name>
          , G. Castro,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Vision-based holistic scene understanding for context-aware humanrobot interaction 13196 LNAI (</article-title>
          <year>2022</year>
          )
          <fpage>310</fpage>
          -
          <lpage>325</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          -08421-8_
          <fpage>21</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pouget-Abadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warde-Farley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ozair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Generative adversarial nets</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>2672</fpage>
          -
          <lpage>2680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pepe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tedeschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Human attention assessment using a machine learning approach with gan-based data augmentation technique trained using a custom dataset</article-title>
          ,
          <source>OBM Neurobiology 6</source>
          (
          <year>2022</year>
          ). doi:10.
          <string-name>
            <surname>FLAVR</surname>
          </string-name>
          <article-title>: flow-agnostic video representations for 21926/obm</article-title>
          .neurobiol.
          <volume>2204139</volume>
          . fast frame interpolation, CoRR abs/
          <year>2012</year>
          .08512
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ciancarelli</surname>
          </string-name>
          , G. De Magistris,
          <string-name>
            <surname>S. Cognetta</surname>
          </string-name>
          , (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2012</year>
          .08512.
          <string-name>
            <surname>D. Appetito</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Nardi</surname>
          </string-name>
          , A gan ap- arXiv:
          <year>2012</year>
          .08512.
          <article-title>proach for anomaly detection in spacecraft teleme-</article-title>
          [17]
          <string-name>
            <given-names>W.</given-names>
            <surname>Bao</surname>
          </string-name>
          , W.-S. Lai,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gao</surname>
          </string-name>
          , M.-H. tries 531 LNNS (
          <year>2023</year>
          )
          <fpage>393</fpage>
          -
          <lpage>402</lpage>
          . doi:
          <volume>10</volume>
          .1007/ Yang, Depth-aware
          <source>video frame interpolation</source>
          ,
          <year>2019</year>
          .
          <fpage>978</fpage>
          -3-
          <fpage>031</fpage>
          -18050-7_
          <fpage>38</fpage>
          . URL: http://arxiv.org/abs/
          <year>1904</year>
          .00830.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          , G. Galati,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , Address- [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sai</surname>
          </string-name>
          <string-name>
            <surname>Ram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Venkata</surname>
          </string-name>
          ,
          <article-title>A ing vehicle sharing through behavioral analysis: A review on yolov8 and its advancements, in: Insolution to user clustering using recency-frequency-</article-title>
          ternational
          <source>Conference on Data Intelligence and monetary and vehicle relocation based on neigh- Cognitive Informatics</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>529</fpage>
          -
          <lpage>545</lpage>
          . borhood splits,
          <source>Information (Switzerland) 13</source>
          (
          <year>2022</year>
          ). [19]
          <string-name>
            <surname>K. D. Team</surname>
          </string-name>
          , Crowdhuman dataset, https: doi:10.3390/info13110511. //universe.roboflow.com/keio-dba-team/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Marcotrigiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Stingi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fregnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mag-</surname>
          </string-name>
          crowdhuman-nur7g,
          <year>2022</year>
          . arelli, P. Pasquale,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Orsi</surname>
          </string-name>
          , M. T. Mon- [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bewley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Upcroft</surname>
          </string-name>
          , Simtagna,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>An integrated control ple online and realtime tracking, 2016 IEEE Interplan in primary schools: Results of a field investi- national Conference on Image Processing (ICIP) gation on nutritional and hygienic features in the (</article-title>
          <year>2016</year>
          ). URL: http://dx.doi.org/10.1109/ICIP.
          <year>2016</year>
          .
          <article-title>apulia region (southern italy)</article-title>
          ,
          <source>Nutrients</source>
          <volume>13</volume>
          (
          <year>2021</year>
          ). 7533003. doi:
          <volume>10</volume>
          .1109/icip.
          <year>2016</year>
          .
          <volume>7533003</volume>
          . doi:
          <volume>10</volume>
          .3390/nu13093006. [21]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Kalman</surname>
          </string-name>
          , A New Approach to Linear Filtering
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alfarano</surname>
          </string-name>
          , G. De Magistris,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mongelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          , and Prediction Problems, Journal of Basic EngineerJ. Starczewski,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , A novel convmixer trans- ing
          <volume>82</volume>
          (
          <year>1960</year>
          )
          <fpage>35</fpage>
          -
          <lpage>45</lpage>
          . URL: https://doi.org/10.1115/1. former based architecture for violent behavior de- 3662552. doi:
          <volume>10</volume>
          .1115/1.3662552. tection 14126 LNAI (
          <year>2023</year>
          )
          <fpage>3</fpage>
          -
          <lpage>16</lpage>
          . doi:
          <volume>10</volume>
          .1007/ [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>A flexible new technique for camera 978-3</article-title>
          -
          <fpage>031</fpage>
          -42508-
          <issue>0</issue>
          _1. calibration, IEEE Transactions on Pattern Analy-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gabryel</surname>
          </string-name>
          , R. K. Now- sis
          <source>and Machine Intelligence</source>
          <volume>22</volume>
          (
          <year>2000</year>
          )
          <fpage>1330</fpage>
          -
          <lpage>1334</lpage>
          . icki, C. Napoli, E. Tramontana, Can we pro- doi:10.1109/34.888718.
          <article-title>cess 2d images using artificial bee colony?</article-title>
          , vol- [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>Isola</surname>
          </string-name>
          , J.-Y. Zhu,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Efros</surname>
          </string-name>
          , Image-toume
          <volume>9119</volume>
          ,
          <year>2015</year>
          , pp.
          <fpage>660</fpage>
          -
          <lpage>671</lpage>
          . doi:
          <volume>10</volume>
          .1007/ image translation with
          <source>conditional adversarial net978-3-319-19324-3</source>
          _
          <fpage>59</fpage>
          . works,
          <source>in: Proceedings of the IEEE Conference on</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>A comprehensive solution for Computer Vision and Pattern Recognition (CVPR), psychological treatment and therapeutic path plan- 2017. ning based on knowledge base and expertise shar-</article-title>
          [24]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          , U-net: Convoing, volume
          <volume>2472</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>47</lpage>
          .
          <article-title>lutional networks for biomedical image segmen-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>T.-E. Tseng</surname>
            , A.-S. Liu,
            <given-names>P.-H.</given-names>
          </string-name>
          <string-name>
            <surname>Hsiao</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-M. Huang</surname>
          </string-name>
          , L.- tation,
          <source>CoRR abs/1505</source>
          .04597 (
          <year>2015</year>
          ).
          <article-title>URL: http: C. Fu, Real-time people detection and tracking for //arxiv</article-title>
          .org/abs/1505.04597. arXiv:
          <volume>1505</volume>
          .04597.
          <article-title>indoor surveillance using multiple top-view depth</article-title>
          [25]
          <string-name>
            <given-names>F.</given-names>
            <surname>Fleuret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berclaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lengagne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fua</surname>
          </string-name>
          , Multicameras, in: 2014 IEEE/RSJ International Confer
          <article-title>- camera people tracking with a probabilistic occuence on Intelligent Robots and Systems,</article-title>
          <year>2014</year>
          , pp.
          <source>pancy map</source>
          ,
          <source>IEEE Transactions on Pattern Anal4077-4082</source>
          . doi:
          <volume>10</volume>
          .1109/IROS.
          <year>2014</year>
          .
          <volume>6943136</volume>
          .
          <source>ysis and Machine Intelligence</source>
          <volume>30</volume>
          (
          <year>2008</year>
          )
          <fpage>267</fpage>
          -
          <lpage>282</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Al Machot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mahr</surname>
          </string-name>
          , C. Bobda, doi:10.1109/TPAMI.
          <year>2007</year>
          .
          <volume>1174</volume>
          .
          <article-title>Camera-based system for tracking and</article-title>
          position es- [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Revaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Weinzaepfel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Harchaoui</surname>
          </string-name>
          , timation of humans,
          <year>2010</year>
          , pp.
          <fpage>62</fpage>
          -
          <lpage>67</lpage>
          . doi:
          <volume>10</volume>
          .1109/ C. Schmid, Epicflow: Edge-preserving
          <year>interDASIP</year>
          .
          <year>2010</year>
          .
          <volume>5706247</volume>
          .
          <article-title>polation of correspondences for optical flow</article-title>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>M.-A. Mittet</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Landes</surname>
          </string-name>
          , P. Grussenmeyer, arXiv:
          <fpage>1501</fpage>
          .02565.
          <article-title>Localization using rgb-d cameras orthoim-</article-title>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Couprie</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <article-title>LeCun, Deep multi-scale ages</article-title>
          ,
          <source>volume XL-5</source>
          ,
          <year>2014</year>
          . doi:
          <volume>10</volume>
          .5194/ video prediction beyond mean square error,
          <year>2016</year>
          . isprsarchives-XL-5
          <string-name>
            <surname>-</surname>
          </string-name>
          425-
          <year>2014</year>
          . arXiv:
          <volume>1511</volume>
          .
          <fpage>05440</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>B.-D. Choi</surname>
            , J.-W. Han, C.-S. Kim,
            <given-names>S.-J.</given-names>
          </string-name>
          <string-name>
            <surname>Ko</surname>
            , Motion- [28]
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Decompensated frame interpolation using bilateral composing motion and content for natural video motion estimation and adaptive overlapped block sequence prediction</article-title>
          ,
          <year>2018</year>
          . arXiv:
          <volume>1706</volume>
          .08033. motion compensation,
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>17</volume>
          (
          <year>2007</year>
          )
          <fpage>407</fpage>
          -
          <lpage>416</lpage>
          . doi:
          <volume>10</volume>
          .1109/TCSVT.
          <year>2007</year>
          .
          <volume>893835</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kalluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chandraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>