<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Hallway Tracker: Hospital Contact Tracing During the COVID-19 Pandemic</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christian Marinoni</string-name>
          <email>christian.marinoni@uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Ponzi</string-name>
          <email>ponzi@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Comminiello</string-name>
          <email>danilo.comminiello@uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer, Control and Management Engineering, Sapienza University of Rome</institution>
          ,
          <addr-line>Via Ariosto 25, Roma, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dpt. of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome</institution>
          ,
          <addr-line>Via Eudossiana 18, Roma, 00184</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute for Systems Analysis and Computer Science, Italian National Research Council</institution>
          ,
          <addr-line>Via dei Taurini 19, Roma, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>51</fpage>
      <lpage>61</lpage>
      <abstract>
        <p>During the COVID-19 pandemic, the use of a people tracking system could have been crucial, particularly in sensitive environments such as hospitals. DPPL Hallway Tracker is a framework that uses security camera footage to determine which rooms in a corridor a person has entered. It generates a database containing all the people identified and allows quick identification of potential cases of infection based on the time spent in a room and its maximum capacity. DPPL Hallway Tracker is structured in two phases: detection and re-identification. In the first phase, it exploits Mask R-CNN to identify people and room doors. In the second one, it uses the deep association metric model from DeepSORT to re-identify a person as they leave a room.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Managing a pandemic has proved to be a difficult challenge despite the technological developments of the past decades. Containment measures based on restrictions on personal mobility (such as lockdowns) have proved to be very effective for infection containment [<xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>]. However, these turn out to be short-term solutions that are not extendable throughout the whole virus’s life cycle.</p>
      <sec id="sec-2-1">
        <p>As with Covid-19, the presence of a potentially infected individual in a closed environment is a central problem, and the risk of contagion increases with exposure time. Face masks, in combination with good room ventilation, help to reduce the risk of transmission. However, this is not sufficient to eliminate all the risks. Tracking operations are required to ensure the identification of the chain of contagion. Tracking turns out to be even more essential in public settings, such as public offices and hospitals [<xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>].</p>
        <p>Some countries, such as Italy and Germany, used specific tracking apps (respectively, Immuni and CoronaWarn-App) for a Bluetooth-based contact estimation [<xref ref-type="bibr" rid="ref6">6</xref>]. These solutions, although potentially effective, have shown evident limitations, such as low diffusion in the population, constraints on the version of the smartphone OS, poor estimation of distances and related false positives. While they may be effective in the short term, since they are employable on a big scale, other solutions prove [...] contacts and the estimation of the relative risk of contagion, [...] the level of saturation of the room given its maximum capacity.</p>
        <p>DPPL Hallway Tracker [...] people’s movements, exploiting the characterization of [...] features to determine a distribution of the positions [...]. People are detected and segmented using Mask R-CNN; then, their mask is passed to a Re-ID network to obtain an identifier (an array) that “describes” the way they appear in the scene. The descriptors are finally compared with those of the people already known to verify the person’s identity. Another contribution, in addition to the general approach, [...] networks, built from scratch or starting from existing ones.</p>
        <p>DPPL Hallway Tracker appears to be very effective in tracking people entering and leaving rooms facing a corridor. The use of appearance features turns out to be sufficiently robust to allow correct identification, even if it is less effective in recognizing people who reappear in the corridor without leaving a room.</p>
        <p>This report describes the project’s workflow, from the description of the datasets to the analysis of the results.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2. Related works</title>
        <p>[...] the states of multiple people entering the same room collapse at the same value, thus providing no valuable information for the ID attribution when a person leaves the room. On the contrary, the use of a re-identification network based on appearance features in DeepSORT is functional for the current application and is therefore also implemented in this project.</p>
        <p>In today’s literature, to the best of our knowledge, there are no studies aimed at analyzing the specific context of tracking and re-identifying people who enter and leave rooms. Pedestrians on streets or people moving around indoors are usually the focus of most approaches. Other works specialize in counting people in some particular environments. For example, Rabaud and Belongie [<xref ref-type="bibr" rid="ref13">13</xref>] investigate the possibility of counting people passing through crowded environments; [<xref ref-type="bibr" rid="ref14">14</xref>], [<xref ref-type="bibr" rid="ref15">15</xref>], [<xref ref-type="bibr" rid="ref16">16</xref>] focus on counting passengers getting in/out of a bus and [<xref ref-type="bibr" rid="ref17">17</xref>] of a metropolitan train; [<xref ref-type="bibr" rid="ref18">18</xref>] counts people walking through a corridor or a door, without taking their identities into account.</p>
        <p>The absence of a similar application makes the comparison of the implementation proposed in this project with a baseline more complex. Therefore, in the following Sections, the individual modules that constitute it are compared with corresponding existing solutions, in an attempt to offer an objective yardstick on the choices made.</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.1.1. Door detection</title>
        <p>[...] More specifically, the framework employs Mask R-CNN [<xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>], which derives from Faster R-CNN [<xref ref-type="bibr" rid="ref11">11, 21</xref>] (in turn, one of the evolutions of the original R-CNN [22]) but adds a third parallel head used to generate the masks. It also introduces further improvements, like the support for pixel-to-pixel alignment between network inputs and outputs (ROI-Align). Figure 1 shows the different stages that characterize the network.</p>
        <p>Initially, the image is passed as input to a convolution-based Feature Pyramid Network [23], which has the task of extracting meaningful information from differently-sized feature maps. An object can appear in the foreground (and therefore very large in the image) or further away from the camera; hence, this pyramidal structure facilitates its detection. The features thus extracted are passed to the Region Proposal Network (RPN), which produces several Regions Of Interest (ROI), each with its bounding box. At this point, the first-mentioned ROI-Align is applied and its result is passed to the second stage of the network, in which a series of fully connected layers allows refining the position of the bounding box, the class of the object it contains and its mask.</p>
        <p>Figure 1: General scheme of the Mask R-CNN framework. The layers indicated with the letters C and P are the convolutional layers that represent the backbone network. The classic pyramid architecture improves the detection of objects of various sizes.</p>
        <p>To provide door detection, Mask R-CNN [19] was fine-tuned with a dedicated dataset, assembled for the purpose. It includes a selection of 2773 out of 3000 RGB images of the DeepDoors2 dataset [24], which is freely available online. These images represent one or multiple doors in different outdoor and indoor scenarios, which do not necessarily correspond to a corridor: in fact, the large majority of them represent doors from the front. They also include obstacles that partially occlude part of the doors.</p>
        <p>The annotations in the DeepDoors2 dataset are provided as additional images, each with a black background and differently coloured masks for the doors. Being interested in this project more in the portion of space occupied by the door than in the profile of the door itself, all the images are re-masked to segment exclusively the door casing. Hence, almost all images have quadrilateral-shaped masks (thus with four vertices only). Moreover, the generated annotation files are no longer encoded as images like in the original DeepDoors2 dataset; instead, they are fully compatible with the COCO dataset specifications [25]. In fact, the annotation files are JSON files containing: (1) references to all images, each having a unique ID, as shown in the first row of Table 1; (2) a mask and bounding box (bbox) associated to each image (second row of Table 1).</p>
        <preformat>
{"images": [
    {"id": 514, "width": 1080,
     "height": 1920, "file_name": "frame.jpg"},
    ...
]}

{"annotations": [
    {"id": 519, "iscrowd": 0,
     "image_id": 514, "category_id": 1,
     "segmentation": [[587.52,...,1097.77]],
     "bbox": [...], "area": ...},
    ...
]}
        </preformat>
        <p>Table 1: An example of the formatting of the JSON files containing image annotations according to the COCO specifications. The first row shows the data structure used to list all the images in the dataset; the second row shows the one used to specify the annotations associated with each image, thus including the mask (“segmentation”) and the bounding box (“bbox”). The “category_id” field is always set to 1, as there is only one category (door or person, depending on the dataset).</p>
        <p>Moreover, assuming the camera to be static and, therefore, the position of the doors to be fixed over time, this project exploits two distinct models: one for door detection only and the other for people detection. Door detection is applied just in the starting phase of the framework while, from then on, people detection is performed. The process of generating the two models and the related results is analyzed below.</p>
      </sec>
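      <p>As a sketch of how such COCO-style annotation files can be consumed, the snippet below links an annotation record to its image through the “image_id” field; the IDs and coordinates are illustrative placeholders, not values from the actual Dppl dataset.</p>

```python
import json

# Illustrative COCO-style documents, mirroring the structure of Table 1.
# The concrete IDs and coordinates are placeholders.
images_doc = json.loads("""
{"images": [
    {"id": 514, "width": 1080, "height": 1920, "file_name": "frame.jpg"}
]}
""")
annotations_doc = json.loads("""
{"annotations": [
    {"id": 519, "iscrowd": 0, "image_id": 514, "category_id": 1,
     "segmentation": [[587.52, 1097.77]],
     "bbox": [407.29, 5.90, 295.90, 809.02]}
]}
""")

def annotations_for_image(images, annotations, file_name):
    """Return the annotation records attached to a given image file."""
    # Map file names to image IDs, then filter annotations by "image_id".
    image_id = {img["file_name"]: img["id"] for img in images}[file_name]
    return [a for a in annotations if a["image_id"] == image_id]

anns = annotations_for_image(images_doc["images"],
                             annotations_doc["annotations"], "frame.jpg")
```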
      <sec id="sec-2-5">
        <p>The dataset is split into training, validation and test sets. These subsets are disjoint; the training set contains 70% (1941) of the images, while the remaining 30% is equally divided between the validation and test sets (416 each).</p>
        <p>With the new dataset available, called Dppl, we fine-tuned the model pre-trained on the COCO dataset, which is available in the framework’s GitHub repository. Consequently, ResNet101 was used as the backbone, and training was done in the same manner as by the framework’s authors. In particular, we trained the head only for the first ten epochs; for the following thirty epochs, we fine-tuned stages four and above of the backbone too; finally, in the last ten epochs, we extended the training to the entire network. Unlike [<xref ref-type="bibr" rid="ref19">19</xref>], the learning rate is initially set to 0.001 (rather than 0.02) to keep the weights from exploding; moreover, it is divided by a factor of 10 during phases two and three of the training. The other parameters are left unchanged, such as the weight decay of 0.0001 and the momentum of 0.9. Finally, mini-masks were used (i.e. the masks were resized to 56x56 px) to lessen the risk of memory problems. Data augmentation (horizontal flipping) was also applied. Figure 2 shows the training and validation losses obtained during training.</p>
        <p>On the test set, the AP metric was used to assess the quality of the results produced by the training. AP, the acronym for Average Precision, computes the average precision value for recall values over 0 to 1. In practice, AP is computed as the mean of the precision values at a set of n equally spaced recall levels, as defined by the following formula: AP = (1/n) ∑_{i=1..n} p_interp(r_i), where, given p(·) the precision, p_interp(r) = max_{r̃ ≥ r} p(r̃), and n = 101 in COCO. AP@k stands for the average precision for an IoU (Intersection over Union, i.e. how much the predicted mask overlaps with the ground truth) of k. More specifically, in the computation of AP@k, an estimated mask is considered to be True if its IoU is greater than or equal to k, and False otherwise.</p>
        <p>The primary challenge metric for the COCO dataset is AP@[.50:.05:.95] (usually referred to simply as AP), which is the average AP for IoU thresholds from 0.5 to 0.95 with a step size of 0.05. This metric is also used to evaluate the results on our test set. In particular, with the Dppl dataset and the training procedure described above, we got an AP of 85.7 and an AP@.75 of 95.8. We also report the Average Accuracy, which is calculated by counting how many pixels out of those belonging to a specific area are correctly classified. In this case, rather than the whole image, the considered area is the smallest rectangular portion of the image that contains both the ground-truth mask and the one produced by the model. In numerical terms, we obtained an Average Accuracy of 95.34% in the case of Door Detection.</p>
        <p>Figure 3 displays the situation in a corridor not included in the dataset: the door on the right, which is particularly “thinned” by the perspective, is indeed not detected. Precisely for this reason, the framework provides a specific graphical interface that allows adding new door positions, as shown in Section 4.3.</p>
      </sec>
      <sec id="sec-2-6">
        <title>3.1.2. People detection</title>
        <p>Similarly to what was done with the doors, a model for people detection is also generated. Mask R-CNN with the weights of COCO is already able, on its own, to detect and segment people with acceptable accuracy. However, fine-tuning was done using a dedicated dataset built specifically for the occasion from videos captured along a hallway. More in detail, the dataset contains 793 frames captured in a corridor by a 1080x1920 px resolution camera that was positioned a few centimeters from the ceiling (approximately 2.9 meters from the floor) with a vertical image layout. In the scene, six people appear walking down the hallway and entering/exiting the adjoining rooms. They wear various types of clothing (including a white coat to simulate the presence of a doctor); they are of different ages and all wear face masks. One of the people has a foot cast and crutches. All frames are hand-annotated to generate high-quality masks, accurately respecting the person’s shape. The related annotation files follow the COCO specifications, as described before.</p>
        <p>The split of the files between the training (555 images), validation (119) and test (119) sets follows the same proportions as the Dppl dataset.</p>
        <p>With this second dataset available, called dPPL, we once again fine-tuned the model pre-trained on the COCO dataset. All the Mask R-CNN parameters are kept the same, but Gamma Contrast is used as a data augmentation technique in conjunction with horizontal flipping in this case.</p>
        <p>Figure 4 shows the graph of the training and validation losses. As for the performance on the test set, Table 2 shows the comparative Average Precision values between the use of a model trained only on COCO and that obtained by fine-tuning on the dPPL dataset. This second option provides better results for both AP and AP@.75. The same applies to the Average Accuracy.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Method</th><th>AP</th><th>AP@.75</th><th>Acc.</th></tr>
            </thead>
            <tbody>
              <tr><td>COCO only</td><td>70.5</td><td>92.9</td><td>99.08%</td></tr>
              <tr><td>COCO+fine-tuning on dPPL</td><td>76.3</td><td>95.5</td><td>99.74%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>These good results should be evaluated considering the not very high number of images that compose the dataset. Indeed, environments with completely different illumination and compositions will certainly attenuate the good performances provided by this model.</p>
        <sec id="sec-2-6-1">
          <title>3.2. People Re-identification</title>
        </sec>
      </sec>
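      <p>The AP computation described above can be sketched as follows; this is a minimal illustration of interpolated precision averaged over n = 101 equally spaced recall levels, as in COCO, not the actual COCO evaluation code. The precision/recall points used in the example are illustrative.</p>

```python
def interpolated_precision(recalls, precisions, r):
    """p_interp(r): best precision achieved at any recall at or above r."""
    candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
    return max(candidates) if candidates else 0.0

def average_precision(recalls, precisions, n=101):
    """AP = (1/n) * sum over the n equally spaced recall levels of p_interp(r_i)."""
    levels = [i / (n - 1) for i in range(n)]
    return sum(interpolated_precision(recalls, precisions, r)
               for r in levels) / n

# A detector holding precision 1.0 up to recall 0.5 and then dropping to 0
# scores 51/101 here, since 51 of the 101 recall levels lie at or below 0.5.
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.0])
```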
      <sec id="sec-2-7">
        <p>The detection of doors and people in the scene does not suffice to ensure accurate tracking. As mentioned above, one can use additional information extracted from the images within more or less complex systems, which may exploit appearance, movement and shape features. An example is DeepSORT [<xref ref-type="bibr" rid="ref7">7</xref>], which uses the Kalman filter to predict the position of a person in the next frame and integrates appearance information based on a deep appearance descriptor. Despite DeepSORT being a powerful tool, the use of the Kalman Filter turns out to be less effective when the subject disappears from the camera view for long periods. Indeed, the Kalman Filter models the state estimate of the system (in this case, the position of a subject in the frame) as a Gaussian distribution whose variance strictly depends on the observations over time. When a person disappears from the scene, the degree of uncertainty increases and so does the distribution variance. Furthermore, the Kalman Filter would be practically useless if several people entered the same room: the states of those subjects would collapse into the same value, making this information useless for distinguishing a person from the others when they leave the room. Nevertheless, the solution adopted in DeepSORT regarding the use of appearance features turns out to be quite effective whenever the Kalman Filter is not, since it relies on visual cues. For this reason, DPPL Tracker is primarily based on appearance features, though it also takes advantage of some assumptions related to the work environment (a corridor).</p>
        <p>In this project, Deep Cosine Metric Learning [26], the same used in DeepSORT for appearance re-identification, is used. It applies a variation of the Softmax classifier called the Cosine Softmax Classifier, which allows obtaining a different representation space in which compact clusters are formed based on the appearance features. This is achieved by first applying ℓ2 normalization, which uses the ℓ2-norm to normalize the input values so that, if squared and summed, they would result in the value 1, and, secondly, by normalizing the weights. Finally, the cosine softmax classifier is applied, which is defined as follows: p(y = k | r) = exp(κ · w̃_kᵀ r) / ∑_{n=1..C} exp(κ · w̃_nᵀ r), where κ is a free scaling parameter. [...] 1 × 10⁻⁸; moreover, the input images are scaled to 128x64 px.</p>
      </sec>
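      <p>A minimal sketch of the cosine softmax classifier defined above: both the input descriptor r and the class weights are ℓ2-normalized before the scaled softmax is applied. The weight vectors and the value of κ used here are illustrative.</p>

```python
import math

def l2_normalize(v):
    """Scale v so that the sum of its squared entries equals 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_softmax(r, weights, kappa=10.0):
    """p(y = k | r) = exp(kappa * w_k . r) / sum_n exp(kappa * w_n . r),
    with both r and every w_k L2-normalized."""
    r = l2_normalize(r)
    logits = [kappa * sum(wi * ri for wi, ri in zip(l2_normalize(w), r))
              for w in weights]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A descriptor aligned with the first class weight gets almost all the mass.
probs = cosine_softmax([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```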
      <sec id="sec-2-8">
        <p>The use of the masked MARS dataset proves to be beneficial for the network training, since it provides improved results according to the CMC Rank@K and mAP metrics<sup>1</sup>, as shown in Table 4. The table also shows the results of two state-of-the-art solutions on the original MARS dataset. Both largely outperform the solution proposed in this project; however, they also use much more sophisticated methods or networks with many more parameters.</p>
        <p>[...] distractors to make it more realistic. The goal of the Re-[...] dataset after applying object instance segmentation.</p>
      </sec>
      <sec id="sec-2-9">
        <table-wrap id="tab4">
          <label>Table 4</label>
          <table>
            <thead>
              <tr><th>Method</th><th>Rank1</th><th>Rank5</th></tr>
            </thead>
            <tbody>
              <tr><td>DCML on MARS</td><td>...</td><td>...</td></tr>
              <tr><td>DCML on masked MARS</td><td>...</td><td>...</td></tr>
              <tr><td>B-BOT + Attention &amp; CL loss</td><td>...</td><td>...</td></tr>
              <tr><td>MGH</td><td>...</td><td>...</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p><sup>1</sup> Computed through the MARS evaluation tool, available at [...].</p>
      </sec>
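      <p>For reference, the CMC Rank@K metric mentioned above can be sketched as follows: for each query, the gallery identities are ranked by similarity, and Rank@K is the fraction of queries whose true identity appears among the top K. The query/gallery data below are illustrative and do not come from MARS.</p>

```python
def cmc_rank_at_k(ranked_gallery_ids, true_ids, k):
    """ranked_gallery_ids[i] lists the gallery identities proposed for
    query i, best match first; true_ids[i] is the ground-truth identity."""
    hits = sum(1 for ranking, true_id in zip(ranked_gallery_ids, true_ids)
               if true_id in ranking[:k])
    return hits / len(true_ids)

# Two queries: the first is matched at rank 1, the second only at rank 3.
rankings = [["a", "b", "c"], ["b", "c", "a"]]
truth = ["a", "a"]
rank1 = cmc_rank_at_k(rankings, truth, 1)
rank5 = cmc_rank_at_k(rankings, truth, 5)
```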
    </sec>
    <sec id="sec-3">
      <title>4. DPPL Tracker framework</title>
      <sec id="sec-3-1">
        <p>People tracking is offered through a specific framework that employs Mask R-CNN and the above-mentioned re-identification network. It also provides additional features to improve the user experience and optimize the search for people. More precisely, the workflow is the following: the first frame is first passed as input to Mask R-CNN for door detection. Once the doors are located, that frame and the following ones are passed to the same network (with different weights) for people detection. The portion of the image containing each person is then multiplied by the corresponding mask (to obtain a black background) and, after being resized to 128x64 px, is passed to the re-identification network. The latter has its head cut off so that it outputs an array of size 128 (generated by the last Dense layer). This array is a descriptor of the person’s appearance and is used by the framework’s main algorithm to associate a unique identity ID with each person.</p>
        <sec id="sec-3-1-1">
          <title>4.1. Main algorithm</title>
        </sec>
      </sec>
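      <p>The masking step described above can be sketched as follows; plain nested lists stand in for image arrays here, whereas the actual framework operates on full RGB frames.</p>

```python
def apply_mask(crop, mask):
    """Zero out every pixel of the person crop where the binary mask is 0,
    so the background becomes black before re-identification."""
    return [[pixel * m for pixel, m in zip(row, mask_row)]
            for row, mask_row in zip(crop, mask)]

# A 2x2 toy "crop" and its binary mask; only masked-in pixels survive.
crop = [[10, 20],
        [30, 40]]
mask = [[1, 0],
        [0, 1]]
masked = apply_mask(crop, mask)
```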
      <sec id="sec-3-2">
        <title>Algorithm 1: Main algorithm</title>
        <preformat>
Data: video frames, door positions
Result: People identified
 1  known_people ← [] ;
 2  for frame in video_frames do
 3      boxes, masks ← detect_people(frame) ;
 4      for each detected person:
 5          person_img ← frame[box[0]:box[2], box[1]:box[3]] * mask ;
 6          identifier ← get_person_identifier(person_img) ;
 7          pID, sim ← find_nearest(identifier) ;
 8          if pID == -1 then
 9              // New person appeared
10          else
11              // Person in the corridor or exited from a room
12          end
13      known_people ← people detected in the current frame ;
14  end
15  for person in people_known_from_past_frames() do
16      if person not in known_people then
17          if person close to a room then
18              // Person entered a room
19          else
20              // Person disappeared from the scene (maybe due to an occlusion)
21          end
22      end
23  end
        </preformat>
        <p>After selecting the video, the first frame is analyzed through Mask R-CNN to locate the doors in the scene. If one or more doors are not detected, the user can manually add additional ones, as shown in Section 4.3. Only at that point does the analysis of the following frames begin. Pseudocode 1 shows the main steps. As previously described, Mask R-CNN is again used to identify people, while the re-ID network provides the people’s appearance descriptors. At that point, for each person, the find_nearest function identifies the already-known identifier closest to the detected descriptor, if any. In this way, it is possible to determine whether that person already appeared in the past and, depending on their position and on the knowledge derived from past frames, a log is added to the database if they are leaving a room. If there is no similar person, the algorithm adds a new one to the scene. The final for loop finds all the people who were in the environment up to the previous frame but are now missing. In this case, there are two alternatives: the person may either have entered a room (if in the preceding frame they were sufficiently close to the relative door) or may have disappeared, for example, because they left the hallway or are temporarily occluded. To improve the efficacy of the algorithm, the framework starts tracking a person only when they appear entirely in the scene and their bounding box is at a minimum distance from the image edges. Furthermore, it uses the area of the bbox to interrupt (temporarily or not) the tracking when an object/person occludes the subject or when the tracked person has nearly entirely entered a room.</p>
        <p>A fundamental step is the one implemented by the find_nearest function, shown in Pseudocode 2. It uses differentiated searches to find the already-known person with the identity most similar to the one passed as input. First, it searches among the people visible in the scene in the previous frame. In case of failure, if the detection is close enough to a door - according to a given threshold - it searches among the people who are known to be in that room. As a last chance, it starts searching among the people who last left the corridor, then moving on to all the known people. The similarity between two identifiers ID_i and ID_j is computed with the cosine similarity, as follows: sim(ID_i, ID_j) = (ID_i · ID_j) / (‖ID_i‖ ‖ID_j‖). Two identifiers are more similar as the cosine similarity goes to one. Hence the need to define, for each of the listed searches, a threshold that establishes when two descriptors must be considered sufficiently similar (and therefore belonging to the same person) or not. The choice of the threshold heavily influences the tracking effectiveness. In the various phases a different threshold is used; more specifically: (1) if a person is walking along the corridor without other people in the close vicinity and, compared to the previous frame, that person has not moved too far from their previous position in the scene, then a greater dissimilarity between the descriptors is tolerated; (2) in the other cases, the threshold is set to a value between 0.85 and 0.9. Section 5 discusses some critical issues regarding the choice of the threshold.</p>
      </sec>
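      <p>A simplified sketch of the similarity test at the core of find_nearest: the best-matching known descriptor is accepted only if its cosine similarity clears the threshold, and -1 signals a new person. The descriptors and the single fixed threshold are illustrative; the actual framework differentiates the searches and thresholds as described above.</p>

```python
import math

def cosine_similarity(a, b):
    """sim(a, b) = (a . b) / (||a|| ||b||); closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def find_nearest(query, known, threshold=0.85):
    """Return the known person ID whose descriptor is most similar to
    `query`, or -1 if no similarity clears the threshold (new person)."""
    best_id, best_sim = -1, threshold
    for person_id, descriptor in known.items():
        sim = cosine_similarity(query, descriptor)
        if sim >= best_sim:
            best_id, best_sim = person_id, sim
    return best_id

# Two known people with toy 3-dimensional descriptors.
known = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0]}
match = find_nearest([0.9, 0.1, 0.0], known)      # very similar to person 0
no_match = find_nearest([0.0, 0.0, 1.0], known)   # nobody similar: new person
```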
      <sec id="sec-3-3">
        <title>Algorithm 2: Find nearest identity</title>
        <p>Each log entry has the form: frameID personID roomID “in/out/new”, where frameID is an incremental value representing the currently processed frame; personID is a unique integer associated to a person (different from the identifier representing the way that person looks in the scene); roomID is the ID of the room the person is entering/leaving, if any, and is equal to −1 otherwise. The last label has the value “in” or “out” when roomID is different from −1, while it assumes the value “new” when a new person appears in the scene.</p>
        <p>For simplicity, the database is implemented via a simple CSV file containing all the logs, but more complex and scalable solutions (such as NoSQL) are also possible. Knowing the video framerate, the framework derives an estimate of the time spent in each room, to highlight possible dangerous situations. The same is done by counting the number of people in the same room and alerting when the maximum capacity is exceeded.</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.3. GUI</title>
        <p>A simple user interface, implemented with the PySimpleGUI library, is also available to provide the user with more flexible interaction with the framework. The user can select a file or directory containing the needed frame images, as well as add new doors that Mask R-CNN did not detect. In this second case (shown in Figure 6), by using a simple library such as Matplotlib, it is possible to offer a response in real-time on the location of the new doors and their heights (used by the algorithm). Finally, at the end of the processing of all frames, the user can search all the times a particular ID has entered and left a room (Figure 7). In the latter case, the interface highlights the riskiest situations (for example, if the room capacity has been exceeded), in addition to providing all the records linked to the entered ID.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Analysis and results</title>
      <sec id="sec-4-1">
        <p>The behaviour of the framework is evaluated in two different setups of incremental difficulty. In the first setup, people walk down a corridor one after the other, in a perfect flow that limits the occasions in which two or more people are simultaneously in the same room. This modality allows focusing mainly on inter-frame re-identification and on the correct detection of people entering and leaving the rooms. In the second setup, multiple people can enter the same room. The challenge, in this case, is to be able to identify the identity of a person when they leave the room. The results show that the algorithm can handle a wide range of situations with ease, producing results that are similar - if not identical - to the ground truth.</p>
        <p>First of all, it is beneficial to analyze how accurately the framework can detect the presence of one or more people in the scene. To calculate the overall accuracy of the detections, we used two methods. The first consists of considering only those frames in which a person is shown entirely (i.e., they are not hidden - even partially - by objects or other people). The second is to consider all frames, including all the borderline cases in which only a portion of a person’s arm or leg appears in the frame. Figure 8 shows an example of the frames considered with both methods. The results - obviously better in numerical terms in the first case - are shown in Table 5.</p>
        <table-wrap id="tab5">
          <label>Table 5</label>
          <caption><p>The accuracy of people detection computed with two methods. With the first one, we only considered those frames in which people’s bodies are shown wholly in the image. The second method also includes those frames in which a person is only partially visible.</p></caption>
          <table>
            <thead>
              <tr><th>Method</th><th>Overall (Detection) Accuracy</th></tr>
            </thead>
            <tbody>
              <tr><td>Method 1</td><td>100%</td></tr>
              <tr><td>Method 2</td><td>91.76%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Moving on to the accuracy of people tracking: the inter-frame re-identification of a person in the scene scores 100% accuracy, even in the case of several people in the corridor; the same happens when a person leaves a room, even when more than one is inside it. The criticalities are mainly two: (1) the difficulty in defining an efficient threshold for the cosine similarity, since the method adopted is susceptible to sudden changes in the person’s position (such as front and rear views of the person); (2) the influence of the quality of the masks produced by Mask R-CNN on the re-identification network. A sudden change in the portion of the image taken into consideration (even without sudden movements of the subject) can reduce the cosine similarity.</p>
        <p>Cosine similarity can be a powerful tool for guiding the re-identification task: limiting the search to the people inside the room and using the cosine similarity always leads to correct identifications. Nevertheless, the weaknesses listed above heavily reduce its effectiveness when it is necessary to recognize a person who had previously left the corridor (without entering any room) and who reappears later on. Indeed, the choice of a high threshold (i.e., ≥ 0.9) makes it difficult to assign the same ID in the situation under analysis because, usually, the person will reappear in a completely different pose (for example, from behind and not from the front), which will reduce the value of the cosine similarity. In this case, there will be no ID switches between different people, but each time one reappears in the scene, they will be assigned a new ID.</p>
        <p>On the contrary, lowering the threshold facilitates ID switches, creating some cascading problems in the framework (an ID already assigned - even if incorrectly - to a person will not be re-assigned as long as that person is in the scene, not even if the one it was originally assigned to reappears). However, these problems do not affect the recognition of people leaving the rooms: the identifier produced by the Re-ID network and the similarity computed with the cosine similarity are sufficient for the correct attribution of the ID. Compared to the baseline (Re-ID network trained on the original MARS dataset), it can be observed that the cosine similarity of the same person in two different situations (frames) is greater (by 1-2%) when assessed with our method.</p>
        <p>Figure 8: The frame on the left is an example of those considered with Method 1 for calculating the Overall Detection Accuracy. The person’s body is entirely included in the scene. The frame on the right is instead an example of those considered with Method 2, which also takes into account all the borderline cases in which only a portion of a person’s arm or leg appears in the frame. In this case, the two people in the scene are only</p>
        <p>As a final benchmark, the accuracy of the logs (seen as the ratio of the logs equal to the ones of the ground truth over the total number of them) produced in the tests is equal to 50%. The accuracy goes up to 84% if we also include those logs with labels “in” and “out” that differ only in the person ID from the ground truth (but only if that ID is a new one, and therefore if there is no ID
dpeatreticatlelyd vbiysitbhlee amnoddtehl.e arm of the uppermost person is not switch with a previously known identity). When a person
enters a room, the relative log at the exit is always correct,
as already mentioned above. As for performance, an</p>
        <p>Having ascertained that the framework can detect the Nvidia Tesla K80 is capable of processing 1.4-1.5 frames
presence of people with good reliability, we then move per second.</p>
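The threshold rule discussed above (a query embedding is matched to the most similar gallery identity, and the match is accepted only when the cosine similarity clears a fixed threshold such as ≥ 0.9, otherwise a new ID is created) can be sketched as follows. This is a minimal illustration under our own naming, with toy two-dimensional embeddings, not the framework's actual code:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_id(query, gallery, threshold=0.9, next_id=0):
    """Match a query embedding against known identities.

    gallery maps person ID -> stored embedding. Returns (id, is_new).
    A high threshold avoids ID switches between different people, but
    tends to create a fresh ID when a person reappears in a very
    different pose (e.g., seen from behind instead of the front).
    """
    best_id, best_sim = None, -1.0
    for pid, emb in gallery.items():
        sim = cosine_similarity(query, emb)
        if sim > best_sim:
            best_id, best_sim = pid, sim
    if best_id is not None and best_sim >= threshold:
        return best_id, False      # re-identified as a known person
    return next_id, True           # similarity too low: new identity
```

With a 0.9 threshold, a query close to a stored embedding keeps its ID, while a query at roughly 45° (cosine ≈ 0.71) is assigned a new one, mirroring the behavior described above.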
        <p>We also ran a test in a setup with slightly different
specifications: the recording device was placed at eye level, tilted
almost parallel to the floor, and with an image ratio of 16:9. The
results obtained are comparable to those indicated above, although
tracking people in areas very distant from the camera (and therefore
at lower resolution) turns out to be more critical. Under these
conditions, it is quite easy for two different subjects to appear
very similar even to the human eye. An example is shown in Figure 9.
Ultimately, the framework is most effective when the distance to the
doors is not excessively large.</p>
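The log-accuracy figures reported above follow a simple rule: a log counts as correct when it matches the ground truth exactly, and, under the relaxed criterion, also when an “in”/“out” log differs from the ground truth only in a freshly assigned person ID. A minimal sketch of such a scorer; the tuple-based log format and the `known_ids` argument are our own illustrative assumptions:

```python
def log_accuracy(logs, ground_truth, relaxed=False, known_ids=frozenset()):
    """Fraction of produced logs that match the ground truth.

    Each log is modeled here as a (person_id, action, room) tuple,
    e.g. (3, "in", "room2"). With relaxed=True, a log also counts as
    correct when it differs from the ground truth only in the person
    ID, provided that ID is a new one (i.e. not an ID switch with a
    previously known identity).
    """
    correct = 0
    for log, gt in zip(logs, ground_truth):
        if log == gt:
            correct += 1
        elif relaxed and log[1:] == gt[1:] and log[0] not in known_ids:
            correct += 1
    return correct / len(logs)
```

Under this rule, a run whose only errors are fresh IDs on otherwise correct “in”/“out” events scores low strictly but high in the relaxed count, which is how a 50% strict accuracy can rise to 84%.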
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <sec id="sec-5-1">
        <p>DPPL Hallway Tracker turns out to be a good starting
point for developing a framework capable of tracking
people entering and leaving multiple rooms. The use of
a re-ID network that exploits the masks produced in the
detection and segmentation phase leads, even in the tests
performed, to improvements in identification.</p>
        <p>A project extension might address some of
the remaining issues: (1) the enrichment of the datasets of
people and doors could lead to better detection in several
more challenging contexts: for example, as discussed
above, the detection and segmentation of doors “thinned”
by perspective remains difficult; (2) using a dynamic
threshold and investigating solutions complementary to
the re-identification network could alleviate the difficulty
of assigning the same ID to a person who reappears in the
corridor without leaving a room. The study of solutions
for tracing people entering and leaving rooms is of
great importance for the application developments that it
can have. It not only allows contact tracing in the event
of pandemics but can also be used in other contexts, such
as the analysis of the movements of patients and medical
operators and the optimization of hospital wards.</p>
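The dynamic-threshold idea in point (2) could take many forms; one simple possibility, sketched here under our own assumptions and not evaluated in this work, is to relax the cosine-similarity threshold the longer a person has been out of view, so a reappearance in a different pose can still be matched while recently seen people still require high similarity:

```python
def dynamic_threshold(base=0.9, frames_absent=0, floor=0.75, decay=0.005):
    """Illustrative dynamic rule for the cosine-similarity threshold.

    Starts at a strict base value (limiting ID switches) and decays
    linearly with the number of frames a person has been absent,
    never dropping below a safety floor.
    """
    return max(floor, base - decay * frames_absent)
```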
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Alfano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ercolano</surname>
          </string-name>
          ,
          <article-title>The efficacy of lockdown against covid-19: a cross-country panel analysis</article-title>
          ,
          <source>Applied health economics and health policy 18</source>
          (
          <year>2020</year>
          )
          <fpage>509</fpage>
          -
          <lpage>517</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pepe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tedeschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Human attention assessment using a machine learning approach with gan-based data augmentation technique trained using a custom dataset</article-title>
          ,
          <source>OBM Neurobiology 6</source>
          (
          <year>2022</year>
          ). doi:10.21926/obm.neurobiol.2204139.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wajda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brociek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Analysis pre and post covid-19 pandemic rorschach test data of using em algorithms and gmm models</article-title>
          , volume
          <volume>3360</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Marcotrigiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Stingi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fregnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Magarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pasquale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Orsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Montagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>An integrated control plan in primary schools: Results of a field investigation on nutritional and hygienic features in the apulia region (southern italy)</article-title>
          ,
          <source>Nutrients</source>
          <volume>13</volume>
          (
          <year>2021</year>
          ). doi:10.3390/nu13093006.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Magistris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Starczewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>A novel dwt-based encoder for human pose estimation</article-title>
          , volume
          <volume>3360</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zowghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrari</surname>
          </string-name>
          ,
          <article-title>The rise and fall of covid-19 contact-tracing apps: when nfrs collide with pandemic</article-title>
          ,
          <source>in: 2021 IEEE 29th International Requirements Engineering Conference (RE)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>106</fpage>
          -
          <lpage>116</lpage>
          . doi:10.1109/RE51729.2021.00017.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Wojke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bewley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Paulus</surname>
          </string-name>
          ,
          <article-title>Simple online and realtime tracking with a deep association metric</article-title>
          ,
          <source>in: 2017 IEEE international conference on image processing (ICIP)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>3645</fpage>
          -
          <lpage>3649</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alfarano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>De Magistris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mongelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Starczewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>A novel convmixer transformer based architecture for violent behavior detection</article-title>
          ,
          <source>LNAI 14126</source>
          (
          <year>2023</year>
          )
          <fpage>3</fpage>
          -
          <lpage>16</lpage>
          . doi:10.1007/978-3-031-42508-0_1.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nevatia</surname>
          </string-name>
          ,
          <article-title>Multi-target tracking by online learning of non-linear motion patterns and robust appearance models</article-title>
          ,
          <source>in: 2012 IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>1918</fpage>
          -
          <lpage>1925</lpage>
          . doi:10.1109/CVPR.2012.6247892.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bewley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Upcroft</surname>
          </string-name>
          ,
          <article-title>Simple online and realtime tracking</article-title>
          ,
          <source>in: 2016 IEEE International Conference on Image Processing (ICIP)</source>
          ,
          <year>2016</year>
          . URL: http://dx.doi.org/10.1109/ICIP.2016.7533003. doi:10.1109/icip.2016.7533003.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonanno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Capizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Wavelet recurrent neural network with semi-parametric input data preprocessing for micro-wind power forecasting in integrated generation systems</article-title>
          ,
          <year>2015</year>
          , pp.
          <fpage>602</fpage>
          -
          <lpage>609</lpage>
          . doi:10.1109/ICCEP.2015.7177554.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Faster r-cnn: Towards real-time object detection with region proposal networks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          )
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <article-title>Rich feature hierarchies for accurate object detection and semantic segmentation</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>580</fpage>
          -
          <lpage>587</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Kalman</surname>
          </string-name>
          ,
          <article-title>A new approach to linear filtering and prediction problems</article-title>
          (
          <year>1960</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Rabaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <article-title>Counting crowded moving objects</article-title>
          ,
          <source>in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)</source>
          , volume
          <volume>1</volume>
          ,
          <year>2006</year>
          , pp.
          <fpage>705</fpage>
          -
          <lpage>711</lpage>
          . doi:10.1109/CVPR.2006.92.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <article-title>Feature pyramid networks for object detection</article-title>
          ,
          <year>2017</year>
          . arXiv:1612.03144.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ramôa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Alexandre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mogo</surname>
          </string-name>
          ,
          <article-title>Real-time 2d-3d door detection and state classification on a low-power device</article-title>
          ,
          <source>SN Applied Sciences</source>
          <volume>3</volume>
          (
          <year>2021</year>
          ). doi:10.1007/s42452-021-04588-3.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Labit-Bonis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lerasle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Madrigal</surname>
          </string-name>
          ,
          <article-title>Fast tracking-by-detection of bus passengers with siamese cnns</article-title>
          ,
          <source>in: 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . doi:10.1109/AVSS.2019.8909843.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <article-title>Microsoft coco: Common objects in context</article-title>
          ,
          <source>in: European conference on computer vision</source>
          , Springer,
          <year>2014</year>
          , pp.
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>People counting system for getting in/out of a bus based on video processing</article-title>
          ,
          <source>in: 2008 Eighth International Conference on Intelligent Systems Design and Applications</source>
          , volume
          <volume>3</volume>
          ,
          <year>2008</year>
          , pp.
          <fpage>565</fpage>
          -
          <lpage>569</lpage>
          . doi:10.1109/ISDA.2008.335.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>N.</given-names>
            <surname>Wojke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bewley</surname>
          </string-name>
          ,
          <article-title>Deep cosine metric learning for person re-identification</article-title>
          ,
          <source>in: IEEE Winter Conference on Applications of Computer Vision (WACV)</source>
          , IEEE,
          <year>2018</year>
          . URL: https://elib.dlr.de/116408/.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.-W.</given-names>
            <surname>Perng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-W.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>The design and implementation of a vision-based people counting system in buses</article-title>
          ,
          <source>in: 2016 International Conference on System Science and Engineering (ICSSE)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          . doi:10.1109/ICSSE.2016.7551620.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <article-title>MARS: A Video Benchmark for Large-Scale Person Re-identification</article-title>
          , Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <article-title>Scalable person re-identification: A benchmark</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision (ICCV)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Velastin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Espinosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bay</surname>
          </string-name>
          ,
          <article-title>Detecting, tracking and counting people getting on/off a metropolitan train using a standard video camera</article-title>
          ,
          <source>Sensors</source>
          <volume>20</volume>
          (
          <year>2020</year>
          ). URL: https://www.mdpi.com/1424-8220/20/21/6251. doi:10.3390/s20216251.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Eshratifar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gormish</surname>
          </string-name>
          ,
          <article-title>Video person re-id: Fantastic techniques and where to find them</article-title>
          ,
          <year>2019</year>
          . arXiv:1912.05295.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <article-title>Learning multi-granular hypergraphs for video-based person re-identification</article-title>
          ,
          <year>2021</year>
          . arXiv:2104.14913.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Pore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. F.</given-names>
            <surname>Momin</surname>
          </string-name>
          , Bidirectional people count- arXiv:
          <fpage>2104</fpage>
          .14913.
          <article-title>ing system in video surveillance</article-title>
          ,
          <source>in: 2016 IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>724</fpage>
          -
          <lpage>727</lpage>
          . doi:10.1109/RTEICT.2016.7807919.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gkioxari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Mask R-CNN</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2961</fpage>
          -
          <lpage>2969</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonanno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Capizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laudani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Sciuto</surname>
          </string-name>
          ,
          <article-title>Optimal thicknesses determination in a multilayer structure to improve the SPP efficiency for photovoltaic devices by an hybrid FEM</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>