DPPL Hallway Tracker: Hospital Contact Tracing During the COVID-19 Pandemic

Christian Marinoni¹, Valerio Ponzi²,³ and Danilo Comminiello¹

¹ Dpt. of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, Via Eudossiana 18, Roma, 00184, Italy
² Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, Roma, 00185, Italy
³ Institute for Systems Analysis and Computer Science, Italian National Research Council, Via dei Taurini 19, Roma, 00185, Italy

Abstract
During the COVID-19 pandemic, the use of a people tracking system could have been crucial, particularly in sensitive environments, such as hospitals. DPPL Hallway Tracker is a framework that uses security camera footage to determine which rooms in a corridor a person has entered. It generates a database containing all the people identified and allows quick identification of potential cases of infection based on the time spent in a room and its maximum capacity. DPPL Hallway Tracker is structured in two phases: detection and re-identification. In the first phase, it exploits Mask R-CNN to identify people and room doors. In the second one, it uses the deep association metric model from DeepSORT to re-identify a person as they leave a room.

Keywords
People Tracking, COVID-19 tracking systems

SYSTEM 2023: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
christian.marinoni@uniroma1.it (C. Marinoni); ponzi@diag.uniroma1.it (V. Ponzi); danilo.comminiello@uniroma1.it (D. Comminiello)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073, pages 51–61.

1. Introduction

Managing a pandemic has proved to be a difficult challenge despite the technological developments of the past decades. Containment measures based on restrictions on personal mobility (such as lockdowns) have proved to be very effective for infection containment [1, 2, 3]. However, these turn out to be short-term solutions that are not extendable throughout the whole virus's life cycle. As with COVID-19, the presence of a potentially infected individual in a closed environment is a central problem, and the risk of contagion increases with exposure time. Face masks, in combination with good room ventilation, help to reduce the risk of transmission; however, they are not sufficient to eliminate all risks. Tracking operations are required to ensure the identification of the chain of contacts and the estimation of the relative risk of contagion. Tracking turns out to be even more essential in public settings, such as public offices and hospitals [4, 5].

Some countries, such as Italy and Germany, used specific tracking apps (respectively, Immuni and Corona-Warn-App) for Bluetooth-based contact estimation [6]. These solutions, although potentially effective, have shown evident limitations, such as low diffusion in the population, constraints on the version of the smartphone OS, poor estimation of distances and related false positives. While they may be effective in the short term, since they are employable on a large scale, other solutions prove to be more effective in the long run. Among these, the security cameras already installed in many public and private contexts can represent an excellent solution in terms of scalability and minimum requirements for the citizen. Indeed, they allow for the estimation of people's distances as well as the detection of room entrances and exits.

This project aims to create an offline framework for tracing the entrances and exits of people in one or multiple rooms facing a hallway. In this way, it is possible to extract valuable information for estimating the risk of infection, such as the duration of the stay and the level of saturation of the room given its maximum capacity. The methodology described relies solely on Deep Learning solutions, and it employs two networks to detect doors and people and assign them appearance descriptors. A specific algorithm is in charge of tracking people's movements, exploiting the characterization of the hallway environment and the descriptors generated. In particular, unlike other solutions that exploit motion features to determine a distribution of the positions where a subject can be in the next frame [7, 8, 9], this project - named DPPL Hallway Tracker - uses only appearance features. A person is first identified in the scene and segmented using Mask R-CNN; then, their mask is passed to a Re-ID network to obtain an identifier (an array) that "describes" the way they appear in the scene. The descriptors are finally compared with those of the people already known to verify the person's identity. Another contribution, in addition to the general approach adopted, is the use of three new datasets to fine-tune the networks, built from scratch or starting from existing ones.

DPPL Hallway Tracker appears to be very effective in tracking people entering and leaving rooms facing a corridor. The use of appearance features turns out to be sufficiently robust to allow correct identification, even if it is less effective in recognizing people who reappear in the corridor without leaving a room. This report describes the project's workflow, from the description of the datasets to the analysis of the results.

2. Related works

The object tracking problem is one of the classic problems in Computer Vision. Being able to determine the position of an object, even in the presence of partial or total occlusions, can be beneficial in many contexts, such as automated surveillance, video indexing, human-computer interaction, traffic monitoring, vehicle navigation and many others. A solution to the object tracking problem should manage multiple complexities: the loss of information caused by the projection of the 3D world onto a 2D image, the complexity of the movement of objects, the presence of occlusions and changes in the scene illumination can make this task highly challenging.

The approaches can be divided into several categories based on their implementation and conceptual characteristics. In this Section, some solutions based on the "tracking-by-detection" strategy are mentioned. This strategy consists in performing type-specific object detection or motion detection and then conducting (sequential or batch) tracking to link detection hypotheses into actual trajectories.

An example of an application is the one proposed by Bewley et al. [10], known as SORT (Simple Online and Realtime Tracking). It uses CNN-based detection - more specifically, Faster R-CNN [11] - to identify people in the scene. At that point, SORT associates a state x = [u, v, s, r, u̇, v̇, ṡ]ᵀ with each target, where u and v represent the horizontal and vertical pixel location of the centre of the target, s and r are the scale (area) and the constant aspect ratio of the target's bounding box and, finally, u̇, v̇, ṡ are the corresponding first derivatives (velocities) of u, v and s. The state gets updated at every new frame based on the related new detection within a Kalman Filter framework [12].

A related work is DeepSORT [7]. It expands the SORT framework by providing a re-identification network that takes as input the portion of the image showing the person and returns an appearance descriptor (a vector of size 128). This vector makes it easier to correctly assign identities to people by reducing the number of inter-frame ID switches.

SORT and DeepSORT, as well as other methods that use motion features, are effective tools for people tracking; however, they are not the best option in the case of people entering and leaving rooms. Indeed, the states of multiple people entering the same room collapse to the same value, thus providing no valuable information for the ID attribution when a person leaves the room. On the contrary, the use of a re-identification network based on appearance features in DeepSORT is functional for the current application and is therefore also implemented in this project.

In today's literature, to the best of our knowledge, there are no studies aimed at analyzing the specific context of tracking and re-identifying people who enter and leave rooms. Pedestrians on streets or people moving around indoors are usually the focus of most approaches. Other works specialize in counting people in particular environments. For example, Rabaud and Belongie [13] investigate the possibility of counting people passing through crowded environments; [14], [15], [16] focus on counting passengers getting in/out of a bus and [17] of a metropolitan train; [18] counts people walking through a corridor or a door, without taking into account their identities.

The absence of a similar application makes the comparison between the implementation proposed in this project and a baseline more complex. Therefore, in the following Sections, the individual modules that constitute it are compared with corresponding existing solutions, in an attempt to offer an objective yardstick for the choices made.

3. People and Door detection

The fundamental principle behind this project is the search for practical but effective solutions for tracking people entering and leaving rooms. As said in Section 2, in the "tracking-by-detection" strategy the first main challenge is object detection, i.e., producing a bounding box (and, eventually, a mask) for both people and doors in the image. The framework can thereby determine the position of a person at each frame and their relative distance from the doors detected in the scene. This Section describes the datasets used, as well as the implementation choices and the results obtained.

3.1. Object semantic segmentation

In order to obtain people tracking, it is crucial to identify the position of people and doors to understand which room they enter and leave. There are generally two ways to accomplish this task: object detection and image segmentation. Object detection focuses on defining the position of objects in an image, whereas image segmentation locates an object and defines a mask of pixels that represents it. This project exploits the second one - and, more specifically, its subclass known as instance segmentation - because of the benefits it provides in the re-identification task. More specifically, it employs the Mask R-CNN framework [19, 20], which derives from Faster R-CNN [11, 21] (in turn, one of the evolutions of the original R-CNN [22]) but adds a third parallel head used to generate the masks. It also introduces further improvements, like the support for pixel-to-pixel alignment between network inputs and outputs (RoI-Align). Figure 1 shows the different stages that characterize the network.

[Figure 1: General scheme of the Mask R-CNN framework. The layers indicated with the letters C and P are convolutional layers that represent the backbone network. The classic pyramid architecture improves the detection of objects of various sizes.]

Initially, the image is passed as input to a convolution-based Feature Pyramid Network [23], which has the task of extracting meaningful information from differently-sized feature maps. An object can appear in the foreground (and therefore very large in the image) or further away from the camera; hence, this pyramidal structure facilitates its detection. The features thus extracted are passed to the Region Proposal Network (RPN), which produces several Regions Of Interest (RoI), each with its bounding box. At this point, the aforementioned RoI-Align is applied and its result is passed to the second stage of the network, where a series of fully connected layers refine the position of the bounding box, the class of the object it contains and its mask.

Moreover, assuming the camera to be static and, therefore, the position of the doors to be fixed over time, this project exploits two distinct models: one for door detection only and the other for people detection. Door detection is applied just in the starting phase of the framework while, from then on, people detection is performed. The process of generating the two models and the related results are analyzed below.

3.1.1. Door detection

To provide door detection, Mask R-CNN [19] was fine-tuned with a dedicated dataset, assembled for the purpose. It includes a selection of 2773 out of 3000 RGB images of the DeepDoors2 dataset [24], which is freely available online. These images represent one or multiple doors in different outdoor and indoor scenarios, which do not necessarily correspond to a corridor: in fact, the large majority of them represent doors from the front. They also include obstacles that partially occlude part of the doors. The annotations in the DeepDoors2 dataset are provided as additional images, each with a black background and differently coloured masks for the doors. Being interested in this project more in the portion of space occupied by the door than in the profile of the door itself, all the images are re-masked to segment exclusively the door casing. Hence, almost all images have quadrilateral-shaped masks (thus with four vertices only). Moreover, the generated annotation files are no longer encoded as images like in the original DeepDoors2 dataset, but they are fully compatible with the COCO dataset specifications [25]. In fact, the annotation files are JSON files containing: (1) references to all images, each having a unique ID, as shown in the first row of Table 1; (2) a mask and bounding box (bbox) associated with each image (second row of Table 1).

{"images": [
  {"id": 514, "width": 1080,
   "height": 1920, "file_name": "frame.jpg"},
  ...
]}

{"annotations": [
  {"id": 519, "iscrowd": 0, "image_id": 514, "category_id": 1,
   "segmentation": [[587.52, ..., 1097.77]],
   "bbox": [467.20, 581.407, 295.90, 809.02],
   "area": 121068.87},
  ...
]}

Table 1: An example of the formatting of JSON files containing image annotations according to COCO specifications. The first row shows the data structure used to list all the images in the dataset; the second row shows the one used to specify the annotations associated with each image, including the mask ("segmentation") and the bounding box ("bbox"). The "category_id" field is always set to 1, as there is only one category (door or person, depending on the dataset).

The dataset is split into training, validation and test sets. These subsets are disjoint; the training set contains 70% (1941) of the images, while the remaining 30% is equally divided between the validation and test sets (416 each).

With the new dataset available, called Dppl, we fine-tuned the model pre-trained with the COCO dataset, which is available on the framework's GitHub repository. Consequently, ResNet101 was used as the backbone, and training was done in the same manner as by the framework's authors. In particular, we trained the head only for the first ten epochs; for the following thirty epochs, we fine-tuned stages four and above of the backbone too; finally, in the last ten epochs, we extended the training to the entire network. Unlike [19], the learning rate is initially set to 0.001 (rather than 0.02) to keep the weights from exploding; moreover, it is divided by a factor of 10 during phases two and three of the training. The other parameters are left unchanged, such as the weight decay of 0.0001 and momentum of 0.9. Finally, mini-masks were used (i.e., the masks were resized to 56x56 px) to lessen the risk of memory problems. Data augmentation (horizontal flipping) was also applied. Figure 2 shows the training and validation losses obtained during training.

[Figure 2: Training and validation losses during training with the Dppl dataset.]

On the test set, the AP metric was used to assess the quality of the results produced by the training. AP, the acronym for Average Precision, computes the average precision value for recall values from 0 to 1. In practice, AP is computed as the mean of precision values at a set of R equally spaced recall levels, as defined by the following formula:

    AP = (1/R) · Σ_{r ∈ {0, ..., 1}} p_interp(r)

where, given p(·) the precision, p_interp(r) = max_{r̃ : r̃ ≥ r} p(r̃), and R = 101 in COCO. AP@k stands for the average precision for an IoU (Intersection over Union, i.e., how much the predicted mask overlaps with the ground truth) threshold of k. More specifically, in the computation of AP@k, an estimated mask is considered a true positive if its IoU is greater than or equal to k, and a false positive otherwise.

The primary challenge metric for the COCO dataset is AP@[.50:.05:.95] (usually referred to simply as AP), which is the average AP for IoU thresholds from 0.5 to 0.95 with a step size of 0.05. This metric is also used to evaluate the results on our test set. In particular, with the Dppl dataset and the training procedure described above, we obtained an AP of 85.7 and an AP@.75 of 95.8. We also report the Average Accuracy, which is calculated by counting how many pixels out of those belonging to a specific area are correctly classified. In this case, rather than the whole image, the considered area is the smallest rectangular portion of the image that contains both the ground-truth mask and the one produced by the model. In numerical terms, we obtained an Average Accuracy of 95.34% in the case of Door Detection.

Figure 3 displays the situation in a corridor not included in the dataset: the door on the right, which is particularly "thinned" by the perspective, is indeed not detected. Precisely for this reason, the framework provides a specific graphical interface that allows adding new door positions, as shown in Section 4.3.

[Figure 3: In this example, door detection is performed correctly with two of the three instances. Masks are shown in light red, while the center of each door is shown as a red dot.]

3.1.2. People detection

Similarly to what was done with the doors, a model for people detection is also generated. Mask R-CNN with the weights of COCO is already able, on its own, to detect and segment people with acceptable accuracy. However, fine-tuning was done using a dedicated dataset built specifically for the occasion from videos captured along a hallway. More in detail, the dataset contains 793 frames captured in a corridor by a 1080x1920 px resolution camera that was positioned a few centimeters from the ceiling (approximately 2.9 meters from the floor) with a vertical image layout. In the scene, six people appear walking down the hallway and entering/exiting the adjoining rooms.
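Both the Dppl and dPPL datasets share the COCO-style annotation layout shown in Table 1. As a minimal sketch of how such a file can be read (function and variable names are illustrative, not part of the project's code):

```python
import json
from collections import defaultdict

def load_coco_annotations(path):
    """Read a COCO-style annotation file (as in Table 1) and return a
    mapping image_id -> list of (bbox, segmentation) pairs."""
    with open(path) as f:
        data = json.load(f)
    by_image = defaultdict(list)
    for ann in data.get("annotations", []):
        # "bbox" is [x, y, width, height]; "segmentation" holds the
        # polygon vertices of the mask (four vertices for most doors).
        by_image[ann["image_id"]].append((ann["bbox"], ann["segmentation"]))
    return by_image
```

Grouping annotations by `image_id` mirrors the split between the "images" and "annotations" sections of the COCO specification.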
They wear various types of clothing (including a white coat to simulate the presence of a doctor); they are of different ages and all wear face masks. One of the people has a foot cast and crutches. All frames are hand-annotated to generate high-quality masks, accurately respecting the person's shape. The related annotation files follow the COCO specifications, as described before. The split of files between training (555 images), validation (119) and test (119) sets follows the same proportions as the Dppl dataset.

With this second dataset available, called dPPL, we once again fine-tuned the model pre-trained with the COCO dataset. All of Mask R-CNN's parameters are kept the same, but Gamma Contrast is used as a data augmentation technique in conjunction with horizontal flipping in this case. Figure 4 shows the graph of the training and validation losses.

[Figure 4: Training and validation losses during training with the dPPL dataset.]

As for the performance on the test set, Table 2 shows the comparative Average Precision values between the model trained only with COCO and the one obtained by fine-tuning with the dPPL dataset. This second option provides better results for both AP and AP@.75. The same applies to the Average Accuracy.

Method                         AP     AP@.75   Acc.
COCO only                      70.5   92.9     99.08%
COCO + fine-tuning on dPPL     76.3   95.5     99.74%

Table 2: Comparison between Mask R-CNN trained on COCO only and the same network fine-tuned with the dPPL dataset. AP stands for Average Precision; Acc. stands for Average Accuracy (calculated by counting how many pixels are correctly classified out of those belonging to the smallest rectangular portion of the image that contains both the ground-truth mask and the one produced by the model).

These good results should be evaluated considering the not very high number of images that compose the dataset. Indeed, environments with completely different illumination and compositions will certainly attenuate the good performance provided by this model.

3.2. People Re-identification

The detection of doors and people in the scene does not suffice to ensure accurate tracking. As mentioned above, one can use additional information extracted from the images within more or less complex systems, which may exploit appearance, movement and shape features. An example is DeepSORT [7], which uses the Kalman filter to predict the position of a person in the next frame and integrates appearance information based on a deep appearance descriptor. Despite DeepSORT being a powerful tool, the use of the Kalman Filter turns out to be less effective when the subject disappears from the camera view for long periods. Indeed, the Kalman Filter models the state estimate of the system (in this case, the position of a subject in the frame) as a Gaussian distribution whose variance strictly depends on the observations over time. When a person disappears from the scene, the degree of uncertainty increases and so does the distribution variance. Furthermore, the Kalman Filter would be practically useless if several people entered the same room: the states of those subjects would collapse into the same value, making this information useless for distinguishing a person from the others when they leave the room. Nevertheless, the solution adopted in DeepSORT for the use of appearance features turns out to be quite effective whenever the Kalman Filter is not, since it relies on visual cues. For this reason, DPPL Tracker is primarily based on appearance features, though it also takes advantage of some assumptions related to the work environment (a corridor).

In this project, Deep Cosine Metric Learning [26], the same used in DeepSORT for appearance re-identification, is employed. It applies a variation of the Softmax classifier called Cosine Softmax Classifier, which allows obtaining a different representation space in which compact clusters are formed based on the appearance features. This is achieved by first applying ℓ2 normalization, which uses the ℓ2-norm to normalize the input values so that, if squared and summed, they result in the value 1, and, secondly, by normalizing the weights. Finally, the cosine softmax classifier is applied, which is defined as follows:

    p(y_i = k | r_i) = exp(κ · w̃_kᵀ r_i) / Σ_{n=1}^{C} exp(κ · w̃_nᵀ r_i)

where κ is a free scaling parameter. Table 3 summarizes the entire network, which is made up of convolutional and residual layers. Dropout of 0.4 is used within the residual layers.

Layer              Patch Size/Stride   Output
Conv 1             3×3/1               32×128×64
Conv 2             3×3/1               32×128×64
Max pool 3         3×3/2               32×64×32
Residual 4         3×3/1               32×64×32
Residual 5         3×3/1               32×64×32
Residual 6         3×3/2               64×32×16
Residual 7         3×3/1               64×32×16
Residual 8         3×3/2               128×16×8
Residual 9         3×3/1               128×16×8
Dense 10           -                   128
ℓ2 normalization   -                   128

Table 3: Overview of the CNN architecture of the Re-ID network.

The dataset used for training the re-ID network is MARS [27], a large-scale video-based person re-identification dataset that extends the Market-1501 dataset [28]. It consists of 1261 different pedestrians, captured by at least two of the six near-synchronized cameras placed on the Tsinghua University campus. It also includes over 1 million bounding boxes and 3248 distractors to make it more realistic. The goal of the re-identification network is to provide useful information on the person's identity starting from how they appear in the image. In the case of MARS, it has to learn this information from images that also include backgrounds of different colours and patterns. To concentrate solely on the subject, we preprocessed the MARS dataset by using the Mask R-CNN network to detect people. Therefore, the result is a new dataset where each image of size 256x128 px represents a segmented person on a black background (as shown in Figure 5).

[Figure 5: Examples of the resulting images in the MARS dataset after applying object instance segmentation.]

The network has been trained for 100,000 steps, with a constant learning rate of 0.001 and a weight decay of 1×10⁻⁸; moreover, the input images are scaled to 128x64 px. The use of the masked MARS dataset proves to be beneficial for the network training, since it provides improved results according to the CMC Rank@K and mAP metrics¹, as shown in Table 4. The table also shows the results of two state-of-the-art solutions on the original MARS dataset. Both largely outperform the solution proposed in this project; however, they also use much more sophisticated methods or networks with many more parameters.

Method                            Rank1   Rank5   mAP
DCML on MARS (a)                  72.93   86.46   56.88
DCML on masked MARS (b)           75.73   90.08   60.72
B-BOT + Attention & CL loss (c)   88.6    96.2    82.9
MGH (d)                           90.0    96.7    85.8

Table 4: Comparison between Deep Cosine Metric Learning (abbreviated to DCML) on the original MARS dataset and on the masked version, and some state-of-the-art solutions. (a) Results from [26] - (b) Proposed in this project - (c) Results from [29] - (d) Results from [30]. mAP stands for mean Average Precision.

¹ Computed through the MARS evaluation tool, available at https://github.com/liangzheng06/MARS-evaluation

4. DPPL Tracker framework

People tracking is offered through a specific framework that employs Mask R-CNN and the above-mentioned re-identification network. It also provides additional features to improve the user experience and optimize the search for people.
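The masking step used both for the masked MARS images and for the framework's network inputs can be illustrated with a toy sketch (all names are hypothetical; the project itself relies on Mask R-CNN outputs and its own resizing code):

```python
import numpy as np

def mask_and_resize(frame, mask, bbox, out_hw=(128, 64)):
    """Cut the bbox region of a frame, zero out background pixels with
    the instance mask (black background, as in the masked MARS images),
    and resize to the Re-ID input size via nearest-neighbour sampling.
    bbox = (y1, x1, y2, x2); mask is a boolean array with frame shape."""
    y1, x1, y2, x2 = bbox
    # Broadcasting the 2-D mask over the colour channels keeps only
    # the pixels that belong to the person.
    crop = frame[y1:y2, x1:x2] * mask[y1:y2, x1:x2, None]
    h, w = crop.shape[:2]
    H, W = out_hw
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return crop[rows][:, cols]
```

A real pipeline would likely use a proper interpolation routine (e.g., from an imaging library); the nearest-neighbour indexing here only keeps the sketch dependency-free.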
CEUR Workshop Proceedings 51–61 following: the first frame is first passed as input to Mask Algorithm 1: Main algorithm R-CNN for doors detection. Once doors are located, that Data: 𝑚𝑎𝑠𝑘𝑅𝐶𝑁 𝑁_𝑟𝑒𝑠𝑢𝑙𝑡, 𝑓 𝑟𝑎𝑚𝑒 frame and the following ones are passed to the same Result: People identified network (with different weights) for people detection. 1 𝑐𝑢𝑟𝑟𝑒𝑛𝑡𝑙𝑦_𝑑𝑒𝑡𝑒𝑐𝑡𝑒𝑑 ← []; The portion of the image containing each person is then 2 for 𝑝𝑒𝑟𝑠𝑜𝑛 in 𝑚𝑎𝑠𝑘𝑅𝐶𝑁 𝑁_𝑟𝑒𝑠𝑢𝑙𝑡 do multiplied by the corresponding mask (to have a black 3 𝑚𝑎𝑠𝑘, 𝑏𝑏𝑜𝑥 ← 𝑝𝑒𝑟𝑠𝑜𝑛; background) and, after being resized to 128 x 64 px, is 4 𝑖𝑚𝑔𝑝𝑜𝑟𝑡𝑖𝑜𝑛 ← 𝑓 𝑟𝑎𝑚𝑒[𝑏𝑏𝑜𝑥[0] ∶ passed to the re-identification network. The latter has its 𝑏𝑏𝑜𝑥[2], 𝑏𝑏𝑜𝑥[1] ∶ 𝑏𝑏𝑜𝑥[3]]; head cut off so that it outputs an array of size 128 (gener- 5 𝑖𝑚𝑔𝑝𝑜𝑟𝑡𝑖𝑜𝑛_𝑚𝑎𝑠𝑘𝑒𝑑 ← 𝑖𝑚𝑔𝑝𝑜𝑟𝑡𝑖𝑜𝑛 ∗ 𝑚𝑎𝑠𝑘; ated by the last Dense layer). This array is a descriptor of 6 𝑖𝑑𝑒𝑛𝑡𝑖𝑓 𝑖𝑒𝑟 ← the person’s appearance and is used by the framework’s get_person_identifier(𝑖𝑚𝑔𝑝𝑜𝑟𝑡𝑖𝑜𝑛_𝑚𝑎𝑠𝑘𝑒𝑑); main algorithm to associate a unique identity ID with 7 𝑝𝑒𝑟𝑠𝑜𝑛𝐼 𝐷, 𝑟𝑜𝑜𝑚𝐼 𝐷 ← find_nearest(𝑝𝑒𝑟𝑠𝑜𝑛, each person. 𝑖𝑑𝑒𝑛𝑡𝑖𝑓 𝑖𝑒𝑟); 8 if pID == -1 then 4.1. Main algorithm 9 // New person appeared 10 else After selecting the video, the first frame is analyzed 11 // Person in the corridor or exited from a through mask-RCNN to locate the doors in the scene. room If one or more doors are not detected, the user can man- 12 end ually add additional ones, as shown in Section 4.3. Only 13 𝑐𝑢𝑟𝑟𝑒𝑛𝑡𝑙𝑦_𝑑𝑒𝑡𝑒𝑐𝑡𝑒𝑑 ← 𝑝𝑒𝑟𝑠𝑜𝑛 at that point, the analysis of the following frames begins. 14 end Pseudocode 1 shows the main steps. As previously de- 15 for 𝑝𝑒𝑟𝑠𝑜𝑛 in 𝑔𝑒𝑡_𝑝𝑒𝑜𝑝𝑙𝑒_𝑖𝑛_𝑠𝑐𝑒𝑛𝑒() do scribed, Mask R-CNN is again used to identify people, 16 if 𝑝𝑒𝑟𝑠𝑜𝑛 not in 𝑐𝑢𝑟𝑟𝑒𝑛𝑡𝑙𝑦_𝑑𝑒𝑡𝑒𝑐𝑡𝑒𝑑 then while the re-ID network provides the people appearance 17 if 𝑝𝑒𝑟𝑠𝑜𝑛 close to a room then descriptors. 
At that point, for each person, the find_near- 18 // Person entered in a room est function allows identifying the already-known closest 19 else identifier to the detected descriptor, if any. In this way, 20 // Person disappeared from the scene it is possible to determine whether that person already (may due to an occlusion) appeared in the past and, depending on their position 21 end and on the knowledge derived from past frames, a log is 22 end added to the database if they are leaving a room. If there 23 end is no similar person, the algorithm adds a new one to the scene. The final for loop finds all people who were in the environment up to the previous frame but are now missing. In this case, there are two alternatives: the per- people who last left the corridor, then moving on to all son may either have entered a room (if in the preceding the known people. The similarity between two identifiers frame they were sufficiently close to the relative door) ID𝑎 and ID𝑏 is computed with the cosine similarity, as or may have disappeared, for example, because they left follows the hallway or are temporarily occluded. To improve ID𝑎 ⋅ ID𝑏 𝑐𝑜𝑠𝑖𝑛𝑒 𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 = the efficacy of the algorithm, the framework starts track- ‖ID𝑎‖‖ID𝑏‖ ing a person when he appears entirely in the scene and Two identifiers are more similar as the cosine similar- his bounding box is at a minimum distance from the ity goes to one. Hence the need to define, for each of image edges. Furthermore, it uses the area of the bbox the listed searches, a threshold that defines when two to interrupt (temporarily or not) the tracking when an descriptors must be considered sufficiently similar (and object/person occludes the subject or when the tracked therefore belonging to the same person) or not. The person has nearly entirely entered a room. choice of the threshold heavily influences the tracking A fundamental step is the one implemented by the effectiveness. 
In the various phases a different threshold find_nearest function, shown in Pseudocode 2. It uses is used, more specifically: (1) if a person is walking along differentiated searches to find the already-known person the corridor without other people in the close vicinity with the most similar identity to the one passed as input. and, if compared to the previous frame, that person has First, it searches among the people visible in the scene in not moved too far from their previous position in the the previous frame. In case of failure, if the detection is scene, then a greater dissimilarity between the descrip- close enough to a door - according to a given threshold tors is tolerated; (2) in other cases, the threshold is set to - it searches among the people who are known to be in a value between 0.85 and 0.9. Section 5 discusses some that room. As a last chance, it starts searching among the critical issues regarding the choice of the threshold. 57 Christian Marinoni et al. CEUR Workshop Proceedings 51–61 Algorithm 2: Find nearest identity a room (Figure 7). In the latter case, the interface high- Data: 𝑖𝑑𝑒𝑛𝑡𝑖𝑓 𝑖𝑒𝑟 lights the riskiest situations (for example, if the room Result: Person id capacity has been exceeded) in addition to providing all 1 𝑐𝑢𝑟𝑟𝑒𝑛𝑡𝑙𝑦_𝑑𝑒𝑡𝑒𝑐𝑡𝑒𝑑 ← []; records linked to the entered ID. 2 if 𝑖𝑑𝑒𝑛𝑡𝑖𝑓 𝑖𝑒𝑟 in the scene then 3 // Person in the scene, return the ID 4 end 5 for 𝑑𝑜𝑜𝑟 in room do 6 if 𝑝𝑒𝑟𝑠𝑜𝑛 close to 𝑑𝑜𝑜𝑟 then 7 // Look among people inside that room 8 end 9 end 10 // Look among last detected people; 11 // Look among all people; Figure 6: The user can add multiple additional doors through 4.2. Database the user interface. The position of the center of the new door is shown by a red dot, while its height by a dashed line with Whenever a person enters and leaves a room, a corre- two blue dots at the ends. sponding log is added to the database. 
Each log has the following structure:

frameID personID roomID "in/out/new"

where frameID is an incremental value representing the currently processed frame, personID is a unique integer associated with a person (different from the identifier representing the way that person looks in the scene), and roomID is the ID of the room the person is entering or leaving, if any; it is equal to −1 otherwise. The last field has the value "in" or "out" when roomID is different from −1, while it assumes the value "new" when a new person appears in the scene. For simplicity, the database is implemented as a simple CSV file containing all the logs, but more complex and scalable solutions (such as NoSQL) are also possible. Knowing the video framerate, the framework derives an estimate of the time spent in a room, to highlight possibly dangerous situations. The same is done by counting the number of people in the same room and alerting when the maximum capacity is exceeded.

Figure 7: Through the user interface, the user can visually see a list of the rooms a particular ID has entered.

4.3. GUI

A simple user interface, implemented with the PySimpleGUI library, is also available to provide the user with more flexible interaction with the framework. The user can select a file or directory containing the needed frame images, as well as add new doors that Mask R-CNN did not detect. In this second case (shown in Figure 6), by using a simple library such as Matplotlib, it is possible to offer real-time feedback on the location of the new doors and their heights (used by the algorithm). Finally, at the end of the processing of all frames, the user can search all the times a particular ID has entered and left a room (Figure 7). In the latter case, the interface highlights the riskiest situations (for example, if the room capacity has been exceeded) in addition to providing all records linked to the entered ID.

5. Analysis and results

The behaviour of the framework is evaluated in two different setups of incremental difficulty. In the first setup, people walk down a corridor one after the other, in a perfect flow that limits the occasions when two or more people are simultaneously in the same room. This modality allows focusing mainly on inter-frame re-identification and on the correct detection of people entering and leaving the rooms. In the second setup, multiple people can enter the same room. The challenge, in this case, is to correctly identify a person when they leave the room. The results show that the algorithm can handle a wide range of situations with ease, producing results that are similar - if not identical - to the ground truth.

First of all, it is beneficial to analyze how accurately the framework can detect the presence of one or more people in the scene. To calculate the overall accuracy of the detections we used two methods. The first consists of considering only those frames in which a person is shown entirely (i.e., not hidden - even partially - by objects or other people). The second is to consider all frames, including all borderline cases in which only a portion of a person's arm or leg appears in the frame. Figure 8 shows an example of the frames considered with both methods. The results - obviously better in numerical terms in the first case - are shown in Table 5.

Table 5
The accuracy of people detection computed with two methods. With the first method, we considered only those frames in which people's bodies are shown wholly in the image; the second method also includes frames in which a person is only partially visible.

Method     Overall (Detection) Accuracy
Method 1   100%
Method 2   91.76%

Figure 8: The frame on the left is an example of those considered with Method 1 for calculating the Overall Detection Accuracy: the person's body is entirely included in the scene. The frame on the right is instead an example of those considered with Method 2, which also takes into account all borderline cases in which only a portion of a person's arm or leg appears in the frame. In this case, the two people in the scene are only partially visible, and the arm of the uppermost person is not detected by the model.

Having ascertained that the framework can detect the presence of people with good reliability, we then move on to analyze the accuracy of people tracking. In particular, the inter-frame re-identification of a person in the scene scores 100% accuracy, even in the case of several people in the corridor; the same happens when a person leaves a room, even when more than one is inside it. The criticalities are mainly two: (1) the difficulty of defining an efficient threshold for the cosine similarity, since the method adopted is susceptible to sudden changes in the person's pose (such as front and rear views of the person); (2) the influence of the quality of the masks produced by Mask R-CNN on the re-identification network. A sudden change in the portion of the image taken into consideration (even without sudden movements of the subject) can reduce the cosine similarity.

Cosine similarity can be a powerful tool for guiding the re-identification task: limiting the search to the people inside the room and using the cosine similarity always leads to correct identifications. Nevertheless, the weaknesses listed above heavily reduce its effectiveness when it is necessary to recognize a person who had previously left the corridor (without entering any room) and who reappears later on. Indeed, the choice of a high threshold (i.e., ≥ 0.9) makes it difficult to assign the same ID in this situation, because the person will usually reappear in a completely different pose (for example, from behind rather than from the front), which reduces the value of the cosine similarity. In this case, there will be no ID switches between different people, but each time one reappears in the scene they will be assigned a new ID. On the contrary, lowering the threshold facilitates ID switches, creating cascading problems in the framework (an ID already assigned - even if incorrectly - to a person will not be re-assigned as long as that person is in the scene, not even if the one it was originally assigned to reappears). However, these problems do not affect the recognition of people leaving the rooms: the identifier produced by the Re-ID network and the similarity computed with the cosine similarity are sufficient for the correct attribution of the ID. Compared to the baseline (the Re-ID network trained on the original MARS dataset), it can be observed that the cosine similarity of the same person in two different situations (frames) is greater (by 1-2%) when assessed with our method.

As a final benchmark, the accuracy of the logs (seen as the ratio of the logs equal to those of the ground truth over the total number of logs) produced in the tests is equal to 50%. The accuracy goes up to 84% if we also include those logs with labels "in" and "out" that differ from the ground truth only in the person ID (but only if that ID is a new one, and therefore if there is no ID switch with a previously known identity). When a person enters a room, the relative log at the exit is always correct, as already mentioned above. As for performance, an Nvidia Tesla K80 is capable of processing 1.4-1.5 frames per second.
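As a rough illustration of how the CSV logs described in Section 4.2 could be post-processed to flag long stays and over-capacity rooms, consider the following sketch. The function name, tuple layout, framerate value, and stay-time limit are assumptions for illustration, not part of the framework.

```python
from collections import defaultdict

# Hypothetical framerate of the processed video; the framework derives
# stay durations from the frame counter and the known framerate.
FPS = 25

def analyse_logs(rows, max_capacity, max_stay_s=900):
    """rows: iterable of (frameID, personID, roomID, label) tuples, as in
    the CSV log format of Section 4.2 ("new" rows are ignored here).

    Returns (long_stays, over_capacity):
      long_stays    - (person, room, seconds) for stays above max_stay_s
      over_capacity - (frame, room) whenever a room exceeds its capacity
    """
    entered = {}                  # (person, room) -> entry frame
    occupancy = defaultdict(set)  # room -> people currently inside
    long_stays, over_capacity = [], []
    for frame, person, room, label in rows:
        frame = int(frame)
        if label == "in":
            entered[(person, room)] = frame
            occupancy[room].add(person)
            if len(occupancy[room]) > max_capacity.get(room, 1):
                over_capacity.append((frame, room))
        elif label == "out":
            occupancy[room].discard(person)
            start = entered.pop((person, room), frame)
            stay = (frame - start) / FPS
            if stay > max_stay_s:
                long_stays.append((person, room, stay))
    return long_stays, over_capacity
```

Because each log row is self-contained, this kind of analysis can run entirely offline on the CSV file, independently of the tracking pipeline.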
We also ran a test in a setup with slightly different specifications: the recording device was placed at eye level, tilted almost parallel to the floor, and with an image ratio of 16:9. The results obtained are comparable to those indicated above, although tracking people in areas very distant from the camera (and therefore at lower resolution) turns out to be more critical. Under these conditions, it is quite easy for two different subjects to appear very similar even to the human eye. An example is shown in Figure 9. Ultimately, the framework is most effective when the distance to the doors is not excessively large.

Figure 9: The figure shows two different people who nonetheless appear practically identical. Their appearance descriptors are therefore very similar, and this leads the framework to a wrong ID attribution when one of the two leaves the room.

6. Conclusion

DPPL Hallway Tracker turns out to be a good starting point for developing a framework capable of tracking people entering and leaving multiple rooms. The use of a re-ID network that exploits the masks produced in the detection and segmentation phase leads, even in the tests performed, to improvements in identification.

A project extension might address some of the remaining issues: (1) the enrichment of the datasets of people and doors could lead to better detection in several more challenging contexts; for example, as discussed above, the detection and segmentation of doors "thinned" by perspective remains difficult; (2) using a dynamic threshold and investigating complementary solutions to the re-identification network could alleviate the difficulty of assigning the same ID to a person who reappears in the corridor without leaving a room. The study of solutions for tracing people entering and leaving rooms is of great importance for the application developments it can have. It not only allows contact tracing in the event of pandemics but can also be used in other contexts, such as the analysis of the movements of patients and medical operators and the optimization of hospital wards.

References

[1] V. Alfano, S. Ercolano, The efficacy of lockdown against covid-19: a cross-country panel analysis, Applied Health Economics and Health Policy 18 (2020) 509–517.
[2] S. Pepe, S. Tedeschi, N. Brandizzi, S. Russo, L. Iocchi, C. Napoli, Human attention assessment using a machine learning approach with gan-based data augmentation technique trained using a custom dataset, OBM Neurobiology 6 (2022). doi:10.21926/obm.neurobiol.2204139.
[3] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis of pre and post covid-19 pandemic Rorschach test data using EM algorithms and GMM models, volume 3360, 2022, pp. 55–63.
[4] V. Marcotrigiano, G. D. Stingi, S. Fregnan, P. Magarelli, P. Pasquale, S. Russo, G. B. Orsi, M. T. Montagna, C. Napoli, C. Napoli, An integrated control plan in primary schools: Results of a field investigation on nutritional and hygienic features in the Apulia region (southern Italy), Nutrients 13 (2021). doi:10.3390/nu13093006.
[5] G. De Magistris, M. Romano, J. Starczewski, C. Napoli, A novel dwt-based encoder for human pose estimation, volume 3360, 2022, pp. 33–40.
[6] M. Bano, C. Arora, D. Zowghi, A. Ferrari, The rise and fall of covid-19 contact-tracing apps: when nfrs collide with pandemic, in: 2021 IEEE 29th International Requirements Engineering Conference (RE), 2021, pp. 106–116. doi:10.1109/RE51729.2021.00017.
[7] N. Wojke, A. Bewley, D. Paulus, Simple online and realtime tracking with a deep association metric, in: 2017 IEEE International Conference on Image Processing (ICIP), IEEE, 2017, pp. 3645–3649.
[8] A. Alfarano, G. De Magistris, L. Mongelli, S. Russo, J. Starczewski, C. Napoli, A novel convmixer transformer based architecture for violent behavior detection, 14126 LNAI (2023) 3–16. doi:10.1007/978-3-031-42508-0_1.
[9] B. Yang, R. Nevatia, Multi-target tracking by online learning of non-linear motion patterns and robust appearance models, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1918–1925. doi:10.1109/CVPR.2012.6247892.
[10] A. Bewley, Z. Ge, L. Ott, F. Ramos, B. Upcroft, Simple online and realtime tracking, in: 2016 IEEE International Conference on Image Processing (ICIP), 2016. doi:10.1109/ICIP.2016.7533003.
[11] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems 28 (2015) 91–99.
[12] R. E. Kalman, A new approach to linear filtering and prediction problems (1960).
[13] V. Rabaud, S. Belongie, Counting crowded moving objects, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 1, 2006, pp. 705–711. doi:10.1109/CVPR.2006.92.
[14] C. Labit-Bonis, J. Thomas, F. Lerasle, F. Madrigal, Fast tracking-by-detection of bus passengers with siamese cnns, in: 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2019, pp. 1–8. doi:10.1109/AVSS.2019.8909843.
[15] C.-H. Chen, Y.-C. Chang, T.-Y. Chen, D.-J. Wang, People counting system for getting in/out of a bus based on video processing, in: 2008 Eighth International Conference on Intelligent Systems Design and Applications, volume 3, 2008, pp. 565–569. doi:10.1109/ISDA.2008.335.
[16] J.-W. Perng, T.-Y. Wang, Y.-W. Hsu, B.-F. Wu, The design and implementation of a vision-based people counting system in buses, in: 2016 International Conference on System Science and Engineering (ICSSE), 2016, pp. 1–3. doi:10.1109/ICSSE.2016.7551620.
[17] S. A. Velastin, R. Fernández, J. E. Espinosa, A. Bay, Detecting, tracking and counting people getting on/off a metropolitan train using a standard video camera, Sensors 20 (2020). doi:10.3390/s20216251.
[18] S. D. Pore, B. F. Momin, Bidirectional people counting system in video surveillance, in: 2016 IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT), 2016, pp. 724–727. doi:10.1109/RTEICT.2016.7807919.
[19] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[20] F. Bonanno, G. Capizzi, S. Coco, C. Napoli, A. Laudani, G. L. Sciuto, Optimal thicknesses determination in a multilayer structure to improve the spp efficiency for photovoltaic devices by a hybrid fem-cascade neural network based approach, 2014, pp. 355–362. doi:10.1109/SPEEDAM.2014.6872103.
[21] F. Bonanno, G. Capizzi, G. L. Sciuto, C. Napoli, Wavelet recurrent neural network with semi-parametric input data preprocessing for micro-wind power forecasting in integrated generation systems, 2015, pp. 602–609. doi:10.1109/ICCEP.2015.7177554.
[22] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[23] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, 2017. arXiv:1612.03144.
[24] J. Ramôa, V. Lopes, L. Alexandre, S. Mogo, Real-time 2d–3d door detection and state classification on a low-power device, SN Applied Sciences 3 (2021). doi:10.1007/s42452-021-04588-3.
[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[26] N. Wojke, A. Bewley, Deep cosine metric learning for person re-identification, in: IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018. URL: https://elib.dlr.de/116408/.
[27] MARS: A Video Benchmark for Large-Scale Person Re-identification, Springer, 2016.
[28] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person re-identification: A benchmark, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[29] P. Pathak, A. E. Eshratifar, M. Gormish, Video person re-id: Fantastic techniques and where to find them, 2019. arXiv:1912.05295.
[30] Y. Yan, J. Qin, J. Chen, L. Liu, F. Zhu, Y. Tai, L. Shao, Learning multi-granular hypergraphs for video-based person re-identification, 2021. arXiv:2104.14913.