DPPL Hallway Tracker: Hospital Contact Tracing During the COVID-19 Pandemic

Christian Marinoni¹, Valerio Ponzi²,³ and Danilo Comminiello¹

¹ Dpt. of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, Via Eudossiana 18, Roma, 00184, Italy
² Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, Roma, 00185, Italy
³ Institute for Systems Analysis and Computer Science, Italian National Research Council, Via dei Taurini 19, Roma, 00185, Italy

Abstract
During the COVID-19 pandemic, the use of a people tracking system could have been crucial, particularly in sensitive environments, such as hospitals. DPPL Hallway Tracker is a framework that uses security camera footage to determine which rooms in a corridor a person has entered. It generates a database containing all the people identified and allows quick identification of potential cases of infection based on the time spent in a room and its maximum capacity. DPPL Hallway Tracker is structured in two phases: detection and re-identification. In the first phase, it exploits Mask R-CNN to identify people and room doors. In the second one, it uses the deep association metric model from DeepSORT to re-identify a person as they leave a room.

Keywords
People Tracking, COVID-19 tracking systems

SYSTEM 2023: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
christian.marinoni@uniroma1.it (C. Marinoni); ponzi@diag.uniroma1.it (V. Ponzi); danilo.comminiello@uniroma1.it (D. Comminiello)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073, pages 51–61.

1. Introduction

Managing a pandemic has proved to be a difficult challenge despite the technological developments of the past decades. Containment measures based on restrictions on personal mobility (such as lockdowns) have proved to be very effective for infection containment [1, 2, 3]. However, these turn out to be short-term solutions that are not extendable throughout the whole virus's life cycle. As with COVID-19, the presence of a potentially infected individual in a closed environment is a central problem, and the risk of contagion increases with exposure time. Face masks, in combination with good room ventilation, help to reduce the risk of transmission; however, they are not sufficient to eliminate all risks. Tracking operations are required to ensure the identification of the chain of contacts and the estimation of the relative risk of contagion. Tracking turns out to be even more essential in public settings, such as public offices and hospitals [4, 5].

Some countries, such as Italy and Germany, used specific tracking apps (respectively, Immuni and Corona-Warn-App) for Bluetooth-based contact estimation [6]. These solutions, although potentially effective, have shown evident limitations, such as low diffusion in the population, constraints on the version of the smartphone OS, poor estimation of distances and related false positives. While they may be effective in the short term, since they are employable on a large scale, other solutions prove to be more effective in the long run. Among these, the security cameras already installed in many public and private contexts can represent an excellent solution in terms of scalability and minimum requirements for the citizen. Indeed, they allow for the estimation of people's distances as well as the detection of room entrances and exits.

This project aims to create an offline framework for tracing the entrances and exits of people in one or multiple rooms facing a hallway. In this way, it is possible to extract valuable information for estimating the risk of infection, such as the duration of the stay and the level of saturation of the room given its maximum capacity. The methodology described relies solely on Deep Learning solutions, and it employs two networks to detect doors and people and assign them appearance descriptors. A specific algorithm is in charge of tracking people's movements, exploiting the characterization of the hallway environment and the descriptors generated. In particular, unlike other solutions that exploit motion features to determine a distribution of the positions where a subject can be in the next frame [7, 8, 9], this project - named DPPL Hallway Tracker - uses only appearance features. A person is first identified in the scene and segmented using Mask R-CNN; then, their mask is passed to a Re-ID network to obtain an identifier (an array) that "describes" the way they appear in the scene. The descriptors are finally compared with those of the people already known to verify the person's identity. Another contribution, in addition to the general approach adopted, is the use of three new datasets to fine-tune the networks, built from scratch or starting from existing ones.

DPPL Hallway Tracker appears to be very effective in tracking people entering and leaving rooms facing a corridor. The use of appearance features turns out to be sufficiently robust to allow correct identification, even if it is less effective in recognizing people who reappear in the corridor without leaving a room. This report describes the project's workflow, from the description of the datasets to the analysis of the results.

2. Related works

The object tracking problem is one of the classic problems in Computer Vision. Being able to determine the position of an object, even in the presence of partial or total occlusions, can be beneficial in many contexts, such as automated surveillance, video indexing, human-computer interaction, traffic monitoring, vehicle navigation and many others. A solution to the object tracking problem should manage multiple complexities: the loss of information caused by the projection of the 3D world onto a 2D image, the complexity of the movement of objects, the presence of occlusions and changes in the scene illumination can make this task highly challenging.

The approaches can be divided into several categories based on their implementation and conceptual characteristics. In this Section, some solutions based on the "tracking-by-detection" strategy are mentioned. This strategy consists in performing type-specific object detection or motion detection and then conducting (sequential or batch) tracking to link detection hypotheses into actual trajectories.

An example of an application is the one proposed by Bewley et al. [10], known as SORT (Simple Online and Realtime Tracking). It uses CNN-based detection - more specifically, Faster R-CNN [11] - to identify people in the scene. At that point, SORT associates a state x = [u, v, s, r, u̇, v̇, ṡ]ᵀ with each target, where u and v represent the horizontal and vertical pixel location of the centre of the target, s and r are the scale (area) and the constant aspect ratio of the target's bounding box and, finally, u̇, v̇, ṡ are the corresponding first derivatives (velocities) of u, v and s. The state gets updated at every new frame based on the related new detection within a Kalman Filter framework [12].

A related work is DeepSORT [7]. It expands the SORT framework by providing a re-identification network that takes as input the portion of the image showing the person and returns an appearance descriptor (a vector of size 128). This vector makes it easier to correctly assign identities to people by reducing the number of inter-frame ID switches.

SORT and DeepSORT, as well as other methods that use motion features, are effective tools for people tracking; however, they are not the best option in the case of people entering and leaving rooms. Indeed, the states of multiple people entering the same room collapse to the same value, thus providing no valuable information for the ID attribution when a person leaves the room. On the contrary, the use of a re-identification network based on appearance features in DeepSORT is functional for the current application and is therefore also implemented in this project.

In today's literature, to the best of our knowledge, there are no studies aimed at analyzing the specific context of tracking and re-identifying people who enter and leave rooms. Pedestrians on streets or people moving around indoors are usually the focus of most approaches. Other works specialize in counting people in particular environments. For example, Rabaud and Belongie [13] investigate the possibility of counting people passing through crowded environments; [14], [15], [16] focus on counting passengers getting in/out of a bus and [17] of a metropolitan train; [18] counts people walking through a corridor or a door, without taking into account their identities.

The absence of a similar application makes the comparison between the implementation proposed in this project and a baseline more complex. Therefore, in the following Sections, the individual modules that constitute it are compared with corresponding existing solutions, in an attempt to offer an objective yardstick for the choices made.

3. People and Door detection

The fundamental principle behind this project is the search for practical but effective solutions for tracking people entering and leaving rooms. As said in Section 2, in the "tracking-by-detection" strategy the first main challenge is object detection, i.e., producing a bounding box (and, eventually, a mask) for both people and doors in the image. The framework can thereby determine the position of a person at each frame and their relative distance from the doors detected in the scene. This Section describes the datasets used, as well as the implementation choices and the results obtained.

3.1. Object semantic segmentation

In order to obtain people tracking, it is crucial to identify the position of people and doors to understand which room they enter and leave. There are generally two ways to accomplish this task: object detection and image segmentation. Object detection focuses on defining the position of objects in an image, whereas image segmentation locates an object and defines a mask of pixels that represents it. This project exploits the second one - and, more specifically, its subclass known as instance segmentation - because of the benefits it provides in the re-identification task. More specifically, it employs the Mask R-CNN framework [19, 20], which derives from Faster R-CNN [11, 21] (in turn, one of the evolutions of the original R-CNN [22]) but adds a third parallel head used to generate the masks. It also introduces further improvements, like the support for pixel-to-pixel alignment between network inputs and outputs (RoI-Align). Figure 1 shows the different stages that characterize the network.

[Figure 1: General scheme of the Mask R-CNN framework. The layers indicated with the letters C and P are convolutional layers that represent the backbone network. The classic pyramid architecture improves the detection of objects of various sizes.]

Initially, the image is passed as input to a convolution-based Feature Pyramid Network [23], which has the task of extracting meaningful information from differently-sized feature maps. An object can appear in the foreground (and therefore very large in the image) or further away from the camera; hence, this pyramidal structure facilitates its detection. The features thus extracted are passed to the Region Proposal Network (RPN), which produces several Regions Of Interest (RoI), each with its bounding box. At this point, the aforementioned RoI-Align is applied and its result is passed to the second stage of the network, where a series of fully connected layers refine the position of the bounding box, the class of the object it contains and its mask.

Moreover, assuming the camera to be static and, therefore, the position of the doors to be fixed over time, this project exploits two distinct models: one for door detection only and the other for people detection. Door detection is applied just in the starting phase of the framework while, from then on, people detection is performed. The process of generating the two models and the related results are analyzed below.

3.1.1. Door detection

To provide door detection, Mask R-CNN [19] was fine-tuned with a dedicated dataset, assembled for the purpose. It includes a selection of 2773 out of 3000 RGB images of the DeepDoors2 dataset [24], which is freely available online. These images represent one or multiple doors in different outdoor and indoor scenarios, which do not necessarily correspond to a corridor: in fact, the large majority of them represent doors from the front. They also include obstacles that partially occlude part of the doors. The annotations in the DeepDoors2 dataset are provided as additional images, each with a black background and differently coloured masks for the doors. Being interested in this project more in the portion of space occupied by the door than in the profile of the door itself, all the images are re-masked to segment exclusively the door casing. Hence, almost all images have quadrilateral-shaped masks (thus with four vertices only). Moreover, the generated annotation files are no longer encoded as images like in the original DeepDoors2 dataset, but they are fully compatible with the COCO dataset specifications [25]. In fact, the annotation files are JSON files containing: (1) references to all images, each having a unique ID, as shown in the first row of Table 1; (2) a mask and bounding box (bbox) associated with each image (second row of Table 1).

{"images": [
  {"id": 514, "width": 1080,
   "height": 1920, "file_name": "frame.jpg"},
  ...
]}

{"annotations": [
  {"id": 519, "iscrowd": 0, "image_id": 514, "category_id": 1,
   "segmentation": [[587.52, ..., 1097.77]],
   "bbox": [467.20, 581.407, 295.90, 809.02],
   "area": 121068.87},
  ...
]}

Table 1: An example of the formatting of JSON files containing image annotations according to COCO specifications. The first row shows the data structure used to list all the images in the dataset; the second row shows the one used to specify the annotations associated with each image, including the mask ("segmentation") and the bounding box ("bbox"). The "category_id" field is always set to 1, as there is only one category (door or person, depending on the dataset).

The dataset is split into training, validation and test sets. These subsets are disjoint; the training set contains 70% (1941) of the images, while the remaining 30% is equally divided between the validation and test sets (416 each).

With the new dataset available, called Dppl, we fine-tuned the model pre-trained with the COCO dataset, which is available on the framework's GitHub repository. Consequently, ResNet101 was used as the backbone, and training was done in the same manner as by the framework's authors. In particular, we trained the head only for the first ten epochs; for the following thirty epochs, we fine-tuned stages four and above of the backbone too; finally, in the last ten epochs, we extended the training to the entire network. Unlike [19], the learning rate is initially set to 0.001 (rather than 0.02) to keep the weights from exploding; moreover, it is divided by a factor of 10 during phases two and three of the training. The other parameters are left unchanged, such as the weight decay of 0.0001 and momentum of 0.9. Finally, mini-masks were used (i.e., the masks were resized to 56x56 px) to lessen the risk of memory problems. Data augmentation (horizontal flipping) was also applied. Figure 2 shows the training and validation losses obtained during training.

[Figure 2: Training and validation losses during training with the Dppl dataset.]

On the test set, the AP metric was used to assess the quality of the results produced by the training. AP, the acronym for Average Precision, computes the average precision value for recall values from 0 to 1. In practice, AP is computed as the mean of precision values at a set of R equally spaced recall levels, as defined by the following formula:

    AP = (1/R) · Σ_{r ∈ {0, ..., 1}} p_interp(r)

where, given p(·) the precision, p_interp(r) = max_{r̃ : r̃ ≥ r} p(r̃), and R = 101 in COCO. AP@k stands for the average precision for an IoU (Intersection over Union, i.e., how much the predicted mask overlaps with the ground truth) threshold of k. More specifically, in the computation of AP@k, an estimated mask is considered a true positive if its IoU is greater than or equal to k, and a false positive otherwise.

The primary challenge metric for the COCO dataset is AP@[.50:.05:.95] (usually referred to simply as AP), which is the average AP for IoU thresholds from 0.5 to 0.95 with a step size of 0.05. This metric is also used to evaluate the results on our test set. In particular, with the Dppl dataset and the training procedure described above, we obtained an AP of 85.7 and an AP@.75 of 95.8. We also report the Average Accuracy, which is calculated by counting how many pixels out of those belonging to a specific area are correctly classified. In this case, rather than the whole image, the considered area is the smallest rectangular portion of the image that contains both the ground-truth mask and the one produced by the model. In numerical terms, we obtained an Average Accuracy of 95.34% in the case of Door Detection.

Figure 3 displays the situation in a corridor not included in the dataset: the door on the right, which is particularly "thinned" by the perspective, is indeed not detected. Precisely for this reason, the framework provides a specific graphical interface that allows adding new door positions, as shown in Section 4.3.

[Figure 3: In this example, door detection is performed correctly with two of the three instances. Masks are shown in light red, while the center of each door is shown as a red dot.]

3.1.2. People detection

Similarly to what was done with the doors, a model for people detection is also generated. Mask R-CNN with the weights of COCO is already able, on its own, to detect and segment people with acceptable accuracy. However, fine-tuning was done using a dedicated dataset built specifically for the occasion from videos captured along a hallway. More in detail, the dataset contains 793 frames captured in a corridor by a 1080x1920 px resolution camera that was positioned a few centimeters from the ceiling (approximately 2.9 meters from the floor) with a vertical image layout. In the scene, six people appear walking down the hallway and entering/exiting the adjoining rooms.
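Both the Dppl and dPPL datasets share the COCO-style annotation layout shown in Table 1. As a minimal sketch of how such a file can be read (function and variable names are illustrative, not part of the project's code):

```python
import json
from collections import defaultdict

def load_coco_annotations(path):
    """Read a COCO-style annotation file (as in Table 1) and return a
    mapping image_id -> list of (bbox, segmentation) pairs."""
    with open(path) as f:
        data = json.load(f)
    by_image = defaultdict(list)
    for ann in data.get("annotations", []):
        # "bbox" is [x, y, width, height]; "segmentation" holds the
        # polygon vertices of the mask (four vertices for most doors).
        by_image[ann["image_id"]].append((ann["bbox"], ann["segmentation"]))
    return by_image
```

Grouping annotations by `image_id` mirrors the split between the "images" and "annotations" sections of the COCO specification.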
They wear various types of clothing (including a white coat to simulate the presence of a doctor); they are of different ages and all wear face masks. One of the people has a foot cast and crutches. All frames are hand-annotated to generate high-quality masks, accurately respecting the person's shape. The related annotation files follow the COCO specifications, as described before. The split of files between training (555 images), validation (119) and test (119) sets follows the same proportions as the Dppl dataset.

With this second dataset available, called dPPL, we once again fine-tuned the model pre-trained with the COCO dataset. All of Mask R-CNN's parameters are kept the same, but Gamma Contrast is used as a data augmentation technique in conjunction with horizontal flipping in this case. Figure 4 shows the graph of the training and validation losses.

[Figure 4: Training and validation losses during training with the dPPL dataset.]

As for the performance on the test set, Table 2 shows the comparative Average Precision values between the model trained only with COCO and the one obtained by fine-tuning with the dPPL dataset. This second option provides better results for both AP and AP@.75. The same applies to the Average Accuracy.

Method                         AP     AP@.75   Acc.
COCO only                      70.5   92.9     99.08%
COCO + fine-tuning on dPPL     76.3   95.5     99.74%

Table 2: Comparison between Mask R-CNN trained on COCO only and the same network fine-tuned with the dPPL dataset. AP stands for Average Precision; Acc. stands for Average Accuracy (calculated by counting how many pixels are correctly classified out of those belonging to the smallest rectangular portion of the image that contains both the ground-truth mask and the one produced by the model).

These good results should be evaluated considering the not very high number of images that compose the dataset. Indeed, environments with completely different illumination and compositions will certainly attenuate the good performance provided by this model.

3.2. People Re-identification

The detection of doors and people in the scene does not suffice to ensure accurate tracking. As mentioned above, one can use additional information extracted from the images within more or less complex systems, which may exploit appearance, movement and shape features. An example is DeepSORT [7], which uses the Kalman filter to predict the position of a person in the next frame and integrates appearance information based on a deep appearance descriptor. Despite DeepSORT being a powerful tool, the use of the Kalman Filter turns out to be less effective when the subject disappears from the camera view for long periods. Indeed, the Kalman Filter models the state estimate of the system (in this case, the position of a subject in the frame) as a Gaussian distribution whose variance strictly depends on the observations over time. When a person disappears from the scene, the degree of uncertainty increases and so does the distribution variance. Furthermore, the Kalman Filter would be practically useless if several people entered the same room: the states of those subjects would collapse into the same value, making this information useless for distinguishing a person from the others when they leave the room. Nevertheless, the solution adopted in DeepSORT for the use of appearance features turns out to be quite effective whenever the Kalman Filter is not, since it relies on visual cues. For this reason, DPPL Tracker is primarily based on appearance features, though it also takes advantage of some assumptions related to the work environment (a corridor).

In this project, Deep Cosine Metric Learning [26], the same used in DeepSORT for appearance re-identification, is employed. It applies a variation of the Softmax classifier called Cosine Softmax Classifier, which allows obtaining a different representation space in which compact clusters are formed based on the appearance features. This is achieved by first applying ℓ2 normalization, which uses the ℓ2-norm to normalize the input values so that, if squared and summed, they result in the value 1, and, secondly, by normalizing the weights. Finally, the cosine softmax classifier is applied, which is defined as follows:

    p(y_i = k | r_i) = exp(κ · w̃_kᵀ r_i) / Σ_{n=1}^{C} exp(κ · w̃_nᵀ r_i)

where κ is a free scaling parameter. Table 3 summarizes the entire network, which is made up of convolutional and residual layers. Dropout of 0.4 is used within the residual layers.

Layer              Patch Size/Stride   Output
Conv 1             3×3/1               32×128×64
Conv 2             3×3/1               32×128×64
Max pool 3         3×3/2               32×64×32
Residual 4         3×3/1               32×64×32
Residual 5         3×3/1               32×64×32
Residual 6         3×3/2               64×32×16
Residual 7         3×3/1               64×32×16
Residual 8         3×3/2               128×16×8
Residual 9         3×3/1               128×16×8
Dense 10           -                   128
ℓ2 normalization   -                   128

Table 3: Overview of the CNN architecture of the Re-ID network.

The dataset used for training the re-ID network is MARS [27], a large-scale video-based person re-identification dataset that extends the Market-1501 dataset [28]. It consists of 1261 different pedestrians, captured by at least two of the six near-synchronized cameras placed on the Tsinghua University campus. It also includes over 1 million bounding boxes and 3248 distractors to make it more realistic. The goal of the re-identification network is to provide useful information on the person's identity starting from how they appear in the image. In the case of MARS, it has to learn this information from images that also include backgrounds of different colours and patterns. To concentrate solely on the subject, we preprocessed the MARS dataset by using the Mask R-CNN network to detect people. Therefore, the result is a new dataset where each image of size 256x128 px represents a segmented person on a black background (as shown in Figure 5).

[Figure 5: Examples of the resulting images in the MARS dataset after applying object instance segmentation.]

The network has been trained for 100,000 steps, with a constant learning rate of 0.001 and a weight decay of 1×10⁻⁸; moreover, the input images are scaled to 128x64 px. The use of the masked MARS dataset proves to be beneficial for the network training, since it provides improved results according to the CMC Rank@K and mAP metrics¹, as shown in Table 4. The table also shows the results of two state-of-the-art solutions on the original MARS dataset. Both largely outperform the solution proposed in this project; however, they also use much more sophisticated methods or networks with many more parameters.

Method                            Rank1   Rank5   mAP
DCML on MARS (a)                  72.93   86.46   56.88
DCML on masked MARS (b)           75.73   90.08   60.72
B-BOT + Attention & CL loss (c)   88.6    96.2    82.9
MGH (d)                           90.0    96.7    85.8

Table 4: Comparison between Deep Cosine Metric Learning (abbreviated to DCML) on the original MARS dataset and on the masked version, and some state-of-the-art solutions. (a) Results from [26] - (b) Proposed in this project - (c) Results from [29] - (d) Results from [30]. mAP stands for mean Average Precision.

¹ Computed through the MARS evaluation tool, available at https://github.com/liangzheng06/MARS-evaluation

4. DPPL Tracker framework

People tracking is offered through a specific framework that employs Mask R-CNN and the above-mentioned re-identification network. It also provides additional features to improve the user experience and optimize the search for people.
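The masking step used both for the masked MARS images and for the framework's network inputs can be illustrated with a toy sketch (all names are hypothetical; the project itself relies on Mask R-CNN outputs and its own resizing code):

```python
import numpy as np

def mask_and_resize(frame, mask, bbox, out_hw=(128, 64)):
    """Cut the bbox region of a frame, zero out background pixels with
    the instance mask (black background, as in the masked MARS images),
    and resize to the Re-ID input size via nearest-neighbour sampling.
    bbox = (y1, x1, y2, x2); mask is a boolean array with frame shape."""
    y1, x1, y2, x2 = bbox
    # Broadcasting the 2-D mask over the colour channels keeps only
    # the pixels that belong to the person.
    crop = frame[y1:y2, x1:x2] * mask[y1:y2, x1:x2, None]
    h, w = crop.shape[:2]
    H, W = out_hw
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return crop[rows][:, cols]
```

A real pipeline would likely use a proper interpolation routine (e.g., from an imaging library); the nearest-neighbour indexing here only keeps the sketch dependency-free.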
CEUR Workshop Proceedings 51–61 following: the first frame is first passed as input to Mask Algorithm 1: Main algorithm R-CNN for doors detection. Once doors are located, that Data: 𝑚𝑎𝑠𝑘𝑅𝐶𝑁 𝑁_𝑟𝑒𝑠𝑢𝑙𝑡, 𝑓 𝑟𝑎𝑚𝑒 frame and the following ones are passed to the same Result: People identified network (with different weights) for people detection. 1 𝑐𝑢𝑟𝑟𝑒𝑛𝑡𝑙𝑦_𝑑𝑒𝑡𝑒𝑐𝑡𝑒𝑑 ← []; The portion of the image containing each person is then 2 for 𝑝𝑒𝑟𝑠𝑜𝑛 in 𝑚𝑎𝑠𝑘𝑅𝐶𝑁 𝑁_𝑟𝑒𝑠𝑢𝑙𝑡 do multiplied by the corresponding mask (to have a black 3 𝑚𝑎𝑠𝑘, 𝑏𝑏𝑜𝑥 ← 𝑝𝑒𝑟𝑠𝑜𝑛; background) and, after being resized to 128 x 64 px, is 4 𝑖𝑚𝑔𝑝𝑜𝑟𝑡𝑖𝑜𝑛 ← 𝑓 𝑟𝑎𝑚𝑒[𝑏𝑏𝑜𝑥[0] ∶ passed to the re-identification network. The latter has its 𝑏𝑏𝑜𝑥[2], 𝑏𝑏𝑜𝑥[1] ∶ 𝑏𝑏𝑜𝑥[3]]; head cut off so that it outputs an array of size 128 (gener- 5 𝑖𝑚𝑔𝑝𝑜𝑟𝑡𝑖𝑜𝑛_𝑚𝑎𝑠𝑘𝑒𝑑 ← 𝑖𝑚𝑔𝑝𝑜𝑟𝑡𝑖𝑜𝑛 ∗ 𝑚𝑎𝑠𝑘; ated by the last Dense layer). This array is a descriptor of 6 𝑖𝑑𝑒𝑛𝑡𝑖𝑓 𝑖𝑒𝑟 ← the person’s appearance and is used by the framework’s get_person_identifier(𝑖𝑚𝑔𝑝𝑜𝑟𝑡𝑖𝑜𝑛_𝑚𝑎𝑠𝑘𝑒𝑑); main algorithm to associate a unique identity ID with 7 𝑝𝑒𝑟𝑠𝑜𝑛𝐼 𝐷, 𝑟𝑜𝑜𝑚𝐼 𝐷 ← find_nearest(𝑝𝑒𝑟𝑠𝑜𝑛, each person. 𝑖𝑑𝑒𝑛𝑡𝑖𝑓 𝑖𝑒𝑟); 8 if pID == -1 then 4.1. Main algorithm 9 // New person appeared 10 else After selecting the video, the first frame is analyzed 11 // Person in the corridor or exited from a through mask-RCNN to locate the doors in the scene. room If one or more doors are not detected, the user can man- 12 end ually add additional ones, as shown in Section 4.3. Only 13 𝑐𝑢𝑟𝑟𝑒𝑛𝑡𝑙𝑦_𝑑𝑒𝑡𝑒𝑐𝑡𝑒𝑑 ← 𝑝𝑒𝑟𝑠𝑜𝑛 at that point, the analysis of the following frames begins. 14 end Pseudocode 1 shows the main steps. As previously de- 15 for 𝑝𝑒𝑟𝑠𝑜𝑛 in 𝑔𝑒𝑡_𝑝𝑒𝑜𝑝𝑙𝑒_𝑖𝑛_𝑠𝑐𝑒𝑛𝑒() do scribed, Mask R-CNN is again used to identify people, 16 if 𝑝𝑒𝑟𝑠𝑜𝑛 not in 𝑐𝑢𝑟𝑟𝑒𝑛𝑡𝑙𝑦_𝑑𝑒𝑡𝑒𝑐𝑡𝑒𝑑 then while the re-ID network provides the people appearance 17 if 𝑝𝑒𝑟𝑠𝑜𝑛 close to a room then descriptors. 
At that point, for each person, the find_near- 18 // Person entered in a room est function allows identifying the already-known closest 19 else identifier to the detected descriptor, if any. In this way, 20 // Person disappeared from the scene it is possible to determine whether that person already (may due to an occlusion) appeared in the past and, depending on their position 21 end and on the knowledge derived from past frames, a log is 22 end added to the database if they are leaving a room. If there 23 end is no similar person, the algorithm adds a new one to the scene. The final for loop finds all people who were in the environment up to the previous frame but are now missing. In this case, there are two alternatives: the per- people who last left the corridor, then moving on to all son may either have entered a room (if in the preceding the known people. The similarity between two identifiers frame they were sufficiently close to the relative door) ID𝑎 and ID𝑏 is computed with the cosine similarity, as or may have disappeared, for example, because they left follows the hallway or are temporarily occluded. To improve ID𝑎 ⋅ ID𝑏 𝑐𝑜𝑠𝑖𝑛𝑒 𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 = the efficacy of the algorithm, the framework starts track- ‖ID𝑎‖‖ID𝑏‖ ing a person when he appears entirely in the scene and Two identifiers are more similar as the cosine similar- his bounding box is at a minimum distance from the ity goes to one. Hence the need to define, for each of image edges. Furthermore, it uses the area of the bbox the listed searches, a threshold that defines when two to interrupt (temporarily or not) the tracking when an descriptors must be considered sufficiently similar (and object/person occludes the subject or when the tracked therefore belonging to the same person) or not. The person has nearly entirely entered a room. choice of the threshold heavily influences the tracking A fundamental step is the one implemented by the effectiveness. 
In the various phases a different threshold find_nearest function, shown in Pseudocode 2. It uses is used, more specifically: (1) if a person is walking along differentiated searches to find the already-known person the corridor without other people in the close vicinity with the most similar identity to the one passed as input. and, if compared to the previous frame, that person has First, it searches among the people visible in the scene in not moved too far from their previous position in the the previous frame. In case of failure, if the detection is scene, then a greater dissimilarity between the descrip- close enough to a door - according to a given threshold tors is tolerated; (2) in other cases, the threshold is set to - it searches among the people who are known to be in a value between 0.85 and 0.9. Section 5 discusses some that room. As a last chance, it starts searching among the critical issues regarding the choice of the threshold. 57 Christian Marinoni et al. CEUR Workshop Proceedings 51–61 Algorithm 2: Find nearest identity a room (Figure 7). In the latter case, the interface high- Data: 𝑖𝑑𝑒𝑛𝑡𝑖𝑓 𝑖𝑒𝑟 lights the riskiest situations (for example, if the room Result: Person id capacity has been exceeded) in addition to providing all 1 𝑐𝑢𝑟𝑟𝑒𝑛𝑡𝑙𝑦_𝑑𝑒𝑡𝑒𝑐𝑡𝑒𝑑 ← []; records linked to the entered ID. 2 if 𝑖𝑑𝑒𝑛𝑡𝑖𝑓 𝑖𝑒𝑟 in the scene then 3 // Person in the scene, return the ID 4 end 5 for 𝑑𝑜𝑜𝑟 in room do 6 if 𝑝𝑒𝑟𝑠𝑜𝑛 close to 𝑑𝑜𝑜𝑟 then 7 // Look among people inside that room 8 end 9 end 10 // Look among last detected people; 11 // Look among all people; Figure 6: The user can add multiple additional doors through 4.2. Database the user interface. The position of the center of the new door is shown by a red dot, while its height by a dashed line with Whenever a person enters and leaves a room, a corre- two blue dots at the ends. sponding log is added to the database. 
Each log has the following structure:

frameID personID roomID "in/out/new"

where frameID is an incremental value representing the currently processed frame, personID is a unique integer associated with a person (different from the identifier representing the way that person looks in the scene), and roomID is the ID of the room the person is entering or leaving, if any; it is equal to −1 otherwise. The last field has the value "in" or "out" when roomID is different from −1, while it assumes the value "new" when a new person appears in the scene. For simplicity, the database is implemented as a simple CSV file containing all the logs, but more complex and scalable solutions (such as NoSQL) are also possible. Knowing the video framerate, the framework derives an estimate of the time spent in a room, to highlight possibly dangerous situations. The same is done by counting the number of people in the same room and alerting when the maximum capacity is exceeded.

Figure 7: Through the user interface, the user can visually see a list of the rooms a particular ID has entered.

4.3. GUI

A simple user interface, implemented with the PySimpleGUI library, is also available to provide the user with more flexible interaction with the framework. The user can select a file or directory containing the needed frame images, as well as add new doors that Mask R-CNN did not detect. In this second case (shown in Figure 6), by using a simple library such as Matplotlib, it is possible to offer real-time feedback on the location of the new doors and their heights (used by the algorithm). Finally, at the end of the processing of all frames, the user can search all the times a particular ID has entered and left a room (Figure 7). In the latter case, the interface highlights the riskiest situations (for example, if the room capacity has been exceeded) in addition to providing all records linked to the entered ID.

5. Analysis and results

The behaviour of the framework is evaluated in two different setups of incremental difficulty. In the first setup, people walk down a corridor one after the other, in a perfect flow that limits the occasions when two or more people are simultaneously in the same room. This modality allows focusing mainly on inter-frame re-identification and on the correct detection of people entering and leaving the rooms. In the second setup, multiple people can enter the same room. The challenge, in this case, is to correctly identify a person when they leave the room. The results show that the algorithm can handle a wide range of situations with ease, producing results that are similar - if not identical - to the ground truth.

First of all, it is beneficial to analyze how accurately the framework can detect the presence of one or more people in the scene. To calculate the overall accuracy of the detections we used two methods. The first consists of considering only those frames in which a person is shown entirely (i.e., not hidden - even partially - by objects or other people). The second is to consider all frames, including all borderline cases in which only a portion of a person's arm or leg appears in the frame. Figure 8 shows an example of the frames considered with both methods. The results - obviously better in numerical terms in the first case - are shown in Table 5.

Table 5
The accuracy of people detection computed with two methods. With the first method, we considered only those frames in which people's bodies are shown wholly in the image; the second method also includes frames in which a person is only partially visible.

Method     Overall (Detection) Accuracy
Method 1   100%
Method 2   91.76%

Figure 8: The frame on the left is an example of those considered with Method 1 for calculating the Overall Detection Accuracy: the person's body is entirely included in the scene. The frame on the right is instead an example of those considered with Method 2, which also takes into account all borderline cases in which only a portion of a person's arm or leg appears in the frame. In this case, the two people in the scene are only partially visible, and the arm of the uppermost person is not detected by the model.

Having ascertained that the framework can detect the presence of people with good reliability, we then move on to analyze the accuracy of people tracking. In particular, the inter-frame re-identification of a person in the scene scores 100% accuracy, even in the case of several people in the corridor; the same happens when a person leaves a room, even when more than one is inside it. The criticalities are mainly two: (1) the difficulty of defining an efficient threshold for the cosine similarity, since the method adopted is susceptible to sudden changes in the person's pose (such as front and rear views of the person); (2) the influence of the quality of the masks produced by Mask R-CNN on the re-identification network. A sudden change in the portion of the image taken into consideration (even without sudden movements of the subject) can reduce the cosine similarity.

Cosine similarity can be a powerful tool for guiding the re-identification task: limiting the search to the people inside the room and using the cosine similarity always leads to correct identifications. Nevertheless, the weaknesses listed above heavily reduce its effectiveness when it is necessary to recognize a person who had previously left the corridor (without entering any room) and who reappears later on. Indeed, the choice of a high threshold (i.e., ≥ 0.9) makes it difficult to assign the same ID in this situation, because the person will usually reappear in a completely different pose (for example, from behind rather than from the front), which reduces the value of the cosine similarity. In this case, there will be no ID switches between different people, but each time one reappears in the scene they will be assigned a new ID. On the contrary, lowering the threshold facilitates ID switches, creating cascading problems in the framework (an ID already assigned - even if incorrectly - to a person will not be re-assigned as long as that person is in the scene, not even if the one it was originally assigned to reappears). However, these problems do not affect the recognition of people leaving the rooms: the identifier produced by the Re-ID network and the similarity computed with the cosine similarity are sufficient for the correct attribution of the ID. Compared to the baseline (the Re-ID network trained on the original MARS dataset), it can be observed that the cosine similarity of the same person in two different situations (frames) is greater (by 1-2%) when assessed with our method.

As a final benchmark, the accuracy of the logs (seen as the ratio of the logs equal to those of the ground truth over the total number of logs) produced in the tests is equal to 50%. The accuracy goes up to 84% if we also include those logs with labels "in" and "out" that differ from the ground truth only in the person ID (but only if that ID is a new one, and therefore if there is no ID switch with a previously known identity). When a person enters a room, the relative log at the exit is always correct, as already mentioned above. As for performance, an Nvidia Tesla K80 is capable of processing 1.4-1.5 frames per second.
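As a rough illustration of how the CSV logs described in Section 4.2 could be post-processed to flag long stays and over-capacity rooms, consider the following sketch. The function name, tuple layout, framerate value, and stay-time limit are assumptions for illustration, not part of the framework.

```python
from collections import defaultdict

# Hypothetical framerate of the processed video; the framework derives
# stay durations from the frame counter and the known framerate.
FPS = 25

def analyse_logs(rows, max_capacity, max_stay_s=900):
    """rows: iterable of (frameID, personID, roomID, label) tuples, as in
    the CSV log format of Section 4.2 ("new" rows are ignored here).

    Returns (long_stays, over_capacity):
      long_stays    - (person, room, seconds) for stays above max_stay_s
      over_capacity - (frame, room) whenever a room exceeds its capacity
    """
    entered = {}                  # (person, room) -> entry frame
    occupancy = defaultdict(set)  # room -> people currently inside
    long_stays, over_capacity = [], []
    for frame, person, room, label in rows:
        frame = int(frame)
        if label == "in":
            entered[(person, room)] = frame
            occupancy[room].add(person)
            if len(occupancy[room]) > max_capacity.get(room, 1):
                over_capacity.append((frame, room))
        elif label == "out":
            occupancy[room].discard(person)
            start = entered.pop((person, room), frame)
            stay = (frame - start) / FPS
            if stay > max_stay_s:
                long_stays.append((person, room, stay))
    return long_stays, over_capacity
```

Because each log row is self-contained, this kind of analysis can run entirely offline on the CSV file, independently of the tracking pipeline.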
We also ran a test in a setup with slightly different specifications: the recording device was placed at eye level, tilted almost parallel to the floor, and with an image ratio of 16:9. The results obtained are comparable to those indicated above, although tracking people in areas very distant from the camera (and therefore at lower resolution) turns out to be more critical. Under these conditions, it is quite easy for two different subjects to appear very similar even to the human eye. An example is shown in Figure 9. Ultimately, the framework is most effective when the distance to the doors is not excessively large.

Figure 9: The figure shows two different people who nonetheless appear practically identical. Their appearance descriptors are therefore very similar, and this leads the framework to a wrong ID attribution when one of the two leaves the room.

6. Conclusion

DPPL Hallway Tracker turns out to be a good starting point for developing a framework capable of tracking people entering and leaving multiple rooms. The use of a re-ID network that exploits the masks produced in the detection and segmentation phase leads, even in the tests performed, to improvements in identification.

A project extension might address some of the remaining issues: (1) the enrichment of the datasets of people and doors could lead to better detection in several more challenging contexts; for example, as discussed above, the detection and segmentation of doors "thinned" by perspective remains difficult; (2) using a dynamic threshold and investigating complementary solutions to the re-identification network could alleviate the difficulty of assigning the same ID to a person who reappears in the corridor without leaving a room. The study of solutions for tracing people entering and leaving rooms is of great importance for the application developments it can have. It not only allows contact tracing in the event of pandemics but can also be used in other contexts, such as the analysis of the movements of patients and medical operators and the optimization of hospital wards.

References

[1] V. Alfano, S. Ercolano, The efficacy of lockdown against covid-19: a cross-country panel analysis, Applied Health Economics and Health Policy 18 (2020) 509–517.
[2] S. Pepe, S. Tedeschi, N. Brandizzi, S. Russo, L. Iocchi, C. Napoli, Human attention assessment using a machine learning approach with gan-based data augmentation technique trained using a custom dataset, OBM Neurobiology 6 (2022). doi:10.21926/obm.neurobiol.2204139.
[3] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis of pre and post covid-19 pandemic Rorschach test data using EM algorithms and GMM models, volume 3360, 2022, pp. 55–63.
[4] V. Marcotrigiano, G. D. Stingi, S. Fregnan, P. Magarelli, P. Pasquale, S. Russo, G. B. Orsi, M. T. Montagna, C. Napoli, C. Napoli, An integrated control plan in primary schools: Results of a field investigation on nutritional and hygienic features in the Apulia region (southern Italy), Nutrients 13 (2021). doi:10.3390/nu13093006.
[5] G. De Magistris, M. Romano, J. Starczewski, C. Napoli, A novel dwt-based encoder for human pose estimation, volume 3360, 2022, pp. 33–40.
[6] M. Bano, C. Arora, D. Zowghi, A. Ferrari, The rise and fall of covid-19 contact-tracing apps: when nfrs collide with pandemic, in: 2021 IEEE 29th International Requirements Engineering Conference (RE), 2021, pp. 106–116. doi:10.1109/RE51729.2021.00017.
[7] N. Wojke, A. Bewley, D. Paulus, Simple online and realtime tracking with a deep association metric, in: 2017 IEEE International Conference on Image Processing (ICIP), IEEE, 2017, pp. 3645–3649.
[8] A. Alfarano, G. De Magistris, L. Mongelli, S. Russo, J. Starczewski, C. Napoli, A novel convmixer transformer based architecture for violent behavior detection, 14126 LNAI (2023) 3–16. doi:10.1007/978-3-031-42508-0_1.
[9] B. Yang, R. Nevatia, Multi-target tracking by online learning of non-linear motion patterns and robust appearance models, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1918–1925. doi:10.1109/CVPR.2012.6247892.
[10] A. Bewley, Z. Ge, L. Ott, F. Ramos, B. Upcroft, Simple online and realtime tracking, in: 2016 IEEE International Conference on Image Processing (ICIP), 2016. doi:10.1109/ICIP.2016.7533003.
[11] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems 28 (2015) 91–99.
[12] R. E. Kalman, A new approach to linear filtering and prediction problems (1960).
[13] V. Rabaud, S. Belongie, Counting crowded moving objects, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 1, 2006, pp. 705–711. doi:10.1109/CVPR.2006.92.
[14] C. Labit-Bonis, J. Thomas, F. Lerasle, F. Madrigal, Fast tracking-by-detection of bus passengers with siamese cnns, in: 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2019, pp. 1–8. doi:10.1109/AVSS.2019.8909843.
[15] C.-H. Chen, Y.-C. Chang, T.-Y. Chen, D.-J. Wang, People counting system for getting in/out of a bus based on video processing, in: 2008 Eighth International Conference on Intelligent Systems Design and Applications, volume 3, 2008, pp. 565–569. doi:10.1109/ISDA.2008.335.
[16] J.-W. Perng, T.-Y. Wang, Y.-W. Hsu, B.-F. Wu, The design and implementation of a vision-based people counting system in buses, in: 2016 International Conference on System Science and Engineering (ICSSE), 2016, pp. 1–3. doi:10.1109/ICSSE.2016.7551620.
[17] S. A. Velastin, R. Fernández, J. E. Espinosa, A. Bay, Detecting, tracking and counting people getting on/off a metropolitan train using a standard video camera, Sensors 20 (2020). doi:10.3390/s20216251.
[18] S. D. Pore, B. F. Momin, Bidirectional people counting system in video surveillance, in: 2016 IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT), 2016, pp. 724–727. doi:10.1109/RTEICT.2016.7807919.
[19] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[20] F. Bonanno, G. Capizzi, S. Coco, C. Napoli, A. Laudani, G. L. Sciuto, Optimal thicknesses determination in a multilayer structure to improve the spp efficiency for photovoltaic devices by a hybrid fem-cascade neural network based approach, 2014, pp. 355–362. doi:10.1109/SPEEDAM.2014.6872103.
[21] F. Bonanno, G. Capizzi, G. L. Sciuto, C. Napoli, Wavelet recurrent neural network with semi-parametric input data preprocessing for micro-wind power forecasting in integrated generation systems, 2015, pp. 602–609. doi:10.1109/ICCEP.2015.7177554.
[22] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[23] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, 2017. arXiv:1612.03144.
[24] J. Ramôa, V. Lopes, L. Alexandre, S. Mogo, Real-time 2d–3d door detection and state classification on a low-power device, SN Applied Sciences 3 (2021). doi:10.1007/s42452-021-04588-3.
[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[26] N. Wojke, A. Bewley, Deep cosine metric learning for person re-identification, in: IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018. URL: https://elib.dlr.de/116408/.
[27] MARS: A Video Benchmark for Large-Scale Person Re-identification, Springer, 2016.
[28] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person re-identification: A benchmark, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[29] P. Pathak, A. E. Eshratifar, M. Gormish, Video person re-id: Fantastic techniques and where to find them, 2019. arXiv:1912.05295.
[30] Y. Yan, J. Qin, J. Chen, L. Liu, F. Zhu, Y. Tai, L. Shao, Learning multi-granular hypergraphs for video-based person re-identification, 2021. arXiv:2104.14913.