=Paper=
{{Paper
|id=Vol-3092/p14
|storemode=property
|title=Contagion Prevention of COVID-19 by means of Touch Detection for Retail Stores
|pdfUrl=https://ceur-ws.org/Vol-3092/p14.pdf
|volume=Vol-3092
|authors=Rafał Brociek,Giorgio De Magistris,Francesca Cardia,Federica Coppa,Samuele Russo
|dblpUrl=https://dblp.org/rec/conf/system/BrociekMCCR21
}}
==Contagion Prevention of COVID-19 by means of Touch Detection for Retail Stores==
Rafał Brociek¹, Giorgio De Magistris², Francesca Cardia², Federica Coppa², Samuele Russo³

¹ Sapienza University of Rome, piazzale Aldo Moro 5, Roma 00185, Italy
² Department of Computer, Automation and Management Engineering, Sapienza University of Rome, via Ariosto 25, Roma 00185, Italy
³ Department of Psychology, Sapienza University of Rome, via dei Marsi 78, Roma 00185, Italy

SYSTEM 2021 @ Scholar's Yearly Symposium of Technology, Engineering and Mathematics, July 27–29, 2021, Catania, IT
rafal.brociek@polsl.pl (R. Brociek); demagistris@diag.uniroma1.it (G. De Magistris); cardia.1759331@studenti.uniroma1.it (F. Cardia); coppa.1749614@studenti.uniroma1.it (F. Coppa); samuele.russo@uniroma1.it (S. Russo)
ORCID: 0000-0002-7255-6951 (R. Brociek); 0000-0002-3076-4509 (G. De Magistris); 0000-0002-1846-9996 (S. Russo)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The recent Covid-19 pandemic has changed many aspects of people's lives. One of the principal concerns is how easily the virus spreads through infected items. Of special concern are physical stores, where the same items can be touched by many people throughout the day. In this paper a system to efficiently detect human interaction with clothes in clothing stores is presented. The system recognizes the elements that have been touched, allowing a selective sanitization of potentially infected items. Two approaches are presented and compared: the pixel approach and the bounding box approach. The former has better detection performance while the latter is slightly more efficient.

1. Introduction

The recent Covid-19 pandemic has severely affected most commercial activities [18, 5, 23]. The restrictions imposed by governments to counter the spreading of the virus had a big impact on most retail stores, favouring online shopping [14], where the infection risk through infected items is obviously reduced. In this context, an efficient sanitization of stores would decrease the exposure to infection [6, 2], making people more inclined to return to physical shopping.

Some contexts, especially where several people are present at the same time, do not allow keeping every part of the environment under control. In particular it is difficult to stay aware of all the physical contacts between people, with the environment and with its contents. During the COVID-19 pandemic it has become necessary to constantly sanitize the environment and all its potentially contaminated parts, so any help that can facilitate this task would be of great use in such contexts. Moreover, the sanitizing actions carried out by an employee in the presence of the customer can, in certain circumstances, induce a feeling of annoyance or discomfort. However, postponing the intervention can be difficult, because the cleaner would not remember precisely which parts of the environment came into contact with the customer, and likewise cannot take them all into consideration, because at that moment he was not present or his attention was focused elsewhere. An automatic system capable of recognizing and remembering potentially contaminated areas or objects can considerably reduce the effort for the sanitizer and improve the accuracy and effectiveness of his action. At the same time, such a solution would considerably reduce the discomfort that the customer can experience when a sanitizer disinfects, in front of them, every object they have touched. This allows the customer to avoid embarrassment while maintaining a relationship of trust, reducing the risk of a mortification by which the customer would feel limited in expressing his own behaviour while exploring the store. The same principle is applicable to professional studios and other facilities, where building an alliance and a relationship of trust between the professional and the client is always a critical and delicate moment that must be managed with the utmost sensibility.

The aim of this work is to create a system that helps sellers sanitize items faster and more efficiently, knowing which products should be sanitized and which should not. In particular, we designed a system for clothing stores, but the same solution can be adapted to many other retail stores. The general idea is a system able to detect the touch action. We restricted the context to clothing stores because the model is more efficient when trained on a specific set of objects. Moreover, clothing stores represent one of the commercial activities with a higher risk of spreading the virus, since people continuously touch and try on clothes before buying them.

In section 2 we formalize the problem of touch detection and relate it to the state of the art. Sections 3 and 4 respectively describe the models and the datasets used in the proposed system. In section 5 we illustrate the training strategy and some implementation details, in order to make the system easily reproducible. In section 6 we report the performance metrics and compare the different approaches. In section 7 conclusions are drawn.
2. Problem Definition and State of the Art

This paper presents a new method for detecting the "touch" event; in particular, we narrowed the scope to the action of touching clothes with one's hands. The collision detection task in a 3D environment is a well studied problem in the literature [16, 19, 15] and finds application in many fields such as robotics and video gaming. However, to the best of our knowledge, there is no equivalent formulation in the context of 2D images, where there is no depth information. According to our formulation, touch detection is based on the recognition of the objects of interest (in this case clothes and hands). The result of such recognition, depending on the method used, is either a set of coordinates identifying a bounding box (detection) or a pixel mask (segmentation). The bounding box or the pixel mask is then used to check whether the two objects overlap. We will refer to the former as the Bounding Box approach and to the latter as the Pixel approach. In the first case a simple check on the coordinates is sufficient (see algorithm 1), while in the other a parallel scan of the pixels of the two masks is needed (see algorithm 2).

Algorithm 1: check the overlap between two rectangular bounding boxes
  Input: A, B = upper-left and bottom-right vertices of the first rectangle
         A', B' = upper-left and bottom-right vertices of the second rectangle
  Output: Overlap / No Overlap
  Time complexity: O(1)
  if A'.x > B.x or A.x > B'.x then return No Overlap
  if B.y > A'.y or B'.y > A.y then return No Overlap
  return Overlap

Algorithm 2: check the overlap between two pixel masks
  Input: M1 = N×N pixel mask of the first object
         M2 = N×N pixel mask of the second object
  Output: Overlap / No Overlap
  Time complexity: O(N²)
  for i = 0 to N-1 do
    for j = 0 to N-1 do
      if M1[i,j] and M2[i,j] then return Overlap
  return No Overlap
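To make the two checks concrete, here is a minimal Python sketch of both algorithms (ours, not from the paper). Boxes are given as (upper-left, bottom-right) corner pairs in image coordinates, where y grows downward, so the vertical test mirrors the y-up notation used in algorithm 1; the O(N²) pixel scan of algorithm 2 is vectorized with NumPy.

```python
import numpy as np

def boxes_overlap(A, B, A2, B2):
    """Algorithm 1: O(1) overlap test between two axis-aligned rectangles.

    A, B (and A2, B2) are the (x, y) upper-left and bottom-right corners,
    in image coordinates (y grows downward)."""
    if A2[0] > B[0] or A[0] > B2[0]:  # separated along x
        return False
    if A2[1] > B[1] or A[1] > B2[1]:  # separated along y
        return False
    return True

def masks_overlap(M1, M2):
    """Algorithm 2: overlap test between two boolean pixel masks.

    Equivalent to the paper's O(N^2) double scan, vectorized."""
    return bool(np.logical_and(M1, M2).any())

# Toy usage: two 4x4 masks sharing a single pixel, two overlapping boxes.
m1 = np.zeros((4, 4), dtype=bool); m1[1, 1] = True
m2 = np.zeros((4, 4), dtype=bool); m2[1, 1] = True
assert masks_overlap(m1, m2)
assert boxes_overlap((0, 0), (2, 2), (1, 1), (3, 3))
```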
3. Method

Object detection and image segmentation are two fundamental problems in computer vision. Before the incredible success of deep learning, these tasks were performed using solely standard computer vision algorithms. For example, selective search [24, 20] leverages the hierarchical structure of images and, starting from an initial segmentation, recursively merges similar patches in terms of colour, texture, size and shape [4]. State of the art deep learning models for detection and segmentation are based on the R-CNN architecture introduced in [10]. This network receives as input a set of region proposals, which are the candidates for classification (the architecture is independent of the proposal algorithm used); a large pre-trained CNN is then used to extract features from the selected regions, and class-specific linear Support Vector Machines (SVMs) classify the regions [21]. The main problem of this architecture was the long evaluation time, preventing online usage, hence Fast R-CNN [9] was introduced to speed up evaluation. This model learns to classify object proposals and to refine their spatial locations jointly. Each region proposal is mapped into a fixed-length feature vector using interleaved convolutional and pooling layers followed by fully connected layers. The feature vector then flows into two output branches whose outputs are, respectively, softmax class probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.

For the task of object detection we used Faster R-CNN [22], an extension of Fast R-CNN that avoids the bottleneck of the region proposal module through the introduction of a Region Proposal Network (RPN). The RPN is a fully convolutional network, sharing the convolutional features of the detection network, that simultaneously predicts object bounds and objectness scores at each position. It is trained end-to-end to generate high-quality region proposals, which are then used by Fast R-CNN for detection.

For the task of image segmentation we used Mask R-CNN [13], which extends the Faster R-CNN architecture with a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. This network adds only a small overhead with respect to Faster R-CNN and runs at 5 fps. Moreover, Mask R-CNN surpassed all previous state-of-the-art single-model results on the COCO instance segmentation benchmark [17].
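The paper uses the Keras/TensorFlow implementation of [1]; purely to illustrate the two kinds of output described above (boxes for the Bounding Box approach, masks for the Pixel approach), here is a sketch using torchvision's off-the-shelf Mask R-CNN as an assumed stand-in, not the authors' code. The file name and the 0.7 score threshold are arbitrary placeholders.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Off-the-shelf Mask R-CNN (ResNet-50 FPN backbone, COCO weights);
# the paper instead fine-tunes two single-purpose models on its own data.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = to_tensor(Image.open("store_frame.jpg").convert("RGB"))  # placeholder file
with torch.no_grad():
    out = model([image])[0]

keep = out["scores"] > 0.7            # arbitrary confidence threshold
boxes = out["boxes"][keep]            # (K, 4) boxes (x1, y1, x2, y2): Bounding Box approach
masks = out["masks"][keep, 0] > 0.5   # (K, H, W) binarized soft masks: Pixel approach
```

The boxes feed algorithm 1 directly, while the binarized masks feed algorithm 2.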
4. Datasets

The models of the R-CNN family are trained with labelled and annotated images. We trained two separate models, respectively for hands and clothes recognition, hence the objects of interest are labelled with a single label. For the task of object detection the annotation consists of the four values that identify the bounding box: the x and y pixel coordinates of the centre, and the width and height in pixels. For the segmentation task the ground truth is another image with the same dimensions as the original, where the pixels that belong to the object of interest are white (mask) and the background is black (see figure 1). The network only accepts dimensions such as 256, 320 or 384, i.e. anything divisible by 2 at least 6 times (multiples of 64). For this reason each image in both the Hands and Clothing datasets has dimensions 384×448.

Figure 1: Pixel masks for a hand (left) and for trousers (right)

The Hands Dataset (see figure 2) was obtained by collecting 400 images for training and 100 for testing from three well-known datasets: EgoHands [3], HandOverFace [8] and EgoYouTubeHands [25]. Moreover, 40 images for training and 10 for testing were added manually. We chose images from multiple datasets to have representations of hands in different contexts, in order to improve the generalization power of the model.

Figure 2: Samples from the Hands Dataset

For the clothing recognition task we built a dataset of 500 images labelled with the following four labels (see figure 3): t-shirt, trousers, skirt, long sleeve. We followed the common practice of partitioning the dataset using 80% for training and the remainder for validation. Some images were randomly selected through Google Search, some were taken from the known Clothing Dataset [11] and others were added manually.

Figure 3: Samples from each of the four categories of the Clothing dataset, from left to right: t-shirt, trousers, skirt, long sleeve

In both datasets, images were annotated using the VIA Annotation Software [7], an open source lightweight tool that runs in the web browser and allows annotating images with bounding boxes or pixel masks.

5. Training

We trained the two models separately, respectively for hands and for clothes detection and segmentation. We used the Mask R-CNN implementation provided by [1] both for detection (bounding box) and for segmentation. Recall that Mask R-CNN is an extension of Faster R-CNN that adds a branch for predicting the mask, while the rest of the architecture is unchanged, including the branch for bounding box regression. This implementation requires only the pixel mask annotation; the ground-truth bounding box is computed on the fly by picking the smallest box that encapsulates all the pixels of the mask. Both models were fine-tuned (all layers) for 50 epochs, with a learning rate of 0.0001, a weight decay of 0.00001, ResNet-101 [12] as backbone, and some data augmentation techniques to improve the performance of Mask R-CNN. Figure 4 shows the learning curves, while figure 5 illustrates the single components of the loss function on the validation set. Since the two models have similar plots, for conciseness we illustrate only those of the model trained on the Clothing dataset.

Figure 4: Learning curves, from left to right the training and validation loss

Figure 5: Components of the loss function of the Mask R-CNN. The rpn_class_loss (5a) and rpn_bbox_loss (5b) indicate respectively how well the Region Proposal Network separates background from objects and localizes objects, while mrcnn_bbox_loss (5c), mrcnn_class_loss (5d) and mrcnn_mask_loss (5e) measure the performance of Mask R-CNN in localizing, labelling and segmenting objects.
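As a reproducibility aid, the sketch below shows how the stated hyperparameters (50 epochs, learning rate 0.0001, weight decay 0.00001, ResNet-101 backbone, fine-tuning of all layers, augmentation) map onto the matterport Mask R-CNN API [1]. The class name, the COCO starting weights, the flip augmentation and the images-per-GPU setting are our assumptions; dataset loading (subclasses of mrcnn.utils.Dataset) is omitted, and the on-the-fly ground-truth boxes mentioned above correspond to the library's utils.extract_bboxes.

```python
import imgaug.augmenters as iaa
from mrcnn.config import Config
from mrcnn import model as modellib

class ClothesConfig(Config):
    """Hyperparameters from section 5, expressed as a matterport Config."""
    NAME = "clothes"
    NUM_CLASSES = 1 + 4          # background + t-shirt, trousers, skirt, long sleeve
    BACKBONE = "resnet101"       # ResNet-101 [12]
    LEARNING_RATE = 0.0001
    WEIGHT_DECAY = 0.00001
    IMAGES_PER_GPU = 1           # assumption: not stated in the paper

def fine_tune(dataset_train, dataset_val):
    """Fine-tune all layers for 50 epochs, as described in the text."""
    config = ClothesConfig()
    model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")
    # Start from COCO weights, dropping the heads that depend on NUM_CLASSES.
    model.load_weights("mask_rcnn_coco.h5", by_name=True,
                       exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                                "mrcnn_bbox", "mrcnn_mask"])
    # "Some data augmentation techniques": horizontal flips as a simple example.
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=50, layers="all",
                augmentation=iaa.Fliplr(0.5))
    return model
```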
6. Results

At this point we have the two models trained to detect (with a bounding box) and to segment clothes and hands, respectively. In order to test the two approaches (bounding box vs pixel mask) we manually built a new dataset of 100 photographed images containing hands and clothes, and we labelled each image with one of the two labels overlap and no overlap. To check the overlap we used algorithms 1 and 2, respectively, for the bounding box and pixel approaches. The result is a set of images with their associated labels. Table 1 reports some metrics that are commonly used to evaluate the detection, while figure 6 shows the confusion matrices for the two approaches.

Table 1: Evaluation metrics for the task of touch detection. The first row reports the metrics for the Bounding Box approach, the second row those for the Pixel approach.

                 Accuracy   Precision   Recall   F1 score
  Bounding Box     0.56       0.53       0.89      0.66
  Pixel Mask       0.88       0.89       0.88      0.88

Figure 6: The confusion matrices for the touch detection task. On the left the values refer to the Bounding Box approach, on the right to the Pixel approach.

From these metrics it emerges that the Pixel approach is much superior to the Bounding Box approach. In particular, the Bounding Box approach returns many false positives, because the bounding boxes often overlap while the objects inside them do not, as shown in figure 7.

Figure 7: Bounding box approach (left) compared with the Pixel approach (right). This is an example of a misclassification by the Bounding Box approach (the rectangles are contained one into the other) and a correct classification by the Pixel based approach (the pixel masks have no pixel in common).
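The metrics in Table 1 are the standard binary-classification ones; as an illustration (ours, not part of the paper), the helper below computes them from ground-truth and predicted overlap labels such as those produced on the 100-image test set.

```python
def touch_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary touch labels.

    y_true, y_pred: equal-length sequences of booleans (True = overlap)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))          # true positives
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))  # true negatives
    fp = sum(not t and p for t, p in zip(y_true, y_pred))      # false positives
    fn = sum(t and not p for t, p in zip(y_true, y_pred))      # false negatives
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

With these definitions, the high recall (0.89) but low precision (0.53) of the Bounding Box approach in Table 1 directly reflects the excess of false positives discussed above.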
7. Conclusion

In this work a system to efficiently detect human interaction with objects in clothing stores was presented. The proposed system can be easily adapted to a variety of other fields by changing the datasets used for the object detection and segmentation tasks. We presented two approaches, the former based on object detection with bounding boxes and the latter based on segmentation, and we showed that the second one performs much better at the cost of a small overhead. A further improvement to the proposed model would be the introduction of depth information. This extension, however, would increase the performance at the expense of a higher cost for more specialized hardware, and this factor could limit its widespread use. That said, we think that our system achieves good enough results to be deployed in physical stores as a highly cost-effective tool for the containment of the Covid-19 pandemic.

References

[1] Waleed Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN. 2017.
[2] R. Avanzato et al. "YOLOv3-based mask and face recognition algorithm for individual protection applications". In: vol. 2768. 2020, pp. 41–45.
[3] Sven Bambach et al. "Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions". In: The IEEE International Conference on Computer Vision (ICCV). Dec. 2015.
[4] F. Bonanno et al. "Optimal thicknesses determination in a multilayer structure to improve the SPP efficiency for photovoltaic devices by an hybrid FEM - Cascade Neural Network based approach". In: 2014, pp. 355–362. doi: 10.1109/SPEEDAM.2014.6872103.
[5] P. Caponnetto et al. "The effects of physical exercise on mental health: From cognitive improvements to risk of addiction". In: International Journal of Environmental Research and Public Health 18.24 (2021). doi: 10.3390/ijerph182413384.
[6] Federica Carraturo et al. "Persistence of SARS-CoV-2 in the environment and COVID-19 transmission risk from environmental matrices and surfaces". In: Environmental Pollution 265 (2020), p. 115010.
[7] Abhishek Dutta and Andrew Zisserman. "The VIA annotation software for images, audio and video". In: Proceedings of the 27th ACM International Conference on Multimedia. 2019, pp. 2276–2279.
[8] Sakher Ghanem, Ashiq Imran, and Vassilis Athitsos. "Analysis of hand segmentation on challenging hand over face scenario". In: Proceedings of the 12th ACM International Conference on PErvasive Technologies Related to Assistive Environments. 2019, pp. 236–242.
[9] Ross Girshick. "Fast R-CNN". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1440–1448.
[10] Ross Girshick et al. "Rich feature hierarchies for accurate object detection and semantic segmentation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.
[11] Alexey Grigorev. Clothing dataset. 2020. url: https://www.kaggle.com/agrigorev/clothing-dataset-full.
[12] Kaiming He et al. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
[13] Kaiming He et al. "Mask R-CNN". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2961–2969.
[14] Rae Yule Kim. "The impact of COVID-19 on consumers: Preparing for digital sales". In: IEEE Engineering Management Review 48.3 (2020), pp. 212–218.
[15] Sinan Kockara et al. "Collision detection: A survey". In: 2007 IEEE International Conference on Systems, Man and Cybernetics. IEEE. 2007, pp. 4046–4051.
[16] Ming Lin and Stefan Gottschalk. "Collision detection between geometric models: A survey". In: Proc. of IMA Conference on Mathematics of Surfaces. Vol. 1. Citeseer. 1998, pp. 602–608.
[17] Tsung-Yi Lin et al. "Microsoft COCO: Common objects in context". In: European Conference on Computer Vision. Springer. 2014, pp. 740–755.
[18] V. Marcotrigiano et al. "An integrated control plan in primary schools: Results of a field investigation on nutritional and hygienic features in the Apulia region (Southern Italy)". In: Nutrients 13.9 (2021). doi: 10.3390/nu13093006.
[19] C. Napoli, G. Pappalardo, and E. Tramontana. "An agent-driven semantical identifier using radial basis neural networks and reinforcement learning". In: vol. 1260. 2014.
[20] C. Napoli, G. Pappalardo, and E. Tramontana. "Using modularity metrics to assist move method refactoring of large systems". In: 2013, pp. 529–534. doi: 10.1109/CISIS.2013.96.
[21] B.A. Nowak et al. "Multi-class nearest neighbour classifier for incomplete data handling". In: vol. 9119. 2015, pp. 469–480. doi: 10.1007/978-3-319-19324-3_42.
[22] Shaoqing Ren et al. "Faster R-CNN: Towards real-time object detection with region proposal networks". In: Advances in Neural Information Processing Systems 28 (2015), pp. 91–99.
[23] S. Russo et al. "Reducing the psychological burden of isolated oncological patients by means of decision trees". In: vol. 2768. 2020, pp. 46–53.
[24] Jasper R. R. Uijlings et al. "Selective search for object recognition". In: International Journal of Computer Vision 104.2 (2013), pp. 154–171.
[25] Aisha Urooj and Ali Borji. "Analysis of hand segmentation in the wild". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 4710–4719.