<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CISIS.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/CISIS.2013.96</article-id>
      <title-group>
        <article-title>Contagion Prevention of COVID-19 by means of Touch Detection for Retail Stores</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rafał Brociek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio De Magistris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Cardia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Coppa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuele Russo</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Psychology, Sapienza University of Rome</institution>
          ,
          <addr-line>via dei Marsi 78 Roma 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sapienza University of Rome</institution>
          ,
          <addr-line>piazzale Aldo Moro 5, Roma 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2021</year>
      </pub-date>
      <volume>96</volume>
      <fpage>89</fpage>
      <lpage>94</lpage>
      <abstract>
<p>The recent Covid-19 pandemic has changed many aspects of people's lives. One of the principal concerns regards how easily the virus spreads through infected items. Of special concern are physical stores, where the same items can be touched by many people throughout the day. In this paper a system to efficiently detect human interaction with clothes in clothing stores is presented. The system recognizes the elements that have been touched, allowing a selective sanitization of potentially infected items. In this work two approaches are presented and compared: the pixel approach and the bounding box approach. The former has better detection performance while the latter is slightly more efficient.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The recent Covid-19 pandemic has heavily affected most commercial activities [18, 5, 23]. The recent restrictions imposed by governments to contrast the spreading of the virus had a big impact on most retail stores, favouring online shopping [14], where the infection risk through infected items is obviously reduced. In this context, an efficient sanitization of stores would decrease the exposure to infection [6, 2], making people more inclined to return to physical shopping.</p>
      <p>Some contexts, especially where several people are present at the same time, often do not allow to keep under control every part of the environment. In particular, it gets difficult to stay aware of all the physical contacts of people, among themselves, with the environment and with its content. During the COVID-19 pandemic it has become necessary to constantly sanitize the environment and all its potentially contaminated parts. Therefore it has become clear that any help that can facilitate this task would be of great use in such contexts. Moreover, the sanitizing actions carried out by an employee in the presence of the customer can, in certain circumstances, induce a feeling of annoyance or discomfort. However, postponing the intervention can be difficult, because the cleaner would not remember precisely which parts of the environment came into contact with the customer, and likewise cannot take them all into consideration, because at that moment he was not present or his attention was focused elsewhere. The use of an automatic system capable of recognizing and remembering potentially contaminated areas or objects can considerably reduce the effort of the sanitizer and improve the accuracy and effectiveness of his action. At the same time, the implementation of such a solution would considerably reduce the feeling of discomfort that the customer can experience in the presence of a sanitizer who disinfects, in front of the customer, every object that the customer has touched. This aspect allows the customer to avoid embarrassment while maintaining a relationship of trust, reducing the risk of a mortification for which the customer would feel limited in the possibility of expressing his own behaviour while exploring the store. This principle is also applicable to professional studios and other facilities, where the construction of an alliance and a relationship of trust between the professional and the client is always a critical and delicate moment that must be managed with the utmost sensibility.</p>
      <p>The aim of this work consists in creating a system that is able to help sellers sanitize items faster and more efficiently, knowing which products should be sanitized and which should not. In particular, we designed a system for clothing stores, but the same solution can be adapted to many other retail stores. The general idea consists in the implementation of a system that is able to detect the touch action. We decided to restrict the context to clothing stores because the model is more efficient when trained on a specific set of objects. Moreover, clothing stores represent one of the commercial activities with a higher risk of Covid spreading, since people touch and try on dresses continuously before buying them.</p>
<p>In section 2 we formalize the problem of touch detection and we relate it to the state of the art. Sections 3 and 4 respectively describe the models and the datasets that are used in the proposed system. In section 5 we illustrate the training strategy and some implementation details in order to make the system easily reproducible. In section 6 we report the performance metrics and compare the different approaches. In section 7 conclusions are drawn.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Definition and State of Art</title>
<p>This paper presents a new method for detecting the "touch" event; in particular, we narrowed the scope to the action of touching clothes with one's hands. The collision detection task in a 3D environment is a well studied problem in the literature [16, 19, 15] and it finds application in many fields such as robotics and video gaming. However, to the best of our knowledge, there is no equivalent formulation in the context of 2D images, where there is no depth information. According to our formulation, touch detection is based on the recognition of the objects of interest (in this case clothes and hands). The result of such recognition, depending on the method that has been used, can be either a set of coordinates identifying a bounding box (detection) or a pixel mask (segmentation). The bounding box or the pixel mask is then used to check if there is an overlap between the two objects. We will refer to the former as the Bounding Box approach and to the latter as the Pixel approach. In the first case a simple check on the coordinates is sufficient (see algorithm 1), while in the other a parallel scan of the pixels of the two masks is needed (see algorithm 2).</p>
      <sec id="sec-2-1">
        <title>Algorithm 1: Algorithm to check the overlap</title>
        <p>between two rectangular bounding boxes</p>
        <sec id="sec-2-1-1">
          <title>Input:</title>
<p>A, B = upper-left and bottom-right vertices of the first rectangle
A', B' = upper-left and bottom-right vertices of the second rectangle
Output:
Overlap / No Overlap
Time Complexity:
O(1)
Algorithm:
if A'.x &gt; B.x or A.x &gt; B'.x then
  return No Overlap
if B.y &gt; A'.y or B'.y &gt; A.y then
  return No Overlap
return Overlap</p>
        </sec>
      </sec>
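      <p>As an illustration, the check of algorithm 1 can be written in a few lines of Python; the corner-tuple representation of the boxes (in image coordinates, with the y axis pointing down) is an assumption of this sketch, not a prescribed data structure.</p>
      <preformat>
def boxes_overlap(box_a, box_b):
    """Return True if two axis-aligned bounding boxes overlap.

    Boxes are (x_min, y_min, x_max, y_max) in image coordinates,
    the convention used by most detector implementations.
    Runs in O(1) time, as in algorithm 1.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Separated horizontally: one box entirely to the left of the other.
    if bx0 &gt; ax1 or ax0 &gt; bx1:
        return False
    # Separated vertically: one box entirely above the other.
    if by0 &gt; ay1 or ay0 &gt; by1:
        return False
    return True

# Example: a hand box grazing a shirt box.
print(boxes_overlap((10, 10, 50, 50), (40, 40, 90, 90)))  # True
      </preformat>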
      <sec id="sec-2-2">
        <title>Algorithm 2: Algorithm to check the overlap</title>
        <p>between two pixel masks</p>
        <sec id="sec-2-2-1">
          <title>Input:</title>
<p>M1 ∈ {0,1}^(N×N) = pixel mask of the first object
M2 ∈ {0,1}^(N×N) = pixel mask of the second object
Output:
Overlap / No Overlap
Time Complexity:
O(N²)
Algorithm:
for i = 0 to N-1 do
  for j = 0 to N-1 do
    if M1[i,j] and M2[i,j] then
      return Overlap
return No Overlap</p>
        </sec>
      </sec>
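      <p>The scan of algorithm 2 can likewise be vectorized; the following minimal sketch assumes the two masks are boolean NumPy arrays of equal shape, as produced by a segmentation model.</p>
      <preformat>
import numpy as np

def masks_overlap(mask_a: np.ndarray, mask_b: np.ndarray) -> bool:
    """Return True if two boolean pixel masks share at least one pixel.

    Equivalent to the double loop of algorithm 2, but vectorized:
    the element-wise AND and the reduction still visit every pixel,
    so the asymptotic cost remains O(N^2) for N x N masks.
    """
    return bool(np.logical_and(mask_a, mask_b).any())

# Example with two tiny 4x4 masks that overlap in one pixel.
hand = np.zeros((4, 4), dtype=bool); hand[1:3, 1:3] = True
shirt = np.zeros((4, 4), dtype=bool); shirt[2:4, 2:4] = True
print(masks_overlap(hand, shirt))  # True
      </preformat>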
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
<p>Object detection and image segmentation are two fundamental problems in computer vision. Before the incredible success of deep learning these tasks were performed using solely standard computer vision algorithms. For example, selective search [24, 20] leverages the hierarchical structure of images and, starting from an initial segmentation, recursively merges similar patches in terms of color, texture, size and shape [4]. State of the art deep learning models for detection and segmentation are based on the R-CNN architecture introduced in [10]. This network receives as input a set of region proposals, which are the candidates for the classification (the architecture is independent of the proposal algorithm used); a large pre-trained CNN is then used to extract features from the selected regions, and finally class-specific linear Support Vector Machines (SVMs) classify the regions [21]. The main problem of this architecture was the long evaluation time, preventing the model from being used online, hence Fast R-CNN [9] was introduced to speed up evaluation. This model learns to classify object proposals and to refine their spatial locations jointly. Each region proposal is mapped into a fixed-length feature vector using interleaved convolutional and pooling layers followed by fully connected layers. The feature vector then flows into two output branches, whose outputs are respectively softmax class probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.</p>
<p>For the task of object detection we used Faster R-CNN [22], an extension of Fast R-CNN that avoids the bottleneck of the region proposal module through the introduction of a Region Proposal Network (RPN). The RPN is a fully convolutional network that shares the convolutional features of the detection network and simultaneously predicts object bounds and objectness scores at each position. It is trained end-to-end to generate high-quality region proposals, which are then used by Fast R-CNN for detection.</p>
      <p>For the task of image segmentation, we used Mask
R-CNN [13] that extends the Faster R-CNN architecture
with a branch for predicting an object mask in parallel
with the existing branch for bounding box recognition.
This network adds only a small overhead with respect
to Faster R-CNN and it runs at 5 fps. Moreover, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation competition [17].</p>
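      <p>As a purely illustrative sketch (using a pretrained torchvision model rather than the implementation adopted in this work, see section 5), the snippet below shows how Mask R-CNN returns, for each detected object, the bounding box used by the Bounding Box approach and the pixel mask used by the Pixel approach.</p>
      <preformat>
import torch
import torchvision

# Pretrained Mask R-CNN with a ResNet-50 FPN backbone (COCO weights).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# A dummy 3-channel image; a real store frame would be loaded instead.
image = torch.rand(3, 384, 448)

with torch.no_grad():
    output = model([image])[0]

boxes = output["boxes"]        # (num_objects, 4): x_min, y_min, x_max, y_max
masks = output["masks"] &gt; 0.5  # (num_objects, 1, H, W) boolean pixel masks
scores = output["scores"]      # confidence of each detection
      </preformat>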
    </sec>
    <sec id="sec-4">
      <title>4. Datasets</title>
<p>The models from the R-CNN family are trained with labelled and annotated images. We trained two separate models, respectively for hands and clothes recognition, hence the objects of interest are labelled with a single label. For the task of object detection the annotation consists of the four values that identify the bounding box, namely the x and y pixel coordinates of the center, and the width and height in pixels. For the segmentation task the ground truth is another image with the same dimensions as the original image, where the pixels that belong to the object of interest are white (mask) and the background is black (see figure 1). The network only allows dimensions such as 256, 320 or 384, i.e. whatever is divisible by 2 at least 6 times. For this reason each image in both the Hands and Clothing datasets has dimensions 384×448. The Hands Dataset (see figure 2) was obtained by collecting 400 images for training and 100 for testing from three well-known datasets: EgoHands [3], HandOverFace [8] and EgoYouTubeHands [25]. Moreover, 40 images for training and 10 for testing were added manually. We chose images from multiple datasets to have representations of hands in different contexts, in order to improve the generalization power of the model. For the clothing recognition task we built a dataset of 500 images labelled with the following four labels (see figure 3): t-shirt, trousers, skirt, long sleeve. We followed the common practice of partitioning the dataset using 80% for training and the remainder for validation. Some images were randomly selected through Google Search, some were taken from the well-known Clothing Dataset [11] and others were added manually. In both datasets, images were annotated using the VIA Annotation Software [7], an open-source lightweight tool that runs in the web browser and allows to annotate images with bounding boxes or pixel masks.</p>
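      <p>The dimension constraint mentioned above comes from the repeated halvings of the feature maps: each side must be divisible by 2 at least 6 times, i.e. by 64. A small helper enforcing it could look as follows; the use of PIL and the rounding policy are assumptions of the example.</p>
      <preformat>
from PIL import Image

STRIDE = 64  # each side must be divisible by 2 six times (2**6 = 64)

def resize_to_valid(img: Image.Image) -> Image.Image:
    """Resize an image so both sides are multiples of 64, e.g. 384x448."""
    w = max(STRIDE, round(img.width / STRIDE) * STRIDE)
    h = max(STRIDE, round(img.height / STRIDE) * STRIDE)
    return img.resize((w, h))

# Example: a 400x430 photo is mapped to the valid size 384x448.
print(resize_to_valid(Image.new("RGB", (400, 430))).size)  # (384, 448)
      </preformat>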
    </sec>
    <sec id="sec-5">
      <title>5. Training</title>
<p>We trained the two models separately, respectively for hands and clothes detection and segmentation. We used the Mask R-CNN implementation provided by [1] both for detection (bounding box) and segmentation. Recall that Mask R-CNN is an extension of Faster R-CNN that adds a branch for predicting the mask, while the rest of the architecture is unchanged, including the branch for bounding box regression. This implementation requires only the annotation with the pixel mask; the ground-truth bounding box is computed on the fly by picking the smallest box that encapsulates all the pixels of the mask. Both models have been fine-tuned (all layers) for 50 epochs, with a learning rate of 0.0001, a weight decay of 0.00001, ResNet-101 [12] as backbone and some data augmentation techniques to improve the performance of Mask R-CNN. Figure 4 shows the learning curves, while figure 5 illustrates the single components of the loss function for the validation set. Considering that the two models have similar plots, we illustrate only those regarding the model trained on the Clothing dataset for the sake of conciseness.</p>
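      <p>Deriving the ground-truth box from the mask on the fly amounts to taking the extreme coordinates of the mask pixels, as in the following NumPy sketch (the boolean array convention is an assumption of the example).</p>
      <preformat>
import numpy as np

def mask_to_bbox(mask: np.ndarray) -> tuple:
    """Smallest axis-aligned box enclosing all True pixels of a mask.

    Returns (x_min, y_min, x_max, y_max) for a boolean (H, W) mask.
    """
    ys, xs = np.where(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((448, 384), dtype=bool)
mask[100:200, 50:120] = True
print(mask_to_bbox(mask))  # (50, 100, 119, 199)
      </preformat>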
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
<p>At this point we have the two models trained to respectively detect (with a bounding box) and segment clothes and hands. In order to test the two approaches (bounding box vs pixel mask) we manually built a new dataset of 100 photographed images containing hands and clothes, and we labelled each image with one of the two labels: overlap and no overlap. To check the overlap we used algorithms 1 and 2, respectively for the bounding box and pixel approaches. The result is a set of images with their associated labels. Table 1 reports some metrics that are commonly used to evaluate the detection: the F1 score is 0.66 for the Bounding Box approach and 0.88 for the Pixel Mask approach. Figure 6 shows the confusion matrices for the two approaches.</p>
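      <p>For reference, the reported F1 score can be computed from the predicted and ground-truth overlap labels as sketched below (the binary label encoding, with 1 meaning overlap, is an assumption of the example).</p>
      <preformat>
def f1_score(y_true: list, y_pred: list) -> float:
    """F1 score for binary overlap labels (1 = overlap, 0 = no overlap)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: 3 correct detections, 1 false positive, 1 false negative.
print(round(f1_score([1, 1, 1, 0, 1], [1, 1, 1, 1, 0]), 2))  # 0.75
      </preformat>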
<p>From these metrics it emerges that the Pixel approach is much superior to the Bounding Box approach. In particular, the Bounding Box approach returns a lot of false positives, because the bounding boxes often overlap while the objects inside do not, as shown in figure 7.</p>
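      <p>Putting the pieces together, the per-frame decision of the Pixel approach reduces to checking every detected hand mask against every detected garment mask and flagging the touched garments for sanitization; the following schematic sketch (function and variable names are illustrative) summarizes this glue logic.</p>
      <preformat>
import numpy as np

def touched_items(hand_masks, garment_masks):
    """Indices of garment masks sharing at least one pixel with any hand mask."""
    flagged = set()
    for hand in hand_masks:
        for idx, garment in enumerate(garment_masks):
            if np.logical_and(hand, garment).any():  # algorithm 2, vectorized
                flagged.add(idx)
    return sorted(flagged)

# Toy example: one hand touching the first of two garments.
hand = np.zeros((4, 4), dtype=bool); hand[0, 0] = True
shirt = np.zeros((4, 4), dtype=bool); shirt[0, 0] = True
skirt = np.zeros((4, 4), dtype=bool); skirt[3, 3] = True
print(touched_items([hand], [shirt, skirt]))  # [0]
      </preformat>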
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
<p>In this work a system to efficiently detect the human interaction with objects in clothing stores was presented. The proposed system can be easily adapted to a variety of other fields by changing the datasets used for the object detection and segmentation tasks. We presented two approaches, the former based on object detection with bounding boxes and the latter based on segmentation, and we showed that the second one performs much better at the cost of a small overhead. A further improvement to the proposed model would be the introduction of depth information. This extension, however, would increase the performance at the expense of a higher cost for more specialized hardware, and this factor could limit its widespread use. That said, we think that our system achieves good enough results to be implemented in physical stores as a highly cost-effective tool for the containment of the Covid-19 pandemic.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <sec id="sec-7-1">
        <title>Abhishek Dutta and Andrew Zisserman. “The VIA annotation software for images, audio and video”.</title>
        <p>In: Proceedings of the 27th ACM international
conference on multimedia. 2019, pp. 2276–2279.</p>
      </sec>
      <sec id="sec-7-2">
<title>Sakher Ghanem, Ashiq Imran, and Vassilis Athitsos. “Analysis of hand segmentation on challenging hand over face scenario”.</title>
        <p>In: Proceedings of the 12th ACM International Conference on PErvasive Technologies Related to Assistive Environments. 2019, pp. 236–242.</p>
      </sec>
      <sec id="sec-7-3">
<title>Ross Girshick. “Fast R-CNN”.</title>
        <p>In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 1440–1448.</p>
      </sec>
      <sec id="sec-7-4">
<title>Ross Girshick et al. “Rich feature hierarchies for accurate object detection and semantic segmentation”.</title>
        <p>In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2014, pp. 580–587.</p>
      </sec>
      <sec id="sec-7-5">
<title>Alexey Grigorev. Clothing dataset. 2020.</title>
        <p>url: https://www.kaggle.com/agrigorev/clothing-dataset-full.</p>
      </sec>
      <sec id="sec-7-6">
<title>Kaiming He et al. “Deep residual learning for image recognition”.</title>
        <p>In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778.</p>
      </sec>
      <sec id="sec-7-7">
<title>Kaiming He et al. “Mask R-CNN”.</title>
        <p>In: Proceedings of the IEEE international conference on computer vision. 2017, pp. 2961–2969.</p>
      </sec>
      <sec id="sec-7-8">
<title>Rae Yule Kim. “The impact of COVID-19 on consumers: Preparing for digital sales”.</title>
        <p>In: IEEE Engineering Management Review 48.3 (2020), pp. 212–218.</p>
      </sec>
      <sec id="sec-7-9">
<title>Sinan Kockara et al. “Collision detection: A survey”.</title>
        <p>In: 2007 IEEE International Conference on Systems, Man and Cybernetics. IEEE. 2007, pp. 4046–4051.</p>
      </sec>
      <sec id="sec-7-10">
<title>Ming Lin and Stefan Gottschalk. “Collision detection between geometric models: A survey”.</title>
        <p>In: Proc. of IMA conference on mathematics of surfaces. Vol. 1. Citeseer. 1998, pp. 602–608.</p>
      </sec>
      <sec id="sec-7-11">
<title>Tsung-Yi Lin et al. “Microsoft COCO: Common objects in context”.</title>
        <p>In: European conference on computer vision. Springer. 2014, pp. 740–755.</p>
      </sec>
      <sec id="sec-7-12">
<title>V. Marcotrigiano et al. “An integrated control plan in primary schools: Results of a field investigation on nutritional and hygienic features in the Apulia region (Southern Italy)”.</title>
        <p>In: Nutrients 13.9 (2021). doi: 10.3390/nu13093006.</p>
      </sec>
      <sec id="sec-7-13">
        <title>C. Napoli, G. Pappalardo, and E. Tramontana. “An agent-driven semantical identifier using radial basis neural networks and reinforcement learning”. In: vol. 1260. 2014.</title>
      </sec>
      <sec id="sec-7-14">
<title>B.A. Nowak et al. “Multi-class nearest neighbour classifier for incomplete data handling”. In: vol. 9119. 2015, pp. 469–480. doi: 10.1007/978-3-319-19324-3_42.</title>
      </sec>
      <sec id="sec-7-15">
<title>Shaoqing Ren et al. “Faster R-CNN: Towards real-time object detection with region proposal networks”.</title>
        <p>In: Advances in neural information processing systems 28 (2015), pp. 91–99.</p>
      </sec>
      <sec id="sec-7-16">
<title>S. Russo et al. “Reducing the psychological burden of isolated oncological patients by means of decision trees”.</title>
        <p>In: vol. 2768. 2020, pp. 46–53.</p>
      </sec>
      <sec id="sec-7-17">
<title>Jasper RR Uijlings et al. “Selective search for object recognition”.</title>
        <p>In: International journal of computer vision 104.2 (2013), pp. 154–171.</p>
      </sec>
      <sec id="sec-7-18">
<title>Aisha Urooj and Ali Borji. “Analysis of hand segmentation in the wild”.</title>
        <p>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 4710–4719.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>