<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CISIS.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/CISIS.2013.96</article-id>
      <title-group>
        <article-title>Contagion Prevention of COVID-19 by means of Touch Detection for Retail Stores</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rafał Brociek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio De Magistris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Cardia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Coppa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuele Russo</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Psychology, Sapienza University of Rome</institution>
          ,
          <addr-line>via dei Marsi 78 Roma 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sapienza University of Rome</institution>
          ,
          <addr-line>piazzale Aldo Moro 5, Roma 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2021</year>
      </pub-date>
      <volume>96</volume>
      <fpage>89</fpage>
      <lpage>94</lpage>
      <abstract>
<p>The recent Covid-19 pandemic has changed many aspects of people's lives. One of the principal concerns regards how easily the virus spreads through infected items. Of special concern are physical stores, where the same items can be touched by many people throughout the day. In this paper a system to efficiently detect human interaction with clothes in clothing stores is presented. The system recognizes the elements that have been touched, allowing a selective sanitization of potentially infected items. In this work two approaches are presented and compared: the pixel approach and the bounding box approach. The former has better detection performance while the latter is slightly more efficient.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The recent Covid-19 pandemic has heavily affected most commercial activities [18, 5, 23]. The recent restrictions imposed by governments to contrast the spreading of the virus had a big impact on most retail stores, favouring online shopping [14], where the infection risk through infected items is obviously reduced. In this context, an efficient sanitization of stores would decrease the exposure to infection [6, 2], making people more inclined to return to physical shopping.</p>
      <p>Some contexts, especially where several people are present at the same time, often do not allow to keep under control every part of the environment. In particular, it gets difficult to stay aware of all the physical contacts of people, among themselves, with the environment and with its content. During the COVID-19 pandemic it has become necessary to constantly sanitize the environment and all its potentially contaminated parts. Therefore it has become clear that any help that can facilitate this task would be of great use in such contexts. Moreover, the sanitizing actions carried out by an employee in the presence of the customer can, in certain circumstances, induce a feeling of annoyance or discomfort. However, postponing the intervention can be difficult, because the cleaner would not remember precisely which parts of the environment came into contact with the customer, and likewise cannot take them all into consideration, because at that moment he was not present or his attention was focused elsewhere. The use of an automatic system capable of recognizing and remembering potentially contaminated areas or objects can considerably reduce the effort of the sanitizer and improve the accuracy and effectiveness of his action. At the same time, the implementation of such a solution would considerably reduce the feeling of discomfort that the customer can experience in the presence of a sanitizer who disinfects, in front of the customer, every object that the customer has touched. This aspect allows the customer to avoid embarrassment while maintaining a relationship of trust, reducing the risk of a mortification for which the customer would feel limited in the possibility of expressing his own behaviour while exploring the store. This principle is also applicable to professional studios and other facilities, where the construction of an alliance and a relationship of trust between the professional and the client is always a critical and delicate moment that must be managed with the utmost sensibility.</p>
      <p>The aim of this work consists in creating a system that is able to help sellers sanitize items faster and more efficiently, knowing which products should be sanitized and which should not. In particular, we designed a system for clothing stores, but the same solution can be adapted to many other retail stores. The general idea consists in the implementation of a system that is able to detect the touch action. We decided to restrict the context to clothing stores because the model is more efficient when trained on a specific set of objects. Moreover, clothing stores represent one of the commercial activities with a higher risk of Covid spreading, since people touch and try on dresses continuously before buying them.</p>
<p>In section 2 we formalize the problem of touch detection and we relate it to the state of the art. Sections 3 and 4 respectively describe the models and the datasets that are used in the proposed system. In section 5 we illustrate the training strategy and some implementation details in order to make the system easily reproducible. In section 6 we report the performance metrics and compare the different approaches. In section 7 conclusions are drawn.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Definition and State of Art</title>
<p>This paper presents a new method for detecting the "touch" event; in particular, we narrowed the scope to the action of touching clothes with one's hands. The collision detection task in a 3D environment is a well studied problem in the literature [16, 19, 15] and it finds application in many fields such as robotics and video gaming. However, to the best of our knowledge, there is no equivalent formulation in the context of 2D images, where there is no depth information. According to our formulation, touch detection is based on the recognition of the objects of interest (in this case clothes and hands). The result of such recognition, depending on the method that has been used, can be either a set of coordinates identifying a bounding box (detection) or a pixel mask (segmentation). The bounding box or the pixel mask is then used to check if there is an overlap between the two objects. We will refer to the former as the Bounding Box approach and to the latter as the Pixel approach. In the first case a simple check on the coordinates is sufficient (see algorithm 1), while in the other a parallel scan of the pixels of the two masks is needed (see algorithm 2).</p>
      <sec id="sec-2-1">
        <title>Algorithm 1: Algorithm to check the overlap</title>
        <p>between two rectangular bounding boxes</p>
        <sec id="sec-2-1-1">
          <title>Input:</title>
<p>A, B = upper-left and bottom-right vertices of the first rectangle
A', B' = upper-left and bottom-right vertices of the second rectangle
Output:
Overlap / No Overlap
Time Complexity:
O(1)
Algorithm:
if A'.x &gt; B.x or A.x &gt; B'.x then
  return No Overlap
if B.y &gt; A'.y or B'.y &gt; A.y then
  return No Overlap
return Overlap</p>
        </sec>
      </sec>
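      <p>As an illustration, the check of algorithm 1 can be written in a few lines of Python; the corner-tuple representation of the boxes (in image coordinates, with the y axis pointing down) is an assumption of this sketch, not a prescribed data structure.</p>
      <preformat>
def boxes_overlap(box_a, box_b):
    """Return True if two axis-aligned bounding boxes overlap.

    Boxes are (x_min, y_min, x_max, y_max) in image coordinates,
    the convention used by most detector implementations.
    Runs in O(1) time, as in algorithm 1.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Separated horizontally: one box entirely to the left of the other.
    if bx0 &gt; ax1 or ax0 &gt; bx1:
        return False
    # Separated vertically: one box entirely above the other.
    if by0 &gt; ay1 or ay0 &gt; by1:
        return False
    return True

# Example: a hand box grazing a shirt box.
print(boxes_overlap((10, 10, 50, 50), (40, 40, 90, 90)))  # True
      </preformat>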
      <sec id="sec-2-2">
        <title>Algorithm 2: Algorithm to check the overlap</title>
        <p>between two pixel masks</p>
        <sec id="sec-2-2-1">
          <title>Input:</title>
<p>M1 ∈ {0,1}^(N×N) = pixel mask of the first object
M2 ∈ {0,1}^(N×N) = pixel mask of the second object
Output:
Overlap / No Overlap
Time Complexity:
O(N²)
Algorithm:
for i = 0 to N-1 do
  for j = 0 to N-1 do
    if M1[i,j] and M2[i,j] then
      return Overlap
return No Overlap</p>
        </sec>
      </sec>
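      <p>The scan of algorithm 2 can likewise be vectorized; the following minimal sketch assumes the two masks are boolean NumPy arrays of equal shape, as produced by a segmentation model.</p>
      <preformat>
import numpy as np

def masks_overlap(mask_a: np.ndarray, mask_b: np.ndarray) -> bool:
    """Return True if two boolean pixel masks share at least one pixel.

    Equivalent to the double loop of algorithm 2, but vectorized:
    the element-wise AND and the reduction still visit every pixel,
    so the asymptotic cost remains O(N^2) for N x N masks.
    """
    return bool(np.logical_and(mask_a, mask_b).any())

# Example with two tiny 4x4 masks that overlap in one pixel.
hand = np.zeros((4, 4), dtype=bool); hand[1:3, 1:3] = True
shirt = np.zeros((4, 4), dtype=bool); shirt[2:4, 2:4] = True
print(masks_overlap(hand, shirt))  # True
      </preformat>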
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
<p>Object detection and image segmentation are two fundamental problems in computer vision. Before the incredible success of deep learning these tasks were performed using solely standard computer vision algorithms. For example, selective search [24, 20] leverages the hierarchical structure of images and, starting from an initial segmentation, recursively merges similar patches in terms of color, texture, size and shape [4]. State of the art deep learning models for detection and segmentation are based on the R-CNN architecture introduced in [10]. This network receives as input a set of region proposals, which are the candidates for the classification (the architecture is independent of the proposal algorithm used); a large pre-trained CNN is then used to extract features from the selected regions, and finally class-specific linear Support Vector Machines (SVMs) classify the regions [21]. The main problem of this architecture was the long evaluation time, preventing the model from being used online, hence Fast R-CNN [9] was introduced to speed up evaluation. This model learns to classify object proposals and to refine their spatial locations jointly. Each region proposal is mapped into a fixed-length feature vector using interleaved convolutional and pooling layers followed by fully connected layers. The feature vector then flows into two output branches, whose outputs are respectively softmax class probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.</p>
<p>For the task of object detection we used Faster R-CNN [22], an extension of Fast R-CNN that avoids the bottleneck of the region proposal module through the introduction of a Region Proposal Network (RPN). The RPN is a fully convolutional network that shares the convolutional features of the detection network and simultaneously predicts object bounds and objectness scores at each position. It is trained end-to-end to generate high-quality region proposals, which are then used by Fast R-CNN for detection.</p>
      <p>For the task of image segmentation, we used Mask
R-CNN [13] that extends the Faster R-CNN architecture
with a branch for predicting an object mask in parallel
with the existing branch for bounding box recognition.
This network adds only a small overhead with respect
to Faster R-CNN and it runs at 5 fps. Moreover, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation competition [17].</p>
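      <p>As a purely illustrative sketch (using a pretrained torchvision model rather than the implementation adopted in this work, see section 5), the snippet below shows how Mask R-CNN returns, for each detected object, the bounding box used by the Bounding Box approach and the pixel mask used by the Pixel approach.</p>
      <preformat>
import torch
import torchvision

# Pretrained Mask R-CNN with a ResNet-50 FPN backbone (COCO weights).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# A dummy 3-channel image; a real store frame would be loaded instead.
image = torch.rand(3, 384, 448)

with torch.no_grad():
    output = model([image])[0]

boxes = output["boxes"]        # (num_objects, 4): x_min, y_min, x_max, y_max
masks = output["masks"] &gt; 0.5  # (num_objects, 1, H, W) boolean pixel masks
scores = output["scores"]      # confidence of each detection
      </preformat>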
    </sec>
    <sec id="sec-4">
      <title>4. Datasets</title>
<p>The models from the R-CNN family are trained with labelled and annotated images. We trained two separate models, respectively for hands and clothes recognition, hence the objects of interest are labelled with a single label. For the task of object detection the annotation consists of the four values that identify the bounding box, namely the x and y pixel coordinates of the center, and the width and height in pixels. For the segmentation task the ground truth is another image with the same dimensions as the original image, where the pixels that belong to the object of interest are white (mask) and the background is black (see figure 1). The network only allows dimensions such as 256, 320 or 384, i.e. whatever is divisible by 2 at least 6 times. For this reason each image in both the Hands and Clothing datasets has dimensions 384×448. The Hands Dataset (see figure 2) was obtained by collecting 400 images for training and 100 for testing from three well-known datasets: EgoHands [3], HandOverFace [8] and EgoYouTubeHands [25]. Moreover, 40 images for training and 10 for testing were added manually. We chose images from multiple datasets to have representations of hands in different contexts, in order to improve the generalization power of the model. For the clothing recognition task we built a dataset of 500 images labelled with the following four labels (see figure 3): t-shirt, trousers, skirt, long sleeve. We followed the common practice of partitioning the dataset using 80% for training and the remainder for validation. Some images were randomly selected through Google Search, some were taken from the well-known Clothing Dataset [11] and others were added manually. In both datasets, images were annotated using the VIA Annotation Software [7], an open-source lightweight tool that runs in the web browser and allows to annotate images with bounding boxes or pixel masks.</p>
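      <p>The dimension constraint mentioned above comes from the repeated halvings of the feature maps: each side must be divisible by 2 at least 6 times, i.e. by 64. A small helper enforcing it could look as follows; the use of PIL and the rounding policy are assumptions of the example.</p>
      <preformat>
from PIL import Image

STRIDE = 64  # each side must be divisible by 2 six times (2**6 = 64)

def resize_to_valid(img: Image.Image) -> Image.Image:
    """Resize an image so both sides are multiples of 64, e.g. 384x448."""
    w = max(STRIDE, round(img.width / STRIDE) * STRIDE)
    h = max(STRIDE, round(img.height / STRIDE) * STRIDE)
    return img.resize((w, h))

# Example: a 400x430 photo is mapped to the valid size 384x448.
print(resize_to_valid(Image.new("RGB", (400, 430))).size)  # (384, 448)
      </preformat>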
    </sec>
    <sec id="sec-5">
      <title>5. Training</title>
<p>We trained the two models separately, respectively for hands and clothes detection and segmentation. We used the Mask R-CNN implementation provided by [1] both for detection (bounding box) and segmentation. Recall that Mask R-CNN is an extension of Faster R-CNN that adds a branch for predicting the mask, while the rest of the architecture is unchanged, including the branch for bounding box regression. This implementation requires only the annotation with the pixel mask; the ground-truth bounding box is computed on the fly by picking the smallest box that encapsulates all the pixels of the mask. Both models have been fine-tuned (all layers) for 50 epochs, with a learning rate of 0.0001, a weight decay of 0.00001, ResNet-101 [12] as backbone and some data augmentation techniques to improve the performance of Mask R-CNN. Figure 4 shows the learning curves, while figure 5 illustrates the single components of the loss function for the validation set. Considering that the two models have similar plots, we illustrate only those regarding the model trained on the Clothing dataset for the sake of conciseness.</p>
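      <p>Deriving the ground-truth box from the mask on the fly amounts to taking the extreme coordinates of the mask pixels, as in the following NumPy sketch (the boolean array convention is an assumption of the example).</p>
      <preformat>
import numpy as np

def mask_to_bbox(mask: np.ndarray) -> tuple:
    """Smallest axis-aligned box enclosing all True pixels of a mask.

    Returns (x_min, y_min, x_max, y_max) for a boolean (H, W) mask.
    """
    ys, xs = np.where(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((448, 384), dtype=bool)
mask[100:200, 50:120] = True
print(mask_to_bbox(mask))  # (50, 100, 119, 199)
      </preformat>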
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
<p>At this point we have the two models trained to respectively detect (with a bounding box) and segment clothes and hands. In order to test the two approaches (bounding box vs pixel mask) we manually built a new dataset of 100 photographed images containing hands and clothes, and we labelled each image with one of the two labels: overlap and no overlap. To check the overlap we used algorithms 1 and 2, respectively for the bounding box and pixel approaches. The result is a set of images with their associated labels. Table 1 reports some metrics that are commonly used to evaluate the detection: the F1 score is 0.66 for the Bounding Box approach and 0.88 for the Pixel Mask approach. Figure 6 shows the confusion matrices for the two approaches.</p>
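      <p>For reference, the reported F1 score can be computed from the predicted and ground-truth overlap labels as sketched below (the binary label encoding, with 1 meaning overlap, is an assumption of the example).</p>
      <preformat>
def f1_score(y_true: list, y_pred: list) -> float:
    """F1 score for binary overlap labels (1 = overlap, 0 = no overlap)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: 3 correct detections, 1 false positive, 1 false negative.
print(round(f1_score([1, 1, 1, 0, 1], [1, 1, 1, 1, 0]), 2))  # 0.75
      </preformat>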
<p>From these metrics it emerges that the Pixel approach is much superior to the Bounding Box approach. In particular, the Bounding Box approach returns a lot of false positives, because the bounding boxes often overlap while the objects inside do not, as shown in figure 7.</p>
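      <p>Putting the pieces together, the per-frame decision of the Pixel approach reduces to checking every detected hand mask against every detected garment mask and flagging the touched garments for sanitization; the following schematic sketch (function and variable names are illustrative) summarizes this glue logic.</p>
      <preformat>
import numpy as np

def touched_items(hand_masks, garment_masks):
    """Indices of garment masks sharing at least one pixel with any hand mask."""
    flagged = set()
    for hand in hand_masks:
        for idx, garment in enumerate(garment_masks):
            if np.logical_and(hand, garment).any():  # algorithm 2, vectorized
                flagged.add(idx)
    return sorted(flagged)

# Toy example: one hand touching the first of two garments.
hand = np.zeros((4, 4), dtype=bool); hand[0, 0] = True
shirt = np.zeros((4, 4), dtype=bool); shirt[0, 0] = True
skirt = np.zeros((4, 4), dtype=bool); skirt[3, 3] = True
print(touched_items([hand], [shirt, skirt]))  # [0]
      </preformat>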
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
<p>In this work a system to efficiently detect the human interaction with objects in clothing stores was presented. The proposed system can be easily adapted to a variety of other fields by changing the datasets used for the object detection and segmentation tasks. We presented two approaches, the former based on object detection with bounding boxes and the latter based on segmentation, and we showed that the second one performs much better at the cost of a small overhead. A further improvement to the proposed model would be the introduction of depth information. This extension, however, would increase the performance at the expense of a higher cost for more specialized hardware, and this factor could limit its widespread use. That said, we think that our system achieves good enough results to be implemented in physical stores as a highly cost-effective tool for the containment of the Covid-19 pandemic.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <sec id="sec-7-1">
        <title>Abhishek Dutta and Andrew Zisserman. “The VIA annotation software for images, audio and video”.</title>
        <p>In: Proceedings of the 27th ACM international
conference on multimedia. 2019, pp. 2276–2279.</p>
      </sec>
      <sec id="sec-7-2">
<title>Sakher Ghanem, Ashiq Imran, and Vassilis Athitsos. “Analysis of hand segmentation on challenging hand over face scenario”.</title>
        <p>In: Proceedings of the 12th ACM International Conference on PErvasive Technologies Related to Assistive Environments. 2019, pp. 236–242.</p>
      </sec>
      <sec id="sec-7-3">
<title>Ross Girshick. “Fast R-CNN”.</title>
        <p>In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 1440–1448.</p>
      </sec>
      <sec id="sec-7-4">
<title>Ross Girshick et al. “Rich feature hierarchies for accurate object detection and semantic segmentation”.</title>
        <p>In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2014, pp. 580–587.</p>
      </sec>
      <sec id="sec-7-5">
<title>Alexey Grigorev. Clothing dataset. 2020.</title>
        <p>url: https://www.kaggle.com/agrigorev/clothing-dataset-full.</p>
      </sec>
      <sec id="sec-7-6">
<title>Kaiming He et al. “Deep residual learning for image recognition”.</title>
        <p>In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778.</p>
      </sec>
      <sec id="sec-7-7">
<title>Kaiming He et al. “Mask R-CNN”.</title>
        <p>In: Proceedings of the IEEE international conference on computer vision. 2017, pp. 2961–2969.</p>
      </sec>
      <sec id="sec-7-8">
<title>Rae Yule Kim. “The impact of COVID-19 on consumers: Preparing for digital sales”.</title>
        <p>In: IEEE Engineering Management Review 48.3 (2020), pp. 212–218.</p>
      </sec>
      <sec id="sec-7-9">
<title>Sinan Kockara et al. “Collision detection: A survey”.</title>
        <p>In: 2007 IEEE International Conference on Systems, Man and Cybernetics. IEEE. 2007, pp. 4046–4051.</p>
      </sec>
      <sec id="sec-7-10">
<title>Ming Lin and Stefan Gottschalk. “Collision detection between geometric models: A survey”.</title>
        <p>In: Proc. of IMA conference on mathematics of surfaces. Vol. 1. Citeseer. 1998, pp. 602–608.</p>
      </sec>
      <sec id="sec-7-11">
<title>Tsung-Yi Lin et al. “Microsoft COCO: Common objects in context”.</title>
        <p>In: European conference on computer vision. Springer. 2014, pp. 740–755.</p>
      </sec>
      <sec id="sec-7-12">
<title>V. Marcotrigiano et al. “An integrated control plan in primary schools: Results of a field investigation on nutritional and hygienic features in the Apulia region (Southern Italy)”.</title>
        <p>In: Nutrients 13.9 (2021). doi: 10.3390/nu13093006.</p>
      </sec>
      <sec id="sec-7-13">
        <title>C. Napoli, G. Pappalardo, and E. Tramontana. “An agent-driven semantical identifier using radial basis neural networks and reinforcement learning”. In: vol. 1260. 2014.</title>
      </sec>
      <sec id="sec-7-14">
<title>B.A. Nowak et al. “Multi-class nearest neighbour classifier for incomplete data handling”. In: vol. 9119. 2015, pp. 469–480. doi: 10.1007/978-3-319-19324-3_42.</title>
      </sec>
      <sec id="sec-7-15">
<title>Shaoqing Ren et al. “Faster R-CNN: Towards real-time object detection with region proposal networks”.</title>
        <p>In: Advances in neural information processing systems 28 (2015), pp. 91–99.</p>
      </sec>
      <sec id="sec-7-16">
<title>S. Russo et al. “Reducing the psychological burden of isolated oncological patients by means of decision trees”.</title>
        <p>In: vol. 2768. 2020, pp. 46–53.</p>
      </sec>
      <sec id="sec-7-17">
<title>Jasper RR Uijlings et al. “Selective search for object recognition”.</title>
        <p>In: International journal of computer vision 104.2 (2013), pp. 154–171.</p>
      </sec>
      <sec id="sec-7-18">
<title>Aisha Urooj and Ali Borji. “Analysis of hand segmentation in the wild”.</title>
        <p>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 4710–4719.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>