=Paper=
{{Paper
|id=Vol-3092/p14
|storemode=property
|title=Contagion Prevention of COVID-19 by means of Touch Detection for Retail Stores
|pdfUrl=https://ceur-ws.org/Vol-3092/p14.pdf
|volume=Vol-3092
|authors=Rafał Brociek,Giorgio De Magistris,Francesca Cardia,Federica Coppa,Samuele Russo
|dblpUrl=https://dblp.org/rec/conf/system/BrociekMCCR21
}}
==Contagion Prevention of COVID-19 by means of Touch Detection for Retail Stores==
Rafał Brociek¹, Giorgio De Magistris², Francesca Cardia², Federica Coppa², Samuele Russo³

¹ Sapienza University of Rome, piazzale Aldo Moro 5, Roma 00185, Italy
² Department of Computer, Automation and Management Engineering, Sapienza University of Rome, via Ariosto 25, Roma 00185, Italy
³ Department of Psychology, Sapienza University of Rome, via dei Marsi 78, Roma 00185, Italy

SYSTEM 2021 @ Scholar's Yearly Symposium of Technology, Engineering and Mathematics, July 27–29, 2021, Catania, IT
rafal.brociek@polsl.pl (R. Brociek); demagistris@diag.uniroma1.it (G. De Magistris); cardia.1759331@studenti.uniroma1.it (F. Cardia); coppa.1749614@studenti.uniroma1.it (F. Coppa); samuele.russo@uniroma1.it (S. Russo)
ORCID: 0000-0002-7255-6951 (R. Brociek); 0000-0002-3076-4509 (G. De Magistris); 0000-0002-1846-9996 (S. Russo)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The recent Covid-19 pandemic has changed many aspects of people's lives. One of the principal concerns is how easily the virus spreads through infected items. Of special concern are physical stores, where the same items can be touched by many people throughout the day. In this paper a system to efficiently detect human interaction with clothes in clothing stores is presented. The system recognizes the elements that have been touched, allowing a selective sanitization of potentially infected items. Two approaches are presented and compared: the pixel approach and the bounding box approach. The former has better detection performance while the latter is slightly more efficient.

1. Introduction

The recent Covid-19 pandemic has severely affected most commercial activities [18, 5, 23]. The restrictions imposed by governments to counter the spreading of the virus had a big impact on most retail stores, favouring online shopping [14], where the infection risk through infected items is obviously reduced. In this context, an efficient sanitization of stores would decrease the exposure to infection [6, 2], making people more inclined to return to physical shopping.

Some contexts, especially where several people are present at the same time, do not allow keeping every part of the environment under control. In particular it is difficult to stay aware of all the physical contacts between people, with the environment and with its contents. During the COVID-19 pandemic it has become necessary to constantly sanitize the environment and all its potentially contaminated parts, so any help that can facilitate this task would be of great use in such contexts. Moreover, the sanitizing actions carried out by an employee in the presence of the customer can, in certain circumstances, induce a feeling of annoyance or discomfort. However, postponing the intervention can be difficult, because the cleaner would not remember precisely which parts of the environment came into contact with the customer, and likewise cannot take them all into consideration, because at that moment he was not present or his attention was focused elsewhere. An automatic system capable of recognizing and remembering potentially contaminated areas or objects can considerably reduce the effort for the sanitizer and improve the accuracy and effectiveness of his action. At the same time, such a solution would considerably reduce the discomfort that the customer can experience when a sanitizer disinfects, in front of them, every object they have touched. This allows the customer to avoid embarrassment while maintaining a relationship of trust, reducing the risk of a mortification by which the customer would feel limited in expressing his own behaviour while exploring the store. The same principle is applicable to professional studios and other facilities, where building an alliance and a relationship of trust between the professional and the client is always a critical and delicate moment that must be managed with the utmost sensibility.

The aim of this work is to create a system that helps sellers sanitize items faster and more efficiently, knowing which products should be sanitized and which should not. In particular, we designed a system for clothing stores, but the same solution can be adapted to many other retail stores. The general idea is a system able to detect the touch action. We restricted the context to clothing stores because the model is more efficient when trained on a specific set of objects. Moreover, clothing stores represent one of the commercial activities with a higher risk of spreading the virus, since people continuously touch and try on clothes before buying them.

In section 2 we formalize the problem of touch detection and relate it to the state of the art. Sections 3 and 4 respectively describe the models and the datasets used in the proposed system. In section 5 we illustrate the training strategy and some implementation details, in order to make the system easily reproducible. In section 6 we report the performance metrics and compare the different approaches. In section 7 conclusions are drawn.
2. Problem Definition and State of the Art

This paper presents a new method for detecting the "touch" event; in particular, we narrowed the scope to the action of touching clothes with one's hands. The collision detection task in a 3D environment is a well studied problem in the literature [16, 19, 15] and finds application in many fields such as robotics and video gaming. However, to the best of our knowledge, there is no equivalent formulation in the context of 2D images, where there is no depth information. According to our formulation, touch detection is based on the recognition of the objects of interest (in this case clothes and hands). The result of such recognition, depending on the method used, is either a set of coordinates identifying a bounding box (detection) or a pixel mask (segmentation). The bounding box or the pixel mask is then used to check whether the two objects overlap. We will refer to the former as the Bounding Box approach and to the latter as the Pixel approach. In the first case a simple check on the coordinates is sufficient (see algorithm 1), while in the other a parallel scan of the pixels of the two masks is needed (see algorithm 2).

Algorithm 1: check the overlap between two rectangular bounding boxes
  Input: A, B = upper-left and bottom-right vertices of the first rectangle
         A', B' = upper-left and bottom-right vertices of the second rectangle
  Output: Overlap / No Overlap
  Time complexity: O(1)
  if A'.x > B.x or A.x > B'.x then return No Overlap
  if B.y > A'.y or B'.y > A.y then return No Overlap
  return Overlap

Algorithm 2: check the overlap between two pixel masks
  Input: M1 = N×N pixel mask of the first object
         M2 = N×N pixel mask of the second object
  Output: Overlap / No Overlap
  Time complexity: O(N²)
  for i = 0 to N-1 do
    for j = 0 to N-1 do
      if M1[i,j] and M2[i,j] then return Overlap
  return No Overlap
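To make the two checks concrete, here is a minimal Python sketch of both algorithms (ours, not from the paper). Boxes are given as (upper-left, bottom-right) corner pairs in image coordinates, where y grows downward, so the vertical test mirrors the y-up notation used in algorithm 1; the O(N²) pixel scan of algorithm 2 is vectorized with NumPy.

```python
import numpy as np

def boxes_overlap(A, B, A2, B2):
    """Algorithm 1: O(1) overlap test between two axis-aligned rectangles.

    A, B (and A2, B2) are the (x, y) upper-left and bottom-right corners,
    in image coordinates (y grows downward)."""
    if A2[0] > B[0] or A[0] > B2[0]:  # separated along x
        return False
    if A2[1] > B[1] or A[1] > B2[1]:  # separated along y
        return False
    return True

def masks_overlap(M1, M2):
    """Algorithm 2: overlap test between two boolean pixel masks.

    Equivalent to the paper's O(N^2) double scan, vectorized."""
    return bool(np.logical_and(M1, M2).any())

# Toy usage: two 4x4 masks sharing a single pixel, two overlapping boxes.
m1 = np.zeros((4, 4), dtype=bool); m1[1, 1] = True
m2 = np.zeros((4, 4), dtype=bool); m2[1, 1] = True
assert masks_overlap(m1, m2)
assert boxes_overlap((0, 0), (2, 2), (1, 1), (3, 3))
```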
3. Method

Object detection and image segmentation are two fundamental problems in computer vision. Before the incredible success of deep learning, these tasks were performed using solely standard computer vision algorithms. For example, selective search [24, 20] leverages the hierarchical structure of images and, starting from an initial segmentation, recursively merges similar patches in terms of colour, texture, size and shape [4]. State of the art deep learning models for detection and segmentation are based on the R-CNN architecture introduced in [10]. This network receives as input a set of region proposals, which are the candidates for classification (the architecture is independent of the proposal algorithm used); a large pre-trained CNN is then used to extract features from the selected regions, and class-specific linear Support Vector Machines (SVMs) classify the regions [21]. The main problem of this architecture was the long evaluation time, preventing online usage, hence Fast R-CNN [9] was introduced to speed up evaluation. This model learns to classify object proposals and to refine their spatial locations jointly. Each region proposal is mapped into a fixed-length feature vector using interleaved convolutional and pooling layers followed by fully connected layers. The feature vector then flows into two output branches whose outputs are, respectively, softmax class probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.

For the task of object detection we used Faster R-CNN [22], an extension of Fast R-CNN that avoids the bottleneck of the region proposal module through the introduction of a Region Proposal Network (RPN). The RPN is a fully convolutional network, sharing the convolutional features of the detection network, that simultaneously predicts object bounds and objectness scores at each position. It is trained end-to-end to generate high-quality region proposals, which are then used by Fast R-CNN for detection.

For the task of image segmentation we used Mask R-CNN [13], which extends the Faster R-CNN architecture with a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. This network adds only a small overhead with respect to Faster R-CNN and runs at 5 fps. Moreover, Mask R-CNN surpassed all previous state-of-the-art single-model results on the COCO instance segmentation benchmark [17].
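The paper uses the Keras/TensorFlow implementation of [1]; purely to illustrate the two kinds of output described above (boxes for the Bounding Box approach, masks for the Pixel approach), here is a sketch using torchvision's off-the-shelf Mask R-CNN as an assumed stand-in, not the authors' code. The file name and the 0.7 score threshold are arbitrary placeholders.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Off-the-shelf Mask R-CNN (ResNet-50 FPN backbone, COCO weights);
# the paper instead fine-tunes two single-purpose models on its own data.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = to_tensor(Image.open("store_frame.jpg").convert("RGB"))  # placeholder file
with torch.no_grad():
    out = model([image])[0]

keep = out["scores"] > 0.7            # arbitrary confidence threshold
boxes = out["boxes"][keep]            # (K, 4) boxes (x1, y1, x2, y2): Bounding Box approach
masks = out["masks"][keep, 0] > 0.5   # (K, H, W) binarized soft masks: Pixel approach
```

The boxes feed algorithm 1 directly, while the binarized masks feed algorithm 2.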
4. Datasets

The models of the R-CNN family are trained with labelled and annotated images. We trained two separate models, respectively for hands and clothes recognition, hence the objects of interest are labelled with a single label. For the task of object detection the annotation consists of the four values that identify the bounding box: the x and y pixel coordinates of the centre, and the width and height in pixels. For the segmentation task the ground truth is another image with the same dimensions as the original, where the pixels that belong to the object of interest are white (mask) and the background is black (see figure 1). The network only accepts dimensions such as 256, 320 or 384, i.e. anything divisible by 2 at least 6 times (multiples of 64). For this reason each image in both the Hands and Clothing datasets has dimensions 384×448.

Figure 1: Pixel masks for a hand (left) and for trousers (right)

The Hands Dataset (see figure 2) was obtained by collecting 400 images for training and 100 for testing from three well-known datasets: EgoHands [3], HandOverFace [8] and EgoYouTubeHands [25]. Moreover, 40 images for training and 10 for testing were added manually. We chose images from multiple datasets to have representations of hands in different contexts, in order to improve the generalization power of the model.

Figure 2: Samples from the Hands Dataset

For the clothing recognition task we built a dataset of 500 images labelled with the following four labels (see figure 3): t-shirt, trousers, skirt, long sleeve. We followed the common practice of partitioning the dataset using 80% for training and the remainder for validation. Some images were randomly selected through Google Search, some were taken from the known Clothing Dataset [11] and others were added manually.

Figure 3: Samples from each of the four categories of the Clothing dataset, from left to right: t-shirt, trousers, skirt, long sleeve

In both datasets, images were annotated using the VIA Annotation Software [7], an open source lightweight tool that runs in the web browser and allows annotating images with bounding boxes or pixel masks.

5. Training

We trained the two models separately, respectively for hands and for clothes detection and segmentation. We used the Mask R-CNN implementation provided by [1] both for detection (bounding box) and for segmentation. Recall that Mask R-CNN is an extension of Faster R-CNN that adds a branch for predicting the mask, while the rest of the architecture is unchanged, including the branch for bounding box regression. This implementation requires only the pixel mask annotation; the ground-truth bounding box is computed on the fly by picking the smallest box that encapsulates all the pixels of the mask. Both models were fine-tuned (all layers) for 50 epochs, with a learning rate of 0.0001, a weight decay of 0.00001, ResNet-101 [12] as backbone, and some data augmentation techniques to improve the performance of Mask R-CNN. Figure 4 shows the learning curves, while figure 5 illustrates the single components of the loss function on the validation set. Since the two models have similar plots, for conciseness we illustrate only those of the model trained on the Clothing dataset.

Figure 4: Learning curves, from left to right the training and validation loss

Figure 5: Components of the loss function of the Mask R-CNN. The rpn_class_loss (5a) and rpn_bbox_loss (5b) indicate respectively how well the Region Proposal Network separates background from objects and localizes objects, while mrcnn_bbox_loss (5c), mrcnn_class_loss (5d) and mrcnn_mask_loss (5e) measure the performance of Mask R-CNN in localizing, labelling and segmenting objects.
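As a reproducibility aid, the sketch below shows how the stated hyperparameters (50 epochs, learning rate 0.0001, weight decay 0.00001, ResNet-101 backbone, fine-tuning of all layers, augmentation) map onto the matterport Mask R-CNN API [1]. The class name, the COCO starting weights, the flip augmentation and the images-per-GPU setting are our assumptions; dataset loading (subclasses of mrcnn.utils.Dataset) is omitted, and the on-the-fly ground-truth boxes mentioned above correspond to the library's utils.extract_bboxes.

```python
import imgaug.augmenters as iaa
from mrcnn.config import Config
from mrcnn import model as modellib

class ClothesConfig(Config):
    """Hyperparameters from section 5, expressed as a matterport Config."""
    NAME = "clothes"
    NUM_CLASSES = 1 + 4          # background + t-shirt, trousers, skirt, long sleeve
    BACKBONE = "resnet101"       # ResNet-101 [12]
    LEARNING_RATE = 0.0001
    WEIGHT_DECAY = 0.00001
    IMAGES_PER_GPU = 1           # assumption: not stated in the paper

def fine_tune(dataset_train, dataset_val):
    """Fine-tune all layers for 50 epochs, as described in the text."""
    config = ClothesConfig()
    model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")
    # Start from COCO weights, dropping the heads that depend on NUM_CLASSES.
    model.load_weights("mask_rcnn_coco.h5", by_name=True,
                       exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                                "mrcnn_bbox", "mrcnn_mask"])
    # "Some data augmentation techniques": horizontal flips as a simple example.
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=50, layers="all",
                augmentation=iaa.Fliplr(0.5))
    return model
```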
6. Results

At this point we have the two models trained to detect (with a bounding box) and to segment clothes and hands, respectively. In order to test the two approaches (bounding box vs pixel mask) we manually built a new dataset of 100 photographed images containing hands and clothes, and we labelled each image with one of the two labels overlap and no overlap. To check the overlap we used algorithms 1 and 2, respectively, for the bounding box and pixel approaches. The result is a set of images with their associated labels. Table 1 reports some metrics that are commonly used to evaluate the detection, while figure 6 shows the confusion matrices for the two approaches.

Table 1: Evaluation metrics for the task of touch detection. The first row reports the metrics for the Bounding Box approach, the second row those for the Pixel approach.

                 Accuracy   Precision   Recall   F1 score
  Bounding Box     0.56       0.53       0.89      0.66
  Pixel Mask       0.88       0.89       0.88      0.88

Figure 6: The confusion matrices for the touch detection task. On the left the values refer to the Bounding Box approach, on the right to the Pixel approach.

From these metrics it emerges that the Pixel approach is much superior to the Bounding Box approach. In particular, the Bounding Box approach returns many false positives, because the bounding boxes often overlap while the objects inside them do not, as shown in figure 7.

Figure 7: Bounding box approach (left) compared with the Pixel approach (right). This is an example of a misclassification by the Bounding Box approach (the rectangles are contained one into the other) and a correct classification by the Pixel based approach (the pixel masks have no pixel in common).
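The metrics in Table 1 are the standard binary-classification ones; as an illustration (ours, not part of the paper), the helper below computes them from ground-truth and predicted overlap labels such as those produced on the 100-image test set.

```python
def touch_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary touch labels.

    y_true, y_pred: equal-length sequences of booleans (True = overlap)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))          # true positives
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))  # true negatives
    fp = sum(not t and p for t, p in zip(y_true, y_pred))      # false positives
    fn = sum(t and not p for t, p in zip(y_true, y_pred))      # false negatives
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

With these definitions, the high recall (0.89) but low precision (0.53) of the Bounding Box approach in Table 1 directly reflects the excess of false positives discussed above.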
7. Conclusion

In this work a system to efficiently detect human interaction with objects in clothing stores was presented. The proposed system can be easily adapted to a variety of other fields by changing the datasets used for the object detection and segmentation tasks. We presented two approaches, the former based on object detection with bounding boxes and the latter based on segmentation, and we showed that the second one performs much better at the cost of a small overhead. A further improvement to the proposed model would be the introduction of depth information. This extension, however, would increase the performance at the expense of a higher cost for more specialized hardware, and this factor could limit its widespread use. That said, we think that our system achieves good enough results to be deployed in physical stores as a highly cost-effective tool for the containment of the Covid-19 pandemic.

References

[1] Waleed Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN. 2017.
[2] R. Avanzato et al. "YOLOv3-based mask and face recognition algorithm for individual protection applications". In: vol. 2768. 2020, pp. 41–45.
[3] Sven Bambach et al. "Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions". In: The IEEE International Conference on Computer Vision (ICCV). Dec. 2015.
[4] F. Bonanno et al. "Optimal thicknesses determination in a multilayer structure to improve the SPP efficiency for photovoltaic devices by an hybrid FEM - Cascade Neural Network based approach". In: 2014, pp. 355–362. doi: 10.1109/SPEEDAM.2014.6872103.
[5] P. Caponnetto et al. "The effects of physical exercise on mental health: From cognitive improvements to risk of addiction". In: International Journal of Environmental Research and Public Health 18.24 (2021). doi: 10.3390/ijerph182413384.
[6] Federica Carraturo et al. "Persistence of SARS-CoV-2 in the environment and COVID-19 transmission risk from environmental matrices and surfaces". In: Environmental Pollution 265 (2020), p. 115010.
[7] Abhishek Dutta and Andrew Zisserman. "The VIA annotation software for images, audio and video". In: Proceedings of the 27th ACM International Conference on Multimedia. 2019, pp. 2276–2279.
[8] Sakher Ghanem, Ashiq Imran, and Vassilis Athitsos. "Analysis of hand segmentation on challenging hand over face scenario". In: Proceedings of the 12th ACM International Conference on PErvasive Technologies Related to Assistive Environments. 2019, pp. 236–242.
[9] Ross Girshick. "Fast R-CNN". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1440–1448.
[10] Ross Girshick et al. "Rich feature hierarchies for accurate object detection and semantic segmentation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.
[11] Alexey Grigorev. Clothing dataset. 2020. url: https://www.kaggle.com/agrigorev/clothing-dataset-full.
[12] Kaiming He et al. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
[13] Kaiming He et al. "Mask R-CNN". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2961–2969.
[14] Rae Yule Kim. "The impact of COVID-19 on consumers: Preparing for digital sales". In: IEEE Engineering Management Review 48.3 (2020), pp. 212–218.
[15] Sinan Kockara et al. "Collision detection: A survey". In: 2007 IEEE International Conference on Systems, Man and Cybernetics. IEEE. 2007, pp. 4046–4051.
[16] Ming Lin and Stefan Gottschalk. "Collision detection between geometric models: A survey". In: Proc. of IMA Conference on Mathematics of Surfaces. Vol. 1. Citeseer. 1998, pp. 602–608.
[17] Tsung-Yi Lin et al. "Microsoft COCO: Common objects in context". In: European Conference on Computer Vision. Springer. 2014, pp. 740–755.
[18] V. Marcotrigiano et al. "An integrated control plan in primary schools: Results of a field investigation on nutritional and hygienic features in the Apulia region (Southern Italy)". In: Nutrients 13.9 (2021). doi: 10.3390/nu13093006.
[19] C. Napoli, G. Pappalardo, and E. Tramontana. "An agent-driven semantical identifier using radial basis neural networks and reinforcement learning". In: vol. 1260. 2014.
[20] C. Napoli, G. Pappalardo, and E. Tramontana. "Using modularity metrics to assist move method refactoring of large systems". In: 2013, pp. 529–534. doi: 10.1109/CISIS.2013.96.
[21] B.A. Nowak et al. "Multi-class nearest neighbour classifier for incomplete data handling". In: vol. 9119. 2015, pp. 469–480. doi: 10.1007/978-3-319-19324-3_42.
[22] Shaoqing Ren et al. "Faster R-CNN: Towards real-time object detection with region proposal networks". In: Advances in Neural Information Processing Systems 28 (2015), pp. 91–99.
[23] S. Russo et al. "Reducing the psychological burden of isolated oncological patients by means of decision trees". In: vol. 2768. 2020, pp. 46–53.
[24] Jasper R. R. Uijlings et al. "Selective search for object recognition". In: International Journal of Computer Vision 104.2 (2013), pp. 154–171.
[25] Aisha Urooj and Ali Borji. "Analysis of hand segmentation in the wild". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 4710–4719.