<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Psychoeducative Social Robots for a Healthier Lifestyle using Artificial Intelligence: a Case Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valerio Ponzi</string-name>
          <email>ponzi@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuele Russo</string-name>
          <email>samuele.russo@uniroma1.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Bianco</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Napoli</string-name>
          <email>christian.napoli@uniroma1.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Agata Wajda</string-name>
          <email>agata.wajda@polsl.pl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer, Automation and Management Engineering, Sapienza University of Rome</institution>
          ,
          <addr-line>via Ariosto 25 Roma 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics Applications and Methods for Artificial Intelligence, Faculty of Applied Mathematics, Silesian University of Technology</institution>
          ,
          <addr-line>44-100 Gliwice</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Medical Surgical Sciences and Translational Medicine, Sapienza University of Rome</institution>
          ,
          <addr-line>Via di Grottarossa 1035, Roma 00189</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Department of Psychology, Sapienza University of Rome</institution>
          ,
          <addr-line>via dei Marsi 78 Roma 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>26</fpage>
      <lpage>33</lpage>
      <abstract>
<p>Smoking is the greatest preventable cause of mortality worldwide. In this paper, we present a social experiment in which a mobile robot equipped with a cigarette detector alerts smokers, in particular those smoking close to children. In our research, we compare different methods. In the first case we trained the cigarette detection model on a homemade dataset, starting from the pre-trained SSD MobileNet detection model. In the second case we analyzed how SmokingNet performs when applied to our task. Next, to distinguish between children and adults, we take advantage of a Cascade classifier and a neural network. Both networks are built to leverage TensorFlow Lite, a mobile-friendly format that enables on-device inference. When a smoking scene is identified, the mobile robot draws near the smoker and issues a warning based on the circumstances.</p>
      </abstract>
      <kwd-group>
<kwd>Smoking</kwd>
        <kwd>Object Detection</kwd>
        <kwd>Cascade Classifier</kwd>
        <kwd>Raspberry Pi</kwd>
        <kwd>TensorFlow Lite</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Smoking is the leading preventable cause of death</title>
        <p>
          and disability. According to WHO (World Health Organization) [1, 2] data, tobacco directly causes over 7 million deaths, while around 1.2 million deaths are the result of non-smokers being exposed to secondhand smoke. Secondhand smoke is dangerous, especially for children, and can increase their risk of multiple health issues. For many years it has been forbidden to smoke in the presence of children in public places, except in those dedicated to smokers. In recent times, a further restriction has been imposed by the Council of Ministers, which has issued several legislative decrees making the anti-smoking regulations even more restrictive. In addition to the medical aspects, linked to the now evident consequences associated with passive smoking, there are other damages connected to the development of a fascination with cigarettes and the consequent development of a possible addiction [
          <xref ref-type="bibr" rid="ref19">3</xref>
          ]. In fact, starting from Albert Bandura's studies on social learning theory [4], it is highlighted that learning of pro-social or anti-social behaviors can also occur without direct contact with objects; that is, learning can also occur through indirect experiences, through the observation of other people [5, 6]. Bandura used the term modeling (imitation) to identify a learning process that is activated when the behavior of an observing individual changes according to the behavior of another individual who acts as a model. The behavior is thus the result of a process of acquiring information from other individuals. Furthermore, Bandura synthesizes a series of properties acting in a modeling situation, which influence the impact of the learned information on performance: the identification that is established between model and modeled is identified as a fundamental characteristic of observational (or vicarious) learning. The higher it is, the more the learning will affect the behavior of the observer. So, for example, according to this theory, a child who daily observes a reference adult who smokes will learn this behavior more easily, since he is exposed to behavior patterns that "normalize" the use of cigarettes on a daily basis. This theory is also called social learning because it focuses on the identification mechanism that links the observer to the observed. This identification process is also linked to affective aspects, and it is often found in identifying behaviors that people adopt in certain roles or social characters. It therefore becomes essential to reduce the child's exposure to potentially dangerous behaviors, such as cigarette smoking, as much as possible.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>The aim of this research is to develop a mobile robot capable of detecting smokers and raising a warning if they are in the proximity of youngsters</title>
        <p>To achieve this, we split the objective into two main tasks: cigarette detection through a camera stream, and classification between adults and teenagers. Cigarette detection is analyzed with two different methods. In the first, we started from a pre-trained SSD MobileNet model [7] with a feature pyramid network as a subnetwork. The latter is especially well-suited for mobile-oriented applications, since it gets rid of the main memory access constraints of a large number of embedded hardware platforms. In the second case, instead, we used SmokingNet [8], which detects smoking photos by utilizing the feature extraction capabilities of convolutional neural networks. The program is hosted on a Raspberry Pi [9, 10]. Once a cigarette is detected, the detection process determines the presence of people in the frame using the Haar-cascade frontal face classifier [11, 12, 13], which requires significantly less hardware computation. The detected face is then extracted from the image and fed as input to a deep neural network. The network acts as a binary classifier trained on the UTKFace dataset to distinguish children from adults. To make the models portable, both the detection and classification models are written using the TFLite [14] library, and inference is performed on the Raspberry Pi 4. The mobile robot will approach the adult and issue a warning if there are children nearby.</p>
        <sec id="sec-1-3">
          <title>2. Related Works</title>
          <p>There has been some research on smoking detection through different methods. In [15] the authors suggest a smoking-gesture-based detection method. It captures changes in the orientation of a person's arm, and uses a machine learning pipeline that processes this data to accurately detect smoking gestures and sessions in real time.</p>
          <p>In [16] the authors proposed a machine learning method for puffing and smoking detection using data from a wrist accelerometer. More recent approaches suggest the use of latest-generation techniques based on object detection of the cigarette itself.</p>
          <p>In [17] the author presents object detection of cigarette litter on sidewalks. The system is designed to work in real time by exploiting a lighter version of YOLOv4 [18] (Tiny-YOLOv4), so that the model can be deployed on a mobile robot. The dataset used to train this network was specific to the littering problem, that is, detecting cigarette butts near sidewalks. Since our objective is the identification of situations involving people smoking, we cannot expect to achieve high performance by applying transfer learning and fine-tuning on their pretrained network.</p>
          <p>In [19] the authors use a YOLOv2 deep-learning image-based methodology for detecting the cigarette of a smoking driver. The driver's images are captured by a dual-mode visible-light and near-infrared camera, and the developed system judges whether or not there is driver smoking behavior in both day and night conditions. Since their dataset is specific to foreground cigarette detection, their results are not directly transferable to our case study.</p>
        </sec>
        <sec id="sec-1-4">
          <title>3. Datasets</title>
          <sec id="sec-1-4-1">
            <title>3.1. Cigarette Dataset</title>
            <p>The dataset was realized with the objective of identifying cigarettes in a variety of situations. The backgrounds and the quality of the pictures are varied, and the cigarette can be extremely large or extremely small in comparison to the entire image. The images were uploaded to Roboflow [20], an annotation tool that allows users to upload files, including images, annotations, and videos. It supports a wide variety of annotation formats and makes it simple to add new training data as it is collected. The format of the dataset is set to TFRecords. During the annotation process, some images were discarded due to their low quality, such as cigarettes that are obscured by other objects or barely visible to human eyes. The remaining dataset consists of 2017 images that have been divided into a training set and a test set with a 9:1 ratio: the training set contains 1816 images, while the test set contains 201 images. A representative sample of the dataset can be found in Figure 1.</p>
          </sec>
          <sec id="sec-1-2-1">
            <title>3.2. UTKFace Dataset</title>
            <p>The UTKFace dataset [21] is a large-scale face collection with a wide age range (from 0 to 116 years old). The dataset contains over 20,000 face images with age, gender, and ethnicity annotations. The images demonstrate a wide range of poses, facial expressions, illumination, occlusion, and resolution. This dataset has the potential to be used for a variety of tasks, including face detection, age estimation, age progression/regression, and landmark localization. We only use the age information, to classify those under the age of 18 as children and those over the age of 18 as adults.</p>
          </sec>
        </sec>
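        <p>As a rough illustration of this relabeling step, the age encoded in each UTKFace file name can be mapped to the two classes. The sketch below is our own minimal example, not the authors' code; it only assumes the standard UTKFace naming scheme, in which the age is the first underscore-separated field of the file name.</p>
        <preformat><![CDATA[
```python
# Sketch of relabeling UTKFace images for the binary child/adult task.
# UTKFace file names encode the age as the first underscore-separated
# field, e.g. "25_0_1_20170116174525125.jpg"; the 18-year threshold
# mirrors the split described in the text.

def utkface_label(filename: str) -> str:
    """Return 'child' or 'adult' from a UTKFace-style file name."""
    age = int(filename.split("_")[0])
    return "child" if age < 18 else "adult"

if __name__ == "__main__":
    samples = ["4_1_0_20170109193052283.jpg", "36_0_1_20170116174525125.jpg"]
    print([utkface_label(s) for s in samples])  # ['child', 'adult']
```
]]></preformat>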
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Cigarette detection</title>
      <sec id="sec-2-1">
        <title>4.1. SSD MobileNet</title>
        <p>To achieve a trade-off between speed and accuracy, in the first case we used the SSD MobileNet V2 object detection model with the FPN-lite feature extractor, shared box predictor, focal loss, and a 640x640 training image size. SSD [22] is a multi-category single-shot detector that is substantially faster and more accurate than the initial version of YOLO. In the literature there is a variety of more accurate but slower techniques, such as Faster R-CNN; however, since our main focus is to deploy a real-time application, we give more importance to the speed of the model. The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes, using small convolutional filters applied to feature maps. MobileNet V2 [7] significantly improves the performance of mobile models on a variety of tasks and benchmarks. It is based on an inverted residual structure in which shortcut connections are made between thin bottleneck layers. FPN [23] exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids at marginal extra cost. It is a top-down architecture with lateral connections, developed for building high-level semantic feature maps at all scales.</p>
        <p>The model performs poorly on small objects in comparison to large objects (Figures 4 and 5 report the mAP of all objects and of large objects on the test set). This is partly due to SSD MobileNet's features and also to the compressed input size. Large objects reach an accuracy of up to 0.65, while medium and small objects have an accuracy of only 0.57 and 0.29, respectively. This results in a decrease in the overall accuracy over all images.</p>
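        <p>For readers unfamiliar with the default-box mechanism, the decoding step can be sketched as follows. This is a generic illustration of SSD-style center-size decoding, not code from the paper; the (0.1, 0.2) scale factors are the variances commonly used in SSD implementations and are an assumption here.</p>
        <preformat><![CDATA[
```python
import math

# Minimal sketch of the box decoding an SSD-style head performs: each
# default (anchor) box is refined by the predicted offsets.  The
# (0.1, 0.2) variances are common SSD defaults, assumed for this sketch.

def decode_ssd_box(anchor, offsets, var_center=0.1, var_size=0.2):
    """anchor and result are (cx, cy, w, h); offsets are (tx, ty, tw, th)."""
    acx, acy, aw, ah = anchor
    tx, ty, tw, th = offsets
    cx = acx + tx * var_center * aw      # shift the box center
    cy = acy + ty * var_center * ah
    w = aw * math.exp(tw * var_size)     # rescale the box size
    h = ah * math.exp(th * var_size)
    return (cx, cy, w, h)

if __name__ == "__main__":
    # Zero offsets leave the default box unchanged.
    print(decode_ssd_box((0.5, 0.5, 0.2, 0.3), (0.0, 0.0, 0.0, 0.0)))
```
]]></preformat>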
        <sec id="sec-2-1-1">
          <title>4.2. Training</title>
          <p>The entirety of the training process was conducted on Google Colab [24]. Colab allows anybody to write and execute arbitrary Python code through the browser, and is especially well suited to machine learning, data analysis, and education. With Colab Pro, we were able to train our model on a K80 GPU for up to 24 hours. The training procedure is carried out over 20,000 epochs.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.3. SmokingNet</title>
          <p>A different method for cigarette detection is SmokingNet. SmokingNet was announced in 2018; it detects smoking photos by utilizing the feature extraction capabilities of a CNN. The convolution kernels of the CNN's convolutional layers are used to extract local features of a given image, and the features extracted by the first convolutional layer directly affect the feature fusion of the deep network. Based on the shape characteristics of cigarettes, convolution kernels of four sizes are included in the first convolutional layer of SmokingNet. This method can detect smoking images by utilizing only the information of human smoking gestures and cigarette image characteristics, without requiring the actual detection of the cigarette. The model achieves an accuracy and a recall of 0.9.</p>
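          <p>The multi-size-kernel idea can be illustrated with a toy sketch (not the authors' code): kernels of several sizes scan the same image, and their responses are fused into one feature vector, so thin elongated shapes such as cigarettes are captured at more than one scale. Pure Python for clarity; a real implementation would use a deep learning framework.</p>
          <preformat><![CDATA[
```python
# Toy illustration of a first layer with kernels of several sizes.

def conv2d_valid(image, kernel):
    """2D valid cross-correlation of two lists-of-lists."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            row.append(sum(image[y + j][x + i] * kernel[j][i]
                           for j in range(kh) for i in range(kw)))
        out.append(row)
    return out

def multi_size_features(image, kernel_sizes=(1, 3, 5, 7)):
    """Max response per kernel size, fused into one feature vector."""
    feats = []
    for k in kernel_sizes:
        kernel = [[1.0 / (k * k)] * k for _ in range(k)]  # averaging kernel
        response = conv2d_valid(image, kernel)
        feats.append(max(max(row) for row in response))
    return feats

if __name__ == "__main__":
    # A 9x9 image containing a thin vertical "cigarette-like" line.
    img = [[float(x == 4) for x in range(9)] for _ in range(9)]
    print(multi_size_features(img))
```
]]></preformat>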
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Age classification</title>
      <sec id="sec-3-1">
        <title>5.1. Cascade Classifier</title>
        <p>Object detection with Haar feature-based cascade classifiers is a powerful technique proposed in 2001 by Paul Viola and Michael Jones [11]. It is a method for combining successively more complex classifiers in a cascade structure, which dramatically increases the speed of the detector by focusing attention on promising regions of the image. It is a machine learning-based technique that involves training a cascade function on a large number of positive and negative images; the function is then applied to other images in order to detect objects. OpenCV includes pretrained cascade models for the frontal face, eye, body, and even the smile. For our research, we used the default Haar cascade frontal face model.</p>
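        <p>The early-rejection structure described above can be sketched as follows. This is our own toy illustration of the cascade principle, not the actual Viola-Jones implementation: cheap stages run first and reject most candidate regions, so expensive stages only see promising ones. The stage predicates below are hypothetical.</p>
        <preformat><![CDATA[
```python
# Toy sketch of an attentional cascade with early rejection.

def make_cascade(stages):
    """stages: list of (predicate, cost) ordered from cheap to expensive."""
    def classify(region):
        for predicate, _cost in stages:
            if not predicate(region):
                return False      # early rejection: later stages never run
        return True               # survived every stage
    return classify

if __name__ == "__main__":
    # Hypothetical stages over a region summary (mean brightness, variance).
    stages = [
        (lambda r: r["mean"] > 0.2, 1),   # very cheap brightness check
        (lambda r: r["var"] > 0.01, 10),  # more expensive texture check
    ]
    detect = make_cascade(stages)
    print(detect({"mean": 0.5, "var": 0.05}))  # True
    print(detect({"mean": 0.1, "var": 0.00}))  # False
```
]]></preformat>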
      </sec>
      <sec id="sec-3-2">
        <title>5.2. Age classifier</title>
        <p>Image classification is a classical problem in computer vision: the task of assigning a label to an input image from a fixed set of categories. Despite its simplicity, it has a wide range of practical applications. With the UTKFace dataset, we trained a binary classifier to distinguish children from adults, which allows the mobile robot to give different warnings. After training a CNN model for only 10 epochs, the model achieves over 95 percent accuracy. The results can be observed in Figure 8: subfigure 8.a shows the loss tendency through the epochs, subfigure 8.b shows the accuracy over the training and validation sets, and subfigure 8.c shows the confusion matrix of predictions and ground truth on the test set. We can appreciate the absence of overfitting at the end of the training process.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.4. Comparison between SSD MobileNet model and SmokingNet</title>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Performance of the SSD MobileNet model and SmokingNet.</p></caption>
          <table>
            <thead><tr><th>Model</th><th>Precision</th><th>Recall</th></tr></thead>
            <tbody>
              <tr><td>SSD MobileNet</td><td>0.43</td><td>0.46</td></tr>
              <tr><td>SmokingNet</td><td>0.90</td><td>0.90</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>As we can see in Table 1, the results obtained by SmokingNet are much better than those of the SSD MobileNet model. However, this is also due to the type of task, namely judging whether or not a cigarette is present in the image. This method is effective in not very crowded situations; in the presence of a large number of people, the amount of gestures to be analyzed becomes too heavy from a computational point of view, and too many elbow movements are misleading.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Mobile Robot</title>
      <p>The mobile robot chosen for this research is Sapienza's robot MARRtino. MARRtino is a ROS-based low-cost differential drive robot platform that comes in many shapes. MARRtino has been designed to be easy to build and easy to program, but at the same time it uses professional software based on ROS. It is thus suitable to implement and experiment with many typical Robotics and Artificial Intelligence tasks, such as smart navigation, spoken human-robot interaction, image analysis, etc. It uses a differential wheeled drive, hence its movement is based on two separately driven wheels placed on either side of the robot body. It can thus change its direction by varying the relative rate of rotation of its wheels, and hence does not require an additional steering motion.</p>
      <p>On the front of the mobile robot, a 480p webcam is installed, which is essential for our objective of recognizing a smoking scene by detecting cigarettes, adults, and children in its view. The webcam can rotate horizontally and vertically, and thus has two degrees of freedom. After recognizing a smoking scene, the speaker in the center of the mobile robot issues a warning. All of the sensors and components are connected to a drive board.</p>
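      <p>The differential-drive steering just described can be summarized by the standard kinematic relations below. This is a generic sketch; the wheel radius and axle track are illustrative values, not MARRtino's actual specifications.</p>
      <preformat><![CDATA[
```python
# Standard differential-drive kinematics: two wheel speeds determine the
# robot's linear and angular velocity, so no steering motion is needed.

def diff_drive_velocity(omega_left, omega_right, wheel_radius=0.03, track=0.2):
    """Wheel angular speeds (rad/s) -> (linear m/s, angular rad/s)."""
    v_left = wheel_radius * omega_left
    v_right = wheel_radius * omega_right
    linear = (v_right + v_left) / 2.0    # forward speed of the body
    angular = (v_right - v_left) / track # turning rate around the center
    return linear, angular

if __name__ == "__main__":
    print(diff_drive_velocity(10.0, 10.0))  # equal speeds: straight line
    print(diff_drive_velocity(-5.0, 5.0))   # opposite speeds: turn in place
```
]]></preformat>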
      <sec id="sec-4-1">
        <title>6.1. Raspberry Pi</title>
        <p>The brain of our mobile robot is a Raspberry Pi 4B. The Raspberry Pi is a small, powerful, and low-cost embedded device. The Raspberry Pi 4B uses a Broadcom BCM2711 SoC with a 1.5 GHz 64-bit quad-core ARM Cortex-A72 processor and a 1 MB shared L2 cache. The Raspberry Pi Foundation, in collaboration with Broadcom, developed this series of miniature single-board computers (SBCs) in the United Kingdom. Initially, the Raspberry Pi initiative was geared toward promoting the teaching of fundamental computer science in schools and impoverished countries. The first model achieved greater popularity than planned, selling outside of its intended market for applications such as robotics. It is widely utilized in a variety of fields, including weather monitoring, due to its inexpensive cost, modular construction, and open architecture. Thanks to its support for HDMI and USB devices, it is also commonly utilized by computer and electronics hobbyists.</p>
        <sec id="sec-4-1-1">
          <title>6.2. Depth Camera</title>
          <p>There are numerous types of depth cameras, which vary in terms of how they receive world data or how that data is processed in order to present it in a useful format. The sensors can differ in a variety of ways, including acquisition method, resolution, and range. Stereo sensors attempt to replicate human vision by utilizing two cameras addressing the scene with a certain amount of separation between them. The images from these cameras are gathered and then utilized to extract and match visual features (important visual information) in order to create what is known as a disparity map between the cameras' viewpoints. Time of Flight (ToF) sensors illuminate the entire image and determine depth based on the time required for each photon to return to the sensor. This means that each pixel corresponds to a single beam of light projected by the device, resulting in increased data density, fewer shadows cast by objects, and simplified calibration (no stereo matching). By contrast, structured light (SL) sensors make use of a predetermined pattern projected into the scene by the IR sensor. The deformation of the pattern is then used to generate the depth map.</p>
          <p>In this research, we chose a depth camera with a structured light sensor, a specific variant of the Orbbec Astra Pro (Figure 10 shows the Astra Pro, our RGB-D camera). The Astra Pro has a higher-resolution RGB camera as well as a depth camera, and was created to be largely compatible with the existing OpenNI library. Through a Python binding for OpenNI2, we are able to obtain both RGB and depth information from the camera.</p>
          <p>The initial mobile robot was equipped with a standard monocular camera that is unable to determine the depth of a scene. We did, however, investigate the possibility of using an RGB-D camera. Due to mechanical constraints, we were unable to directly mount the camera on the mobile robot with screws, but we devised a method for attaching the camera to the mobile robot's base. When testing with the RGB-D camera, the depth of a detected item can be easily determined: the camera features a depth sensor in addition to the standard color sensor, and the depth sensor can be used to determine the proximity of an object to the camera. As a result, once our object has been detected, it is straightforward to locate the same object on the depth map and calculate its distance.</p>
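          <p>One way the distance computation just described can be sketched is below. This is our own minimal example under stated assumptions, not the deployed code: depth values are assumed to be millimetres on a map aligned with the color image, with 0 marking invalid pixels, as in many structured-light sensors; the detected bounding box is transferred onto the depth map and the median of the valid readings inside it is taken.</p>
          <preformat><![CDATA[
```python
from statistics import median

# Sketch: estimate an object's distance from the depth readings inside
# its detected bounding box.  Assumes millimetre units and 0 = invalid.

def object_distance_mm(depth_map, box):
    """depth_map: list of rows; box: (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    readings = [depth_map[y][x]
                for y in range(y0, y1)
                for x in range(x0, x1)
                if depth_map[y][x] > 0]          # skip invalid pixels
    return median(readings) if readings else None

if __name__ == "__main__":
    depth = [[0, 1200, 1210],
             [1190, 1205, 0],
             [0, 1198, 1202]]
    print(object_distance_mm(depth, (0, 0, 3, 3)))  # 1201.0
```
]]></preformat>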
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7. Implementation of the Whole Structure</title>
      <p>Prior to combining everything, we need to convert the object detection model to TensorFlow Lite in order to deploy it on the Raspberry Pi. TensorFlow Lite is a free, open-source deep learning framework that enables the deployment of TensorFlow models on mobile devices and is optimized for on-device machine learning. After conversion, we can use TensorFlow Lite on our mobile robot to create predictions based on the input data. As illustrated in the plots, the TensorFlow Lite model is still capable of high-performance cigarette detection on the test photos when the cigarettes are sufficiently visible.</p>
      <p>Our task is to determine whether one or more persons are smoking and whether children are near the smoker. Therefore, the object detector will first determine the number of smokers and children in the area. For example, if it detects a cigarette, two adults, and a child, that means there are a smoker and a child there. Then the robot will approach the smoker and attempt to convince them to stop smoking, or inform them that passive smoking is dangerous for children.</p>
    </sec>
    <sec id="sec-5-hri">
      <title>8. Human-Robot Interaction</title>
      <p>In this section we see how the interaction with the robot affects people. Social psychology has shown how explicit prohibitions can lead individuals to develop a conduct opposite to what is required, whereby an explicit prohibition to do something causes the subject to disregard that prohibition and adopt the prohibited behavior. For this reason, a very important aspect was the construction of a dataset of "kind messages" that could be delivered to people to dissuade them from smoking in the presence of children. The characteristics of the messages that we considered relevant were the following: 1) they did not have to contain an explicit prohibition; 2) the sentence had to be short and understandable; 3) the sentence had to have a content that could be judged plausible by the subject; 4) the sentence had to use direct but gentle language. For example, some of the phrases used could contain a message like this: "Please smoke away from here because there is a child", or "Kindly, do not smoke in this area because there are children too close". To evaluate the "quality" of the answers, and the effectiveness of the robot, the subjects were given a questionnaire that contained three types of questions: how they evaluated the relevance of the robot with respect to the task for which it was programmed; how they assessed the robot's kindness in persuading them not to smoke in front of children; and how they assessed the robot's ability to persuade them not to smoke in general. A table (Table 2) and a chart (Figure 11) report the results of the social experiment.</p>
    </sec>
    <sec id="sec-6">
      <title>9. Conclusion</title>
      <sec id="sec-6-1">
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>The questions proposed to the users after the social experiment.</p></caption>
          <table>
            <thead><tr><th>N.</th><th>Question</th><th>% YES</th><th>% NO</th></tr></thead>
            <tbody>
              <tr><td>1</td><td>Was the robot pertinent?</td><td>77</td><td>23</td></tr>
              <tr><td>2</td><td>Was the robot polite and kind?</td><td>83</td><td>17</td></tr>
              <tr><td>3</td><td>Will you smoke near children?</td><td>35</td><td>65</td></tr>
              <tr><td>4</td><td>Will you quit smoking?</td><td>29</td><td>71</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In this research, we deployed a mobile robot to deal with the exposure of youngsters to second-hand smoke. To detect cigarettes, we compared a custom SSD MobileNet model trained on our home-made dataset with SmokingNet. Next, to discriminate between children and adults, we used a pretrained Cascade classifier as a face detector and a CNN model trained on the UTKFace dataset as an age classifier. The age classifier is only initialized when the mobile robot comes across a cigarette and the Cascade classifier detects faces. This approach improved the accuracy of distinguishing children from adults. In the literature, researchers have focused on achieving high performance of the cigarette detector; it can be shown that these models require a significant amount of computational time. The models chosen here were structured to be mobile friendly and were successfully deployed on a Raspberry Pi. In the first cigarette detection model, one of the main drawbacks was the performance of the detector on really small objects. This may also call for additional data augmentation on the dataset.</p>
        <p>The limited performance of the chosen SSD MobileNet led us to choose SmokingNet, which turns out to be much more precise. We also discovered that the Raspberry Pi's performance is insufficient for running a high-performance object detector: when running inference with TensorFlow Lite, the frame rate is only 0.55 frames per second, which is a bit slow for real-time detection. It is difficult to strike a balance between efficiency and performance. A more powerful computer, such as the Jetson Nano, would be able to run a deeper neural network, which would undoubtedly increase efficiency and performance. In this paper we have shown how, even with few computational resources available, satisfactory results can be achieved for important and large-scale problems such as second-hand smoking towards teenagers and children.</p>
        <p>In conclusion, from the answers to the questions, highlighted by the chart (Figure 11, a column chart showing the percentage of people's responses to the robot), it emerges that the use of the robot can be a good "facilitator" to communicate messages inviting people not to smoke in the presence of children. Probably this role of facilitator is made possible by the "sympathy" that the robot can arouse in the majority of the people involved in this study. In fact, as regards the questions related to how the robot was perceived, there is a very high percentage of positive answers. There is also a significant percentage of affirmative answers on the robot's ability to persuade people not to smoke in the presence of children. However, it would be necessary to consider that these latter responses could be conditioned by the sympathy aroused by the robot, rather than by a sincere intent to change one's lifestyle. Moreover, when asked about the effectiveness of the robot in convincing people to quit smoking in general, there was a considerable number of negative responses, which leads us to think that cigarette addiction is very strong and certainly requires further strategies to convince people to change their lifestyle in a stable and radical way. It would also be interesting to repeat the experiment with a control group and a follow-up interview at least six months apart.</p>
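        <p>The 0.55 frames-per-second figure can be obtained with a simple timing loop like the sketch below. This is a generic measurement idiom, not the deployed code; `run_inference` is a hypothetical stand-in for the TensorFlow Lite interpreter invocation.</p>
        <preformat><![CDATA[
```python
import time

# Sketch: average frames-per-second over a fixed number of inference calls.

def measure_fps(run_inference, frames=20):
    start = time.perf_counter()
    for _ in range(frames):
        run_inference()
    elapsed = time.perf_counter() - start
    return frames / elapsed

if __name__ == "__main__":
    # Dummy 10 ms "model" standing in for the real interpreter call.
    fps = measure_fps(lambda: time.sleep(0.01))
    print(round(fps))
```
]]></preformat>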
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>… URL: https://doi.org/10.1007/978-1-4684-7562-3_3. doi:10.1007/978-1-4684-7562-3_3.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>… URL: https://projekter.aau.dk/projekter/files/419098381/Master_Thesis_Mathiebhan.pdf.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[5] S. Russo, C. Napoli, A comprehensive solution for psychological treatment and therapeutic path planning based on knowledge base and expertise sharing, volume 2472, 2019, p. 41-47.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[6] S. Illari, S. Russo, R. Avanzato, C. Napoli, A cloud- … follow-up of hospitalized patients, volume 2694, 2020, p. 29-35.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[7] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, Mobilenetv2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510-4520.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[8] D. Zhang, C. Jiao, S. Wang, Smoking image detection …, in: 2018 IEEE 4th International Conference on Computer and Communications (ICCC), IEEE, 2018, pp. 1509-1515.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[9] Raspberry, 2022. URL: https://www.raspberrypi.com/for-home/.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[10] M. Woźniak, D. Połap, C. Napoli, E. Tramontana, Application of bio-inspired methods in intelligent …, Information Technology and Control 46 (2017) 150-164. doi:10.5755/j01.itc.46.1.13872.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[11] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE computer society conference on …, volume 1, IEEE, 2001, pp. I-I.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[12] J. Starczewski, S. Pabiasz, N. Vladymyrska, A. Mar- …, … maps for 3d face understanding, Lecture Notes … in Bioinformatics) 9693 (2016) 210-217. doi:10.1007/978-3-319-39384-1_19.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[13] N. Brandizzi, V. Bianco, G. Castro, S. Russo, A. Wajda, …tion recognition, volume 3092, 2021, p. 66-74.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[14] M. A. …, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL: https://www.tensorflow.org.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[15] A. Parate, M.-C. Chiu, C. Chadowitz, D. Ganesan, …, MobiSys 2014 - Proceedings of the 12th Annual International Conference on Mobile Systems, Applications, and Services 2014 (2014). doi:10.1145/2594368.2594379.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[16] Q. Tang, D. Vidrine, E. Crowder, S. Intille, Auto- … accelerometers, ICST, 2014. doi:10.4108/icst.pervasivehealth.2014.254978.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[17] Identification of cigarette litter with the use of outdoor mobile robots (2021). URL: …</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[18] A. Bochkovskiy, C. Wang, H. M. Liao, Yolov4: Optimal speed and accuracy of object detection, abs/2004.10934 (2020). URL: https://arxiv.org/abs/2004.10934. arXiv:2004.10934.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[19] Deep learning based driver smoking behavior detection for driving safety (2020). URL: http://www.joig.net/uploadfile/2020/0318/20200318051129839.pdf.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[20] Roboflow, https://roboflow.com, 2022. URL: https://roboflow.com.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[21] Zhang Zhifei, S. Y., Q. Hairong, Age progression/regression by conditional adversarial autoencoder, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, Ssd: Single shot multibox detector, in: European conference on computer vision, Springer, 2016, pp. 21-37.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[23] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, … detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117-2125.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[24] Colab, https://colab.research.google.com, 2022.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>