=Paper=
{{Paper
|id=Vol-2687/paper9
|storemode=property
|title=Pepper4Museum: Towards a Human-like Museum Guide
|pdfUrl=https://ceur-ws.org/Vol-2687/paper9.pdf
|volume=Vol-2687
|authors=Giovanna Castellano,Berardina De Carolis,Nicola Macchiarulo,Gennaro Vessio
|dblpUrl=https://dblp.org/rec/conf/avi/CastellanoCMV20
}}
==Pepper4Museum: Towards a Human-like Museum Guide==
Giovanna Castellano, Berardina De Carolis, Nicola Macchiarulo, Gennaro Vessio

Department of Computer Science, University of Bari Aldo Moro, Bari, Italy

giovanna.castellano@uniba.it, berardina.decarolis@uniba.it, nicola.macchiarulo@uniba.it, gennaro.vessio@uniba.it

ABSTRACT

With the recent advances in technology, new ways to engage visitors in a museum have been proposed. Relevant examples range from the simple use of mobile apps and interactive displays to virtual and augmented reality settings. Recently, social robots have been used as a solution to engage visitors in museum tours, thanks to their ability to interact with humans in a natural and familiar way. In this paper, we present our preliminary work on the use of a social robot, Pepper in this case, as an innovative approach to engaging people during museum visiting tours. To this aim, we endowed Pepper with a vision module that allows it to perceive the visitor and the artwork they are looking at, as well as to estimate their age and gender. These data are used to provide the visitor with recommendations about artworks they might like to see during the visit. We tested the proposed approach in our research lab, and preliminary experiments show its feasibility.

CCS CONCEPTS

• Computing methodologies → Visual content-based indexing and retrieval; • Applied computing → Fine arts; • Human-centered computing → Interactive systems and tools.

KEYWORDS

Museum visit; Human-Robot Interaction; Computer Vision.

ACM Reference Format:
Giovanna Castellano, Berardina De Carolis, Nicola Macchiarulo, and Gennaro Vessio. 2020. Pepper4Museum: Towards a Human-like Museum Guide. In Proceedings of AVI2CH 2020: Workshop on Advanced Visual Interfaces and Interactions in Cultural Heritage (AVI2CH 2020). ACM, New York, NY, USA, 5 pages.

AVI2CH 2020, September 29, Island of Ischia, Italy. © Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

In the last few years, due to technological improvements and drastically declining costs, many innovative Information and Communication Technology (ICT) solutions have been applied to the cultural heritage domain, with the aim of making art more accessible and engaging to a wider population [13]. One of the most promising ICT applications in this domain concerns the provision of personalized services, in which the visitor's specific characteristics and preferences are taken into account [5], and the recommendation of personalized museum visiting paths [18]. Monitoring what visitors are looking at, what they like or dislike, and so on, can be used to personalize and enhance their experience during the visit [31]. Successful applications range from the simple use of mobile apps and interactive displays to virtual and augmented reality settings.

For instance, in [3] an indoor location-aware architecture is proposed, which relies on a wearable device to automatically provide the user with cultural content related to the observed artwork. A similar approach was followed in [34], where Zhang et al. proposed the use of a wearable camera equipped with image processing capabilities to solve the task of artwork identification within a museum. A remarkable contribution to the topic of personalized visit experiences was provided in [5], where Bartolini et al. proposed a context-aware recommendation system aimed at supporting intelligent multimedia services for the users. In [1] and [16], the authors explored the possibility of harnessing electroencephalography (EEG) signals captured by off-the-shelf, low-cost EEG headsets to understand whether an artwork is of interest to a visitor. More recently, Cardoso et al. [7] proposed the use of a mobile application to set museum itineraries in which visitors can move at their own pace and, at the same time, have all the complementary information they need about points of interest, adapted to their needs.
A recent solution to engage visitors in museum tours is to use social robots. Social robots are embodied, autonomous agents that communicate and interact with humans on a social and emotional level. They represent an emerging field of research focused on developing a "social intelligence" that aims to maintain the illusion of dealing with a human being [4]. Thanks to their ability to interact with humans in a natural and familiar way, social robots are spreading more and more into human life, not only for entertainment but also to assist users in their activities of daily living, or in teaching and educational settings. In particular, they can provide novel, interactive social interfaces in cultural and tourism services [33], thus improving the overall experience of the user [28].

Historical examples of museum tour guide robots include RHINO [6] and Minerva [32]. RHINO integrates low-level probabilistic reasoning and high-level problem solving, embedded in first-order logic, to navigate at high speed through dense crowds while reliably avoiding collisions with obstacles. Differently from RHINO, Minerva learns the map from sensor data and presents an improved system for interacting with users; to do this, it adopts a "pervasively probabilistic" approach, which relies on explicit representations of uncertainty in perception and control. More recently, in [17], Germak et al. developed a telepresence robot designed as a tool to explore inaccessible areas of a cultural site. In [2], the humanoid robot Pepper was used as a tour guide in a museum, equipped with several modules useful for accompanying visitors and interacting with them. Suddrey et al. [30] recently showed that Pepper's basic functionalities can be improved to enable the robot to provide autonomous and interactive tours.

In this context, recent advances in Computer Vision allow researchers to endow robots with novel and powerful capabilities. In this paper, we present our preliminary work on the use of a social robot, Pepper in this case, as a museum tour guide. In particular, we present a vision-based approach for supporting people during a museum visit. The vision module allows Pepper to perceive the presence of a visitor, localize them in the space, and estimate their age and gender. Moreover, a visual link retrieval module gives Pepper the ability to use the image of the painting observed by the visitor as a visual query to search for visually similar paintings in the museum database. The robot uses these data, together with other information acquired during the dialog, to provide the visitor with recommendations about similar artworks they might like to see in the museum. We tested the proposed approach in our research lab, and preliminary experiments show its feasibility.

2 PEPPER4MUSEUM

Designing the behaviors of a social robot acting as a museum guide requires endowing it with different capabilities that provide visitors with an engaging and effective experience during the visit. These capabilities are meant to allow the robot to detect and localize people in the museum, recognize the artworks the visitor is looking at, profile the user during the visit so as to generate suitable recommendations, and finally engage people in the interaction using suitable conversational skills. This is the final aim of the Pepper4Museum project (Fig. 1), which exploits the combination of Computer Vision and Social Robotics.

Figure 1: Overview of the system components.

As robot platform we use Pepper, a semi-humanoid robot developed by SoftBank Robotics. It is an omnidirectional wheeled humanoid robot equipped with several cameras and sensors. In the following, the main modules we are developing for museum visit assistance are briefly described.

2.1 Museum Mapping and Localization

Simultaneous Localization And Mapping (SLAM) is the problem of constructing and updating a map of an unknown environment while keeping track of the robot's location within it. As building maps is one of the fundamental tasks of mobile robots, many researchers have focused on this problem. SLAM algorithms mainly use optical sensor data to reconstruct the map of the environment and determine the orientation and position of the robot. There are two common approaches to SLAM: Visual SLAM, based on data captured from RGB or RGB-D cameras, and LiDAR (Light Detection and Ranging) SLAM, based on data captured from laser sensors. The approach based on LiDAR is typically faster and more accurate than Visual SLAM [15]. As far as Pepper's mapping and navigation capabilities are concerned, the robot is able, using the NAOqi API, to: i) map the environment; and ii) localize itself and navigate inside the mapped environment in accordance with the SLAM approach. To this aim, Pepper uses its odometry and laser sensors.

Thus, as a first step, we developed a module that allows Pepper to map the museum space by moving around autonomously. Once the mapping has been completed, the resulting map is stored as a 2D image (see Fig. 2a). Successively, the map is annotated with points of interest close to the artworks' positions, and each point is tagged with the artwork ID. This ID is then used to retrieve information about the corresponding artwork (i.e., author, description, image, tags).
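To make this step concrete, here is a minimal sketch of how exploration, map saving, and in-map navigation can be driven from Python. The method names follow the NAOqi ALNavigation API; the robot address and the artwork annotation are illustrative assumptions, not details taken from the paper.

<syntaxhighlight lang="python">
import qi

# Connect to the robot; the IP address is a placeholder.
session = qi.Session()
session.connect("tcp://192.168.1.10:9559")
navigation = session.service("ALNavigation")

# i) Map the environment: explore within a given radius (in meters),
# save the result (NAOqi stores a .explo file on the robot), then stop.
navigation.explore(10.0)
map_path = navigation.saveExploration()
navigation.stopExploration()

# ii) Localize and navigate: reload the map, relocalize at a known pose,
# and move to an annotated point of interest given as [x, y, theta].
navigation.loadExploration(map_path)
navigation.relocalizeInMap([0.0, 0.0, 0.0])

# Hypothetical annotation: artwork ID -> pose close to the artwork.
points_of_interest = {"artwork_042": [2.5, 1.0, 1.57]}
navigation.navigateToInMap(points_of_interest["artwork_042"])
</syntaxhighlight>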
Besides annotating the points of interest in the space, Pepper has to detect and localize visitors in the mapped space. This is done with a deep neural network for object detection. Specifically, we used SSD MobileNetV2 [27], a state-of-the-art deep learning model pre-trained on the MS COCO dataset [22], which is able to detect 80 different object classes, including people. Given the frames captured by Pepper's camera as input, SSD MobileNetV2 returns a bounding box around each detected person, as shown in Fig. 2b.

We fused the information about a person's bounding box in the image with the data captured by the robot's depth camera in order to compute the coordinates of the visitor on the map previously created with the SLAM algorithm. To determine whether a visitor is close to an artwork, we compute the Euclidean distance between the person's point and each artwork's point. If the distance is below a threshold, Pepper approaches the visitor.
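The proximity test reduces to a nearest-artwork search over the annotated map. The following sketch shows the idea under stated assumptions: the detector output format, the depth-to-map projection helper, and the 1.5 m threshold are all hypothetical, as the paper does not report these details.

<syntaxhighlight lang="python">
import numpy as np

APPROACH_THRESHOLD_M = 1.5  # assumed threshold; not specified in the paper

def person_map_position(bbox, depth_frame, camera_to_map):
    """Project the center of a person's bounding box (x1, y1, x2, y2) into
    map coordinates using the depth reading at that pixel. `camera_to_map`
    is a hypothetical calibration helper for the camera-to-map transform."""
    u = (bbox[0] + bbox[2]) // 2
    v = (bbox[1] + bbox[3]) // 2
    return camera_to_map(u, v, depth_frame[v, u])

def closest_artwork(person_xy, artworks):
    """Return the ID of the nearest annotated artwork if its Euclidean
    distance from the visitor is below the threshold, else None."""
    best_id, best_dist = None, np.inf
    for art_id, art_xy in artworks.items():
        dist = np.linalg.norm(np.asarray(person_xy) - np.asarray(art_xy))
        if dist < best_dist:
            best_id, best_dist = art_id, dist
    return best_id if best_dist < APPROACH_THRESHOLD_M else None
</syntaxhighlight>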
2.2 Age and Gender Estimation

In order to start gathering information about the target user, a soft biometric module is used [11]. The soft biometric module allows Pepper to automatically infer the age and gender of the user who is interacting with it. The algorithm follows the approach described in [26], which relies on a fine-tuned version of the state-of-the-art VGG16 deep convolutional neural network [29], trained on an unconstrained image dataset. The capability of deep neural networks to solve complex perceptual tasks has been shown in several recent works (e.g., [9, 21]). Our approach showed good performance: gender recognition reached an accuracy of 85%, while age estimation reached an accuracy (±1 year) of 84% on the previously mentioned dataset. Soft biometric traits can be used for two main purposes: i) improving the performance of the recommender module by filtering recommendations accordingly; and ii) adapting the robot's dialogue to the person it is interacting with. In our museum scenario, we used an approach similar to the one described in [14], in which the robot uses a different level of formality in its dialogue, based on the age and gender of the person being tracked (Fig. 2c).

Figure 2: (a) An example of map generated after the exploration of a museum space. (b) People detection. (c) An example of interaction between a young woman and Pepper with real-time gender and age estimation.
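For illustration, here is a minimal sketch of the inference step in the style of the expectation-based approach of [26], where age is estimated as the expected value over per-year softmax outputs. The two fine-tuned VGG16 models are represented by generic callables; this interface is an assumption, not the authors' actual code.

<syntaxhighlight lang="python">
import numpy as np

def expected_age(age_probs):
    """Expectation-based age estimate as in [26]: the softmax output has
    one class per year of age, and the estimate is its expected value."""
    return float(np.dot(age_probs, np.arange(len(age_probs))))

def estimate_soft_biometrics(face_crop, age_model, gender_model):
    """`age_model` and `gender_model` stand in for the fine-tuned VGG16
    networks; here they are any callables returning softmax probabilities."""
    age_probs = age_model(face_crop)        # e.g., shape (101,) for ages 0..100
    gender_probs = gender_model(face_crop)  # shape (2,): [female, male]
    gender = "female" if gender_probs[0] >= gender_probs[1] else "male"
    return expected_age(age_probs), gender
</syntaxhighlight>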
In the first case, the input 2.3 User Profiling and Recommender Module is processed by the automated speech recognition module that is Understanding the user preferences may enable a social robot to already encoded in the programming environment of the robot. In adapt its behavior accordingly, hence enhancing the user satisfac- the second case, the user can interact with the tablet of the robot, tion during the interaction [25]. Usually this process is based on which shows the available choices. It is worth stating that the touch explicit feedback, e.g. specific question answering or explicit rating screen is needed, since the overall quality of the speech recognition of items. However, this approach, albeit more precise and reliable, module encoded in Pepper is typically low. At the current stage of is time consuming and requires an effort by the user. implementation of the system, five families of intents are captured: Recent trends use approaches based on implicit feedback that greet, small_talk, current_painting information, suggestions, and tour. can be inferred by observing and analyzing user’s behavior without Clearly, each intent invokes a different, more or less complex, be- interrupting the user engagement in the interaction. In the museum havior as an answer. In the proactive case, a rule-based system, in visit context, we decided to exploit a hybrid approach that combines which the current state of the perception is periodically matched observations of the user behavior with explicit questions asked by with the preconditions of rules, has been adopted. Then, the behav- the robot during the interaction. Then, thanks to the soft biomet- ior associated to the selected rule is executed. If no rule is selected, ric analysis, information about user gender and age is used as a the robot executes the idle behavior in which it moves a bit around, feature for triggering an initial stereotypical model for the visitor randomly displaying on its tablet artworks present in the museum [24]. Moreover, these data can be used to tailor the dialog and the exhibition with the invitation to ask about them. Each behavior information presented to the visitor (i.e., descriptions provided to a may require the fulfillment of a service execution in the cloud of child will be different from those provided to an adult). In addition, Pepper4Museum as in the case of the recommendation generation. these data, together with information about what is of interest for the visitor (inferred by observing what the visitor is looking at and 3 PRELIMINARY STUDY by answering to specific questions during the visit), can be used to As a proof of concept, we preliminarly investigated the effectiveness trigger the recommendation about what to see next in the museum. of the different modules embedded in Pepper4Museum. This phase represents an active process of feedback and preference About the gender and age estimation, the soft biometric module acquisition that allows the robot to acquire new information that in the wild was able to recognize with 87.5% accuracy the gender of can be used for refining subsequent recommendations. the visitors and with 62.5% accuracy the age of the visitors (more details in [8, 14]). The classification of the age was lower than we 2.4 Visual Link Retrieval expected because the interaction in the wild did not guarantee a The proposed module for link retrieval assumes that the robot has static and frontal position of the user with respect to the camera. 
2.5 Behavior Manager

A behavior is a program that combines and coordinates utterances, gestures, expressions, touch-screen interactive elements, and locomotion, based on the robot's current perceptions. In the context of museum visiting, the behavior manager module can trigger a particular behavior according to two approaches: reactive and proactive.

In the reactive case, the triggered behavior is an answer to the recognized user intent. In particular, user input can be provided via voice or through a touch screen. In the first case, the input is processed by the automated speech recognition module already encoded in the programming environment of the robot. In the second case, the user can interact with the robot's tablet, which shows the available choices. It is worth noting that the touch screen is needed, since the overall quality of the speech recognition module encoded in Pepper is typically low. At the current stage of implementation of the system, five families of intents are captured: greet, small_talk, current_painting information, suggestions, and tour. Clearly, each intent invokes a different, more or less complex, behavior as an answer.

In the proactive case, a rule-based system has been adopted, in which the current state of the perception is periodically matched against the preconditions of rules. Then, the behavior associated with the selected rule is executed. If no rule is selected, the robot executes an idle behavior in which it moves around a bit, randomly displaying on its tablet artworks present in the museum exhibition, with an invitation to ask about them. Each behavior may require the execution of a service in the Pepper4Museum cloud, as in the case of recommendation generation.
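As a sketch of the proactive loop only (the rules below and the perception-state keys are illustrative assumptions, not the project's actual rule base), the precondition matching and the idle fallback can be organized as follows.

<syntaxhighlight lang="python">
# Each rule pairs a precondition over the perception state with a behavior.
RULES = [
    (lambda s: s.get("visitor_near_artwork") and not s.get("engaged"),
     "approach_and_greet"),
    (lambda s: s.get("engaged") and s.get("idle_seconds", 0) > 30,
     "suggest_next_artwork"),
]

IDLE_BEHAVIOR = "wander_and_show_random_artworks"

def select_behavior(state):
    """Return the behavior of the first rule whose precondition holds on
    the current perception state, falling back to the idle behavior."""
    for precondition, behavior in RULES:
        if precondition(state):
            return behavior
    return IDLE_BEHAVIOR

# One tick of the proactive loop:
print(select_behavior({"visitor_near_artwork": True, "engaged": False}))
# -> approach_and_greet
</syntaxhighlight>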
3 PRELIMINARY STUDY

As a proof of concept, we preliminarily investigated the effectiveness of the different modules embedded in Pepper4Museum.

Regarding gender and age estimation, the soft biometric module in the wild was able to recognize the gender of visitors with 87.5% accuracy and their age with 62.5% accuracy (more details in [8, 14]). The classification accuracy for age was lower than expected because the interaction in the wild did not guarantee a static and frontal position of the user with respect to the camera. The variation of lighting conditions also influenced the analysis of the face for age estimation.

Then, we tested the visual link retrieval method on a database collecting paintings by 50 very popular painters. We used data provided by the Kaggle platform,¹ scraped from an art challenge website.² The artists belong to very different epochs and painting schools, ranging from Giotto di Bondone and Renaissance painters such as Leonardo da Vinci and Michelangelo, to Modern Art exponents, including Pablo Picasso, Salvador Dalí, and so on. Once the reduced features representing the paintings were obtained, we applied the Nearest Neighbor matching mechanism to derive, for each query image, the top-k matching images (k = 3 in this case). To give an illustrative example of the behavior of our system, Fig. 3 shows three sample image queries together with the corresponding top visually linked artworks retrieved by the system. For each query, a brief description of the results is given below:

Q1 The first image query is the Romanticist "Fort Vimieux" by William Turner, depicting one of the author's classic red sunsets. It can be seen that the system was able to retrieve paintings similar both in content and in color distribution.

Q2 The second query is the Impressionist "Confluence of the Seine and the Loing" by Alfred Sisley. It can be noticed that the three neighbors, i.e., two artworks by Camille Pissarro and one by Claude Monet, share the same painting style, characterized by the typical color vibration.

Q3 Finally, we considered as query a version of the "Sunflowers" series by Vincent van Gogh. As expected, the top-3 images retrieved by the system represent still lifes, two of them by Renoir, the other one by Édouard Manet.

Figure 3: Sample artwork queries and corresponding visually linked paintings provided by the system.

¹ https://www.kaggle.com/ikarus777/best-artworks-of-all-time
² http://artchallenge.ru

Based on a qualitative evaluation of the retrieval results, we can conclude that, overall, the proposed system is able to find visual links that are consistent with human perception. The visual links discovered by the system are sufficiently justifiable by a human observer and in most cases resemble the intrinsic criteria humans adopt to link visual arts. These criteria combine visual elements, such as colors and shapes, with conceptual elements, such as the subject matter and meaning of the painted scene.

The mapping and localization process could not be tested in a real museum due to the COVID-19 emergency. We tested this module in the "Museum of History of Computers" located in our department, and we observed that its performance was overall acceptable. Some delay was registered when Pepper found an unexpected obstacle on its planned path (e.g., people crossing). The other modules need to be tested in the wild as soon as possible.

4 CONCLUSION AND FUTURE WORK

In this paper, we have presented our preliminary work towards the development of Pepper4Museum, a human-like museum guide. Promising results have been obtained in our research lab. As future work, we plan to test and refine all the behaviors we implemented in this domain. Then, in order to run an experiment in a real museum context, a test of the integration of the described components is needed. A test in a museum will allow for measuring the visitor experience and evaluating the impact of this technology in this context. Finally, it is worth remarking that the data collected by Pepper represent a valuable source of information that can be profitably used to better understand and predict visitors' behavior [18, 19, 23]. Such an analysis could be carried out by means of graph theory, e.g. [20], or process mining techniques [12].

ACKNOWLEDGMENTS

Funding for this work was partially provided by Fondazione Puglia, which supported the Italian project "Programmazione Avanzata di Robot Sociali Intelligenti". Gennaro Vessio acknowledges funding support from the Italian Ministry of Education, University and Research through the PON AIM 1852414 project.

REFERENCES

[1] F. Abbattista, V. Carofiglio, and B. De Carolis. 2018. BrainArt: A BCI-based Assessment of User's Interests in a Museum Visit. In AVI*CH.
[2] D. Allegra, F. Alessandro, C. Santoro, and F. Stanco. 2018. Experiences in Using the Pepper Robotic Platform for Museum Assistance Applications. In 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 1033–1037.
[3] S. Alletto et al. 2015. An indoor location-aware system for an IoT-based smart museum. IEEE Internet of Things Journal 3, 2 (2015), 244–253.
[4] M. Alqaderi and A. Rad. 2018. A Multi-Modal Person Recognition System for Social Robots. Applied Sciences 8, 3 (2018), 387. https://doi.org/10.3390/app8030387
[5] I. Bartolini et al. 2016. Recommending multimedia visiting paths in cultural heritage applications. Multimedia Tools and Applications 75, 7 (2016), 3813–3842.
[6] W. Burgard et al. 1998. The interactive museum tour-guide robot. In AAAI/IAAI. 11–18.
[7] P.J.S. Cardoso et al. 2019. Cultural heritage visits supported on visitors' preferences and mobile devices. Universal Access in the Information Society (2019), 1–15.
[8] B. De Carolis, N. Macchiarulo, and G. Palestra. 2018. A Comparative Study on Soft Biometric Approaches to Be Used in Retail Stores. In Foundations of Intelligent Systems - 24th International Symposium (ISMIS 2018) (Lecture Notes in Computer Science, Vol. 11177), M. Ceci, N. Japkowicz, J. Liu, G.A. Papadopoulos, and Z.W. Ras (Eds.). Springer, 120–129.
[9] G. Castellano, C. Castiello, C. Mencar, and G. Vessio. 2020. Crowd Detection for Drone Safe Landing Through Fully-Convolutional Neural Networks. In International Conference on Current Trends in Theory and Practice of Informatics. Springer, 301–312.
[10] G. Castellano and G. Vessio. 2020. Towards a Tool for Visual Link Retrieval and Knowledge Discovery in Painting Datasets. In Italian Research Conference on Digital Libraries. Springer, 105–110.
[11] A. Dantcheva, C. Velardo, A. D'Angelo, and J.-L. Dugelay. 2011. Bag of soft biometrics for person identification. Multimedia Tools and Applications 51, 2 (2011), 739–777.
[12] B. De Carolis, S. Ferilli, and D. Redavid. 2015. Incremental Learning of Daily Routines as Workflows in a Smart Home Environment. ACM Transactions on Interactive Intelligent Systems 4, 4, Article 20 (2015), 23 pages.
[13] B. De Carolis, C. Gena, T. Kuflik, and J. Lanir. 2018. Special issue on advanced interfaces for cultural heritage. International Journal of Human-Computer Studies 114 (2018).
[14] B. De Carolis, N. Macchiarulo, and G. Palestra. 2019. Soft Biometrics for Social Adaptive Robots. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer, 687–699.
[15] M. Filipenko and I. Afanasyev. 2018. Comparison of Various SLAM Systems for Mobile Robot in an Indoor Environment. In 2018 International Conference on Intelligent Systems (IS). 400–407.
[16] C. Gena, C. Mattutino, S. Pirani, and B. De Carolis. 2019. Do BCIs Detect User's Engagement? The Results of an Empirical Experiment with Emotional Artworks. In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization (UMAP'19 Adjunct). ACM, New York, NY, USA, 387–391.
[17] C. Germak, M.L. Lupetti, L. Giuliano, and M.E.K. Ng. 2015. Robots and cultural heritage: New museum experiences. Journal of Science and Technology of the Arts 7, 2 (2015), 47–57.
[18] T. Kuflik, Z. Boger, and M. Zancanaro. 2012. Analysis and Prediction of Museum Visitors' Behavioral Pattern Types. Cognitive Technologies (2012).
[19] J. Lanir, T. Kuflik, J. Sheidin, N. Yavin, K. Leiderman, and M. Segal. 2016. Visualizing museum visitors' behavior: Where do they go and what do they do there? Personal and Ubiquitous Computing (2016).
[20] E. Lella and E. Estrada. 2020. Communicability distance reveals hidden patterns of Alzheimer disease. Network Neuroscience (2020), 1–38.
[21] E. Lella and G. Vessio. 2020. Ensembling complex network 'perspectives' for mild cognitive impairment detection with artificial neural networks. Pattern Recognition Letters (2020).
[22] T.-Y. Lin et al. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.). Springer International Publishing, Cham, 740–755.
[23] C. Martella et al. 2017. Visualizing, clustering, and predicting the behavior of museum visitors. Pervasive and Mobile Computing 38 (2017), 430–443.
[24] E. Rich. 1998. User Modeling via Stereotypes. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 329–342.
[25] S. Rossi, F. Ferland, and A. Tapus. 2017. User Profiling and Behavioral Adaptation for HRI. Pattern Recognition Letters 99 (2017), 3–12.
[26] R. Rothe, R. Timofte, and L. Van Gool. 2015. DEX: Deep EXpectation of apparent age from a single image. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 10–15.
[27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4510–4520.
[28] J. Santos et al. 2018. A Personal Robot as an Improvement to the Customers' In-Store Experience. Service Robots (2018), 1.
[29] K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[30] G. Suddrey, A. Jacobson, and B. Ward. 2018. Enabling a Pepper robot to provide automated and interactive tours of a robotics laboratory. arXiv preprint arXiv:1804.03288 (2018).
[31] A. Tavčar, A. Csaba, and E.V. Butila. 2016. Recommender system for virtual assistant supported museum tours. Informatica 40, 3 (2016).
[32] S. Thrun et al. 2000. Probabilistic algorithms and the interactive museum tour-guide robot Minerva. The International Journal of Robotics Research 19, 11 (2000), 972–999.
[33] V. Tung and R. Law. 2017. The potential for tourism and hospitality experience research in human-robot interactions. International Journal of Contemporary Hospitality Management 29 (2017). https://doi.org/10.1108/IJCHM-09-2016-0520
[34] R. Zhang, Y. Tas, and P. Koniusz. 2018. Artwork identification from wearable camera images for enhancing experience of museum audiences. arXiv preprint arXiv:1806.09084 (2018).