A Deep Learning-based Approach to Model Museum Visitors

Alessio Ferrato^a, Carla Limongelli^a, Mauro Mezzini^b and Giuseppe Sansonetti^a

^a Department of Engineering, Roma Tre University, Via della Vasca Navale 79, 00146 Rome, Italy
^b Department of Education, Roma Tre University, Viale del Castro Pretorio 20, 00185 Rome, Italy

Abstract
Although ubiquitous and fast access to the Internet allows us to admire objects and artworks exhibited worldwide from the comfort of our home, visiting a museum or an exhibition remains an essential experience today. Current technologies can help make that experience even more satisfying. For instance, they can assist the user during the visit, personalizing her experience by suggesting the artworks of her highest interest and providing her with related textual and multimedia content. To this aim, it is necessary to automatically acquire information relating to the active user. In this paper, we show how a deep neural network-based approach can allow us to obtain accurate information for understanding the behavior of a visitor alone or in a group. This information can also be used to identify users similar to the active one, so as to suggest not only personalized itineraries but also possible visiting companions, thus promoting the museum as a vehicle for social and cultural inclusion.

Keywords
User interfaces, Computer vision, Deep Learning, Museum visitors

1. Introduction and Background

Recent technological advances have made it possible to significantly improve the experience of citizens when they use public services [1, 2] or when they enjoy points of their interest [3, 4] and itineraries among them [5, 6]. Among the different possible points of interest, there are also museums and exhibits. The first studies concerning the observation and analysis of museum visitor behavior date back to the first half of the twentieth century [7, 8, 9]. Since then, studies based on the analysis of visitor tracking have multiplied, namely, on the detailed recording of “not only where visitors go but also what visitors do while inside an exhibition” [10]. In early studies on the subject, the most common method for recording visitor behavior was the paper-and-pencil one. Although this method is simple and low-cost, several aspects limit its validity: the lack of temporal information, which is more complicated for the observer to collect; the need to transfer the data collected on paper to a database; and the inability to accurately determine the visitor’s real engagement. Fortunately, recent technological advances in Machine Learning [11] have made several approaches for automatic visitor tracking available to researchers [12]. In the research literature, there are several contributions on the technologies adopted for the timing and tracking of museum visitors, highlighting the strengths and weaknesses of each of them (e.g., see [13, 14, 15, 12, 16, 17, 18, 19, 20, 21]).

In [22], we proposed an approach based on Computer Vision technologies [23] that can address some of the critical issues shown by the other visitor localization technologies in the museum environment. It takes advantage of image detection and classification techniques based on convolutional neural networks (CNNs), which are capable of providing excellent performance in terms of accuracy [24]. This approach makes use of off-the-shelf cameras and badges such as those provided free of charge to attendees by event and conference organizers [25]. Therefore, the overall cost of the entire instrumentation is reduced, which is certainly a significant advantage over other state-of-the-art technologies. More specifically, the system relies on the Faster Region-based Convolutional Neural Network (Faster R-CNN) model [26]. The architecture of the proposed system is shown in Figure 1.
Figure 1: The architecture of the proposed system relies on a Faster Region-based Convolutional Neural Network (Faster R-CNN).

It can be divided into three major parts:

1. A CNN backbone (composed of a ResNet [27] and a Feature Pyramid Network (FPN) [28]) that receives an image as input and outputs a conv feature map;
2. A Region Proposal Network (RPN) that takes as input the conv feature map output by the backbone and returns a set of rectangular boxes, each of which is associated with a score giving the likelihood that the region contains an object or simple background;
3. A detection system that, given the set of regions proposed by the RPN, uses the conv feature map of the backbone to determine, for each distinct class, a score ∈ [0, 1]. This value represents the likelihood that the object belongs to the corresponding class. The detection system also estimates, by regression, the bounding box coordinates x, y, w, h of the object proposed by the RPN.

In this paper, we show how the gathered information can be used to model museum visitors and identify their nearest neighbors. The aim is to promote the museum as a vehicle to foster the social and cultural inclusion of its visitors.

2. Database with Collected Data

In order to make the analysis of the collected data easy and, at the same time, effective, we propose the following database implementation and give some sample queries that cover the most basic and useful needs of a museum staff member who wants to extract information about visitor behavior from the database. The data collected through the proposed system can be stored in a data structure that supports spatial and temporal analyses of visitor behavior [29]. Let us suppose, for example, that we have m cameras and n badges. Each camera detects, at a generic timestamp, a badge at certain coordinates from the camera. We can store all those detections in a database composed of two tables. The first table, called positions, has attributes (TIMESTMP, CAMERA_ID, BADGE_ID, X, Y, Z), and the second table, called camera, has attributes (CAMERA_ID, CT, X, Y, Z). A single tuple (t, c_id, b_id, x, y, z) of positions represents a detection at timestamp t from the camera c_id of the badge b_id at coordinates x, y, z with respect to camera c_id. A single tuple (c_id, ct, x, y, z) of camera records the coordinates x, y, z of the camera c_id with respect to the museum. The value ct is the time period of a frame: if f is the frame rate of the camera, then ct = 1/f. For the sake of simplicity, hereafter we suppose that ct assumes the same value for all cameras (i.e., 1/24 s), but the whole discussion can be extended, with simple and minimal modifications, to the general case in which cameras have different frame rates. We note that, whilst the table positions is fed by the detections of the model, the table camera is determined and created in advance by the system supervisor. First of all, it can be convenient to create the view dist_positions using the SQL Query 1.

Query 1: View creation
CREATE VIEW dist_positions AS
SELECT DISTINCT P.TIMESTMP AS TIMESTMP,
       P.BADGE_ID AS BADGE_ID, C.CT AS CT,
       /* Changing the reference system */
       P.X + C.X AS X,
       P.Y + C.Y AS Y,
       P.Z + C.Z AS Z
FROM positions P, camera C
WHERE P.CAMERA_ID = C.CAMERA_ID

Furthermore, we add another table artwork with attributes (ID, AX, AY, AZ, AW, AH). A tuple (id, x, y, z, w, h) of table artwork records the id of the artwork, the upper right corner coordinates x, y, z (with respect to the museum), and the height h and width w of a rectangular box in front of the artwork id.
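For concreteness, the three tables described above could be declared as follows. This is only a minimal sketch: the paper does not fix column types or keys, so the ones below are our assumptions.

CREATE TABLE positions (
    TIMESTMP  INTEGER,       -- detection timestamp
    CAMERA_ID INTEGER,       -- camera that produced the detection
    BADGE_ID  INTEGER,       -- detected badge
    X REAL, Y REAL, Z REAL   -- coordinates with respect to the camera
);

CREATE TABLE camera (
    CAMERA_ID INTEGER,
    CT REAL,                 -- frame period: 1/f for a frame rate f
    X REAL, Y REAL, Z REAL   -- camera coordinates with respect to the museum
);

CREATE TABLE artwork (
    ID INTEGER,
    AX REAL, AY REAL, AZ REAL,   -- upper right corner of the box in front of the artwork
    AW REAL, AH REAL             -- width and height of that box
);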
Using this simple database, we may easily and effectively retrieve the data described in the previous section. In particular, we may obtain the distances between all pairs of distinct visitors (identified by a badge ID) and, at the same time, choose only those pairs whose distance is less than a prefixed threshold α. In fact, we may first create the SQL view visit_times (Query 2), which computes, for each artwork, the total time a badge has spent near that artwork, for all badge ids and for the time interval between t_0 and t_1. In the following, for the sake of simplicity, we assume that every badge has been detected in front of every possible artwork. Under this assumption, the view visit_times returns at least one value for every badge and every artwork.

Query 2: Badge-Artwork visit time
CREATE VIEW visit_times AS
SELECT p.BADGE_ID AS BADGE_ID, a.ID AS A_ID,
       SUM(p.CT) AS V_TIMES
FROM dist_positions p, artwork a
WHERE p.TIMESTMP BETWEEN t_0 AND t_1 AND
      p.X BETWEEN a.AX AND a.AX + a.AW AND
      p.Y BETWEEN a.AY AND a.AY + a.AH AND
      p.Z BETWEEN a.AZ AND a.AZ + 2.70
GROUP BY p.BADGE_ID, a.ID

Using this view, we can compute the total time a single badge spent in front of all the artworks (i.e., total_visit_times in Query 3).

Query 3: Badge total visit time
CREATE VIEW total_visit_times AS
SELECT BADGE_ID, SUM(V_TIMES) AS TOTAL_TIMES
FROM visit_times
GROUP BY BADGE_ID

Then, using the two previous views, namely visit_times and total_visit_times, we can compute the percentage of time spent by any badge in front of any artwork (i.e., visit_times_perc in Query 4).

Query 4: Badge-Artwork percentage time
CREATE VIEW visit_times_perc AS
SELECT v.BADGE_ID AS BADGE_ID, A_ID,
       V_TIMES*100/TOTAL_TIMES AS V_TIMES_PERC
FROM visit_times v, total_visit_times t
WHERE v.BADGE_ID = t.BADGE_ID
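As a quick usage example of the view defined in Query 4, a staff member could rank the artworks by the share of time a given visitor spent in front of them; the badge id 42 below is a made-up value:

SELECT A_ID, V_TIMES_PERC
FROM visit_times_perc
WHERE BADGE_ID = 42
ORDER BY V_TIMES_PERC DESC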
Through the view visit_times_perc, every badge i is thus associated with a vector um_i = (t_1^i, …, t_N^i) collecting the percentages of time it spent in front of each of the N artworks. The distance between two such user models is defined by the following measure:

\delta(um_i, um_j) = \frac{|t_1^i - t_1^j| + \cdots + |t_N^i - t_N^j|}{200} = \frac{1}{200} \sum_{h=1}^{N} |t_h^i - t_h^j| \qquad (1)

In this way, the measure is a real number in [0, 1]. The definition given in (1) is a measure, as it enjoys the properties of positiveness, minimality, and symmetry.

Positiveness. Formula 1 is a sum of absolute, hence non-negative, values; consequently, this property is fulfilled.

Minimality. When two ums coincide, their distance has to be 0, which means that the times spent in front of each artwork are the same (t_h^i = t_h^j, ∀h = 1, …, N) and the overall sum is 0. On the contrary, if two ums are completely different, the two visitors have seen entirely different artworks: whenever t_h^i > 0 then t_h^j = 0, and vice versa. In this case, the sum of the absolute differences equals the sum of all the entries of the two percentage vectors, that is 200, so the measure is equal to 1.

Symmetry. It is given by the absolute values in Formula 1: since |t_h^i − t_h^j| = |t_h^j − t_h^i|, we have δ(um_i, um_j) = δ(um_j, um_i).

Finally, the distance defined in Formula 1 between all pairs of distinct badges can be computed directly from the view visit_times_perc (Query 5).

Query 5: Badge-Badge distances
SELECT v1.BADGE_ID, v2.BADGE_ID,
       SUM(ABS(v1.V_TIMES_PERC - v2.V_TIMES_PERC))/200 AS DIST
FROM visit_times_perc v1, visit_times_perc v2
WHERE v1.A_ID = v2.A_ID AND
      v1.BADGE_ID < v2.BADGE_ID
GROUP BY v1.BADGE_ID, v2.BADGE_ID
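To see Formula 1 and Query 5 at work on concrete numbers, consider a toy museum with N = 3 artworks and two purely illustrative percentage vectors:

\delta\big((70, 20, 10),\ (20, 30, 50)\big) = \frac{|70-20| + |20-30| + |10-50|}{200} = \frac{50 + 10 + 40}{200} = 0.5

Two identical vectors yield 0, while two completely disjoint ones, e.g., (100, 0, 0) and (0, 0, 100), yield (100 + 0 + 100)/200 = 1, matching the minimality discussion above.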
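Finally, to retain only the pairs of visitors whose distance is below the prefixed threshold α mentioned above, a HAVING clause can be appended to Query 5. The following is a minimal sketch, in which the value α = 0.2 is an arbitrary example:

SELECT v1.BADGE_ID, v2.BADGE_ID,
       SUM(ABS(v1.V_TIMES_PERC - v2.V_TIMES_PERC))/200 AS DIST
FROM visit_times_perc v1, visit_times_perc v2
WHERE v1.A_ID = v2.A_ID AND
      v1.BADGE_ID < v2.BADGE_ID
GROUP BY v1.BADGE_ID, v2.BADGE_ID
HAVING SUM(ABS(v1.V_TIMES_PERC - v2.V_TIMES_PERC))/200 < 0.2

The resulting pairs identify visitors with similar behavior, i.e., candidate nearest neighbors for the personalized suggestions discussed in the introduction.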