A Deep Learning-based Approach to Model Museum Visitors

Alessio Ferrato^a, Carla Limongelli^a, Mauro Mezzini^b and Giuseppe Sansonetti^a

^a Department of Engineering, Roma Tre University, Via della Vasca Navale 79, 00146 Rome, Italy
^b Department of Education, Roma Tre University, Viale del Castro Pretorio 20, 00185 Rome, Italy

Abstract
Although ubiquitous and fast access to the Internet allows us to admire objects and artworks exhibited worldwide from the comfort of our home, visiting a museum or an exhibition remains an essential experience today. Current technologies can help make that experience even more satisfying. For instance, they can assist the user during the visit, personalizing her experience by suggesting the artworks of her highest interest and providing her with related textual and multimedia content. To this aim, it is necessary to automatically acquire information relating to the active user. In this paper, we show how a deep neural network-based approach can allow us to obtain accurate information for understanding the behavior of a visitor alone or in a group. This information can also be used to identify users similar to the active one, so as to suggest not only personalized itineraries but also possible visiting companions, thus promoting the museum as a vehicle for social and cultural inclusion.

Keywords
User interfaces, Computer vision, Deep Learning, Museum visitors

1. Introduction and Background

Recent technological advances have made it possible to significantly improve the experience of citizens when they use public services [1, 2] or when they enjoy points of their interest [3, 4] and itineraries among them [5, 6]. Among the different possible points of interest, there are also museums and exhibits. The first studies concerning the observation and analysis of museum visitor behavior date back to the first half of the twentieth century [7, 8, 9]. Since then, studies based on the analysis of visitor tracking have multiplied, namely, on the detailed recording of “not only where visitors go but also what visitors do while inside an exhibition” [10]. In early studies on the subject, the most common method for recording visitor behavior was the paper-and-pencil one. Although this method is simple and low-cost, several aspects limit its validity: the lack of temporal information, which is more complicated for the observer to collect; the need to transfer the data collected on paper to a database; and the inability to accurately determine the visitor’s real engagement. Fortunately, recent technological advances in Machine Learning [11] have made several approaches for automatic visitor tracking available to researchers [12]. In the research literature, there are several contributions on the technologies adopted for the timing and tracking of museum visitors, highlighting the strengths and weaknesses of each of them (e.g., see [13, 14, 15, 12, 16, 17, 18, 19, 20, 21]).

In [22], we proposed an approach based on Computer Vision technologies [23] that can address some of the critical issues shown by the other visitor localization technologies in the museum environment. It takes advantage of image detection and classification techniques based on convolutional neural networks (CNNs), which are capable of providing excellent performance in terms of accuracy [24]. This approach makes use of off-the-shelf cameras and badges such as those provided free of charge to attendees by event and conference organizers [25]. Therefore, the overall cost of the entire instrumentation is reduced, which is certainly a significant advantage over other state-of-the-art technologies. More specifically, the system relies on the Faster Region-based Convolutional Neural Network (Faster R-CNN) model [26]. The architecture of the proposed system is shown in Figure 1.
Figure 1: The architecture of the proposed system relies on a Faster Region-based Convolutional Neural Network (Faster R-CNN).

It can be divided into three major parts:

1. A CNN backbone (composed of a ResNet [27] and a Feature Pyramid Network (FPN) [28]) that receives an image as input and outputs a conv feature map;
2. A Region Proposal Network (RPN) that takes as input the conv feature map output by the backbone and returns a set of rectangular boxes, each of which is associated with a score giving the likelihood that the region contains an object or simple background;
3. A detection system that, given the set of regions proposed by the RPN, uses the conv feature map of the backbone to determine, for each distinct class, a score ∈ [0, 1]. This value represents the likelihood that the object belongs to the corresponding class. The detection system also estimates, by regression, the bounding box coordinates x, y, w, h of the object proposed by the RPN.

In this paper, we show how the gathered information can be used to model museum visitors and identify their nearest neighbors. The aim is to promote the museum as a vehicle to foster the social and cultural inclusion of its visitors.

2. Database with Collected Data

In order to make the analysis of the collected data easy and, at the same time, effective, we propose the following database implementation and give some sample queries that cover the most basic and useful needs of a museum staff member who wants to extract information about visitor behavior from the database. The data collected through the proposed system can be stored in a data structure that supports spatial and temporal analyses of visitor behavior [29]. Let us suppose, for example, that we have m cameras and n badges. Each camera detects, at a generic timestamp, a badge at certain coordinates from the camera. We can store all those detections in a database composed of two tables. The first table, called positions, has attributes (TIMESTMP, CAMERA_ID, BADGE_ID, X, Y, Z), and the second table, called camera, has attributes (CAMERA_ID, CT, X, Y, Z). A single tuple (t, c_id, b_id, x, y, z) of positions represents a detection at timestamp t from the camera c_id of the badge b_id at coordinates x, y, z with respect to camera c_id. A single tuple (c_id, ct, x, y, z) of camera records the coordinates x, y, z of the camera c_id with respect to the museum. The value ct is the time period of a frame: if f is the frame rate of the camera, then ct = 1/f. For the sake of simplicity, hereafter we suppose that ct assumes the same value for all cameras (i.e., 1/24 s), but the whole discussion can be extended, with simple and minimal modifications, to the general case in which cameras have different frame rates. We note that, whilst the table positions is fed by the detections of the model, the table camera is determined and created in advance by the system supervisor. First of all, it can be convenient to create the view dist_positions using the SQL Query 1.

Query 1: View creation
CREATE VIEW dist_positions AS
SELECT DISTINCT P.TIMESTMP AS TIMESTMP,
       P.BADGE_ID AS BADGE_ID, C.CT AS CT,
       /* Changing the reference system */
       P.X + C.X AS X,
       P.Y + C.Y AS Y,
       P.Z + C.Z AS Z
FROM positions P, camera C
WHERE P.CAMERA_ID = C.CAMERA_ID

Furthermore, we add another table artwork with attributes (ID, AX, AY, AZ, AW, AH). A tuple (id, x, y, z, w, h) of table artwork records the id of the artwork, the upper right corner coordinates x, y, z (with respect to the museum), and the height h and width w of a rectangular box in front of the artwork id.
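For concreteness, the three tables described above could be declared as follows. This is only a minimal sketch: the paper does not fix column types or keys, so the ones below are our assumptions.

CREATE TABLE positions (
    TIMESTMP  INTEGER,       -- detection timestamp
    CAMERA_ID INTEGER,       -- camera that produced the detection
    BADGE_ID  INTEGER,       -- detected badge
    X REAL, Y REAL, Z REAL   -- coordinates with respect to the camera
);

CREATE TABLE camera (
    CAMERA_ID INTEGER,
    CT REAL,                 -- frame period: 1/f for a frame rate f
    X REAL, Y REAL, Z REAL   -- camera coordinates with respect to the museum
);

CREATE TABLE artwork (
    ID INTEGER,
    AX REAL, AY REAL, AZ REAL,   -- upper right corner of the box in front of the artwork
    AW REAL, AH REAL             -- width and height of that box
);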
Using this simple database, we may easily and effectively retrieve the data described in the previous section. In particular, we may obtain the distances between all pairs of distinct visitors (identified by a badge ID) and, at the same time, choose only those pairs whose distance is less than a prefixed threshold α. In fact, we may first create the SQL view visit_times (Query 2), which computes, for each artwork, the total time a badge has spent near that artwork, for all badge ids and for the time interval between t_0 and t_1. In the following, for the sake of simplicity, we assume that every badge has been detected in front of every possible artwork. Under this assumption, the view visit_times returns at least one value for every badge and every artwork.

Query 2: Badge-Artwork visit time
CREATE VIEW visit_times AS
SELECT p.BADGE_ID AS BADGE_ID, a.ID AS A_ID,
       SUM(p.CT) AS V_TIMES
FROM dist_positions p, artwork a
WHERE p.TIMESTMP BETWEEN t_0 AND t_1 AND
      p.X BETWEEN a.AX AND a.AX + a.AW AND
      p.Y BETWEEN a.AY AND a.AY + a.AH AND
      p.Z BETWEEN a.AZ AND a.AZ + 2.70
GROUP BY p.BADGE_ID, a.ID

Using this view, we can compute the total time a single badge spent in front of all the artworks (i.e., total_visit_times in Query 3).

Query 3: Badge total visit time
CREATE VIEW total_visit_times AS
SELECT BADGE_ID, SUM(V_TIMES) AS TOTAL_TIMES
FROM visit_times
GROUP BY BADGE_ID

Then, using the two previous views, namely visit_times and total_visit_times, we can compute the percentage of time spent by any badge in front of any artwork (i.e., visit_times_perc in Query 4).

Query 4: Badge-Artwork percentage time
CREATE VIEW visit_times_perc AS
SELECT v.BADGE_ID AS BADGE_ID, A_ID,
       V_TIMES*100/TOTAL_TIMES AS V_TIMES_PERC
FROM visit_times v, total_visit_times t
WHERE v.BADGE_ID = t.BADGE_ID
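As a quick usage example of the view defined in Query 4, a staff member could rank the artworks by the share of time a given visitor spent in front of them; the badge id 42 below is a made-up value:

SELECT A_ID, V_TIMES_PERC
FROM visit_times_perc
WHERE BADGE_ID = 42
ORDER BY V_TIMES_PERC DESC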
Through the view visit_times_perc, every badge i is thus associated with a vector um_i = (t_1^i, …, t_N^i) collecting the percentages of time it spent in front of each of the N artworks. The distance between two such user models is defined by the following measure:

\delta(um_i, um_j) = \frac{|t_1^i - t_1^j| + \cdots + |t_N^i - t_N^j|}{200} = \frac{1}{200} \sum_{h=1}^{N} |t_h^i - t_h^j| \qquad (1)

In this way, the measure is a real number in [0, 1]. The definition given in (1) is a measure, as it enjoys the properties of positiveness, minimality, and symmetry.

Positiveness. Formula 1 is a sum of absolute, hence non-negative, values; consequently, this property is fulfilled.

Minimality. When two ums coincide, their distance has to be 0, which means that the times spent in front of each artwork are the same (t_h^i = t_h^j, ∀h = 1, …, N) and the overall sum is 0. On the contrary, if two ums are completely different, the two visitors have seen entirely different artworks: whenever t_h^i > 0 then t_h^j = 0, and vice versa. In this case, the sum of the absolute differences equals the sum of all the entries of the two percentage vectors, that is 200, so the measure is equal to 1.

Symmetry. It is given by the absolute values in Formula 1: since |t_h^i − t_h^j| = |t_h^j − t_h^i|, we have δ(um_i, um_j) = δ(um_j, um_i).

Finally, the distance defined in Formula 1 between all pairs of distinct badges can be computed directly from the view visit_times_perc (Query 5).

Query 5: Badge-Badge distances
SELECT v1.BADGE_ID, v2.BADGE_ID,
       SUM(ABS(v1.V_TIMES_PERC - v2.V_TIMES_PERC))/200 AS DIST
FROM visit_times_perc v1, visit_times_perc v2
WHERE v1.A_ID = v2.A_ID AND
      v1.BADGE_ID < v2.BADGE_ID
GROUP BY v1.BADGE_ID, v2.BADGE_ID
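To see Formula 1 and Query 5 at work on concrete numbers, consider a toy museum with N = 3 artworks and two purely illustrative percentage vectors:

\delta\big((70, 20, 10),\ (20, 30, 50)\big) = \frac{|70-20| + |20-30| + |10-50|}{200} = \frac{50 + 10 + 40}{200} = 0.5

Two identical vectors yield 0, while two completely disjoint ones, e.g., (100, 0, 0) and (0, 0, 100), yield (100 + 0 + 100)/200 = 1, matching the minimality discussion above.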
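Finally, to retain only the pairs of visitors whose distance is below the prefixed threshold α mentioned above, a HAVING clause can be appended to Query 5. The following is a minimal sketch, in which the value α = 0.2 is an arbitrary example:

SELECT v1.BADGE_ID, v2.BADGE_ID,
       SUM(ABS(v1.V_TIMES_PERC - v2.V_TIMES_PERC))/200 AS DIST
FROM visit_times_perc v1, visit_times_perc v2
WHERE v1.A_ID = v2.A_ID AND
      v1.BADGE_ID < v2.BADGE_ID
GROUP BY v1.BADGE_ID, v2.BADGE_ID
HAVING SUM(ABS(v1.V_TIMES_PERC - v2.V_TIMES_PERC))/200 < 0.2

The resulting pairs identify visitors with similar behavior, i.e., candidate nearest neighbors for the personalized suggestions discussed in the introduction.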