Multi-modal robotic architecture for object referring tasks aimed at designing new rehabilitation strategies

Chiara Falagario 1,†, Shiva Hanifi 2,†, Maria Lombardi 1,* and Lorenzo Natale 1,*

1 Humanoid Sensing and Perception Group, Istituto Italiano di Tecnologia, Genoa, Italy
2 Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, Luxembourg

Workshop on Advanced AI Methods and Interfaces for Human-Centered Assistive and Rehabilitation Robotics (a Fit4MedRob event) - AIxIA 2024
* Corresponding authors.
† These authors contributed equally.
Emails: chiara.falagario@iit.it (C. Falagario); shiva.hanifi@uni.lu (S. Hanifi); maria.lombardi1@iit.it (M. Lombardi); lorenzo.natale@iit.it (L. Natale)
ORCID: 0009-0007-9381-8985 (C. Falagario); 0009-0001-9719-7342 (S. Hanifi); 0000-0001-5792-5889 (M. Lombardi); 0000-0002-8777-5233 (L. Natale)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
The integration of robotics and Artificial Intelligence (AI) in healthcare applications holds significant potential for the development of innovative rehabilitation strategies. A great advantage of these emerging technologies is the possibility of offering a rehabilitation plan that is personalised to each patient, especially when aiding individuals with neurodevelopmental disorders such as Autism Spectrum Disorder (ASD). In this context, a significant challenge is to endow robots with the ability to understand and replicate human social skills during interactions, while concurrently adapting to environmental stimuli. This extended abstract proposes a preliminary robotic architecture capable of estimating the human partner's attention and recognizing the object to which the human is referring. Our work demonstrates how the robot's ability to interpret human social cues, such as gaze, enhances system usability during object referring tasks.

Keywords
attentive learning architecture, visual-language model, object referring, social assistive robotics, rehabilitation training

1. Introduction
The use of social assistive robots in healthcare is rapidly expanding due to their potential to support individuals with special needs and to enhance engagement during rehabilitation sessions, leading to improved therapy outcomes [1, 2, 3, 4]. In this context, the ability of robots to understand the human mental state plays a pivotal role in designing new rehabilitation strategies to assist frail people and patients. Designing and implementing a robust robotic visual system capable of perceiving and interpreting typical human social cues is essential for enabling natural and effective interactions between humans and robots. Visual perception enables the robot to understand the surrounding environment, anticipate human intentions, and assist people appropriately even in simple tasks (for example, reaching and grasping an object). The availability of such technologies will open up the possibility of offering rehabilitation plans that are personalised to each patient and that best fit individual needs. Among the multitude of social cues characterising human-human interactions that can be embedded in an assistive robot, attention and referring understanding are crucial abilities for any task-oriented interaction, and they have attracted great attention in the computer vision community [5, 6, 7]. Referring understanding tasks aim at localising objects (or regions of interest) in images or videos using a natural language description provided by a human as input. However, in a real-world scenario, the referring expression could be ambiguous or incomplete. For example, the referring expression "Could you pass me that cracker box, please?" is ambiguous if there is more than one cracker box in the scene. In this case, in order to improve referring accuracy, the gaze signal can be used together with natural language as a complementary cue (people often utilise gaze to confirm the referred target while interacting).
Having a multi-modal attentive robotic system able to integrate natural language with the social cue of gaze can be a valuable tool, especially in rehabilitation for social disorders such as Autism Spectrum Disorder (ASD). Studies suggest that children with ASD prefer interacting with robots and exhibit increased engagement, in particular with human-like robots featuring verbal abilities, since robots are more predictable and provide more controlled visual stimuli [8, 9, 10, 11]. This suggests that robots can be effective tools for assessing and potentially improving social interaction and communication abilities in children with ASD. Children with ASD may experience challenges with both verbal and nonverbal skills. For example, some children may be very limited in communicating through speech or language, and some may have difficulties in establishing the correct visual focus of attention [12, 13].
The work presented in this extended abstract is part of a broader project aimed at developing new robot-assisted rehabilitation strategies for children with neurodevelopmental disorders, based on face-to-face human-robot interactions involving the manipulation of physical objects. Within the scope of the project, the considered training protocol consists of the child and the robot collaborating to fulfil a shared task, such as picking and placing objects or handing a series of different objects to each other. In order to make the robot aware of the object of interest, even when interacting with children with reduced communication skills, the proposed robotic perception system has been designed to address object referring tasks by integrating the language description with human attention estimation. Specifically, the system takes as input an image with a caption in natural language and outputs the object the human is referring to. By combining verbal and non-verbal cues in one multi-modal architecture, the robot can understand the object referred to by the human even with an incomplete or ambiguous description, increasing its usability and helping to perform the task more efficiently.
In our study, we chose to use the humanoid robot iCub [14]. Its design strikes a balance between being sufficiently human-like and avoiding the uncanny valley effect (see [15]), which can occur with overly human-like android robots [16]. Studies presented in [17] have shown that children with ASD respond well to the iCub robot, making it an ideal choice for our research.

2. Related works
Very few learning architectures exist in the current literature addressing the problem of object referring by combining natural language with additional inputs. Among them, Vasudevan et al.
[18] proposed a multi-modal architecture combining the text description with different input sources such as gaze estimation, optical flow for motion features and depth maps. However, not all of the aforementioned input sources are always available when considering different application scenarios. For example, in the considered rehabilitation scenario, the iCub humanoid robot is equipped only with low-resolution RGB cameras, making depth estimation from the image a challenging task. The work proposed in [6] overcame this limitation of [18] by combining the text description only with the gaze signal, reaching even higher object referring accuracy. However, that pipeline was designed to detect human attention targets while users look at images on screen-based devices, such as tablets and smartphones. This scenario does not align with the conditions of a rehabilitation session, where the child and the humanoid robot are required to interact online on a collaborative task. To overcome the aforementioned limitations and meet the needs of a rehabilitation setting, the framework proposed in this extended abstract is specifically designed to run online on a robotic platform such as iCub while using only the RGB information coming from the cameras.

3. Attentive robotic architecture for object referring tasks
The proposed system is composed of two main blocks, each one based on a different computer vision architecture: a human attention model, designed to estimate the human target of attention (Gaze), and an object detection model (MDETR - Modulated Detection Transformer [19]), responsible for detecting and recognizing objects in the scene. For this reason, we refer to our system as GazeMDETR.

Human attention estimation. The human attention model is responsible for estimating the focus of a human's gaze in a given scene. In this study, we use the fine-tuned VTD (Visual Target Detection) model proposed in [20], which provides a more comprehensive gaze target distribution within the scene. This refinement is particularly suited to tabletop scenarios, a common setting in healthcare applications. The work was based on the Visual Target Detection (VTD) system [21], which uses a spatio-temporal architecture to predict gaze targets in real-time video streams. VTD combines head orientation and scene features by leveraging an EfficientNetB5 convolutional network as a feature extractor, enhanced with an attention mechanism. Specifically, the module takes as input the image and the human face bounding box (extracted using [22, 23]) and provides as output an attention heatmap representing the image area that most likely contains the target of human attention. The returned heatmap is an image-sized matrix, where each cell corresponds to an image pixel. The value of each cell ranges from 0 to 1 (respectively, the lowest and the highest probability of being, or being close to, the target of human attention).

Object detection. The object detection model is based on MDETR [19], an end-to-end framework that detects objects within images conditioned on natural language text given as input, such as captions or questions. Briefly, MDETR uses a combination of convolutional neural networks (CNNs) and transformer-based encoders to fuse visual and textual data, allowing the model to align objects with free-form text descriptions. MDETR is able to detect nuanced concepts from free-form text, and generalizes to unseen combinations of categories and attributes.
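To make the interface between the two modules concrete, the attention heatmap described above can be read as a simple data contract. The following sketch is purely illustrative: the function name and the NumPy-based checks are our own assumptions and are not part of the VTD implementation.

```python
import numpy as np

def peak_of_attention(heatmap: np.ndarray) -> tuple[int, int]:
    """Return the pixel (row, col) most likely to be the gazed target.

    `heatmap` is assumed to be the image-sized matrix produced by the
    attention module: one cell per image pixel, values in [0, 1], where
    1 is the highest probability of being (or being close to) the target.
    """
    assert heatmap.ndim == 2, "one cell per image pixel"
    assert 0.0 <= heatmap.min() <= heatmap.max() <= 1.0, "values in [0, 1]"
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(row), int(col)
```

Note that GazeMDETR itself does not collapse the heatmap to a single peak: the full map is used to weight the detector features, as described below.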
MDETR has been pre-trained on large multi-modal datasets and then fine-tuned to solve different downstream tasks, such as phrase grounding, visual question answering, referring expression detection and segmentation. In this work, MDETR is used for the referring expression detection task (i.e., given an image and a referring expression in plain text, the system returns the bounding box around the referred object).

Combining attention with object detection. GazeMDETR integrates the Human attention estimation module and the Object detection module in one multi-modal architecture, as shown in Figure 1. Specifically, in order to merge the gaze information into the object detection, the attention heatmap produced by the Human attention module was first downsampled to match the dimensions of the feature map produced by the MDETR backbone, and then normalised to the range (0.5, 1). The resulting heatmap was finally multiplied with the convolutional feature map (Figure 1). By integrating the gaze information from the VTD module with the object detection capabilities of MDETR, GazeMDETR provides a more context-aware detection framework. The fusion of these two systems enables GazeMDETR to detect objects within complex scenes while also inferring the primary focus of human attention. This means that the model is able to prioritize relevant objects based on the social cue of gaze (even in cluttered scenarios), offering enhanced accuracy in object detection tasks.

Figure 1: GazeMDETR architecture. The Human attention estimation module is composed of the pose estimation model [22], the face detection model [23] and the VTD architecture [21]. The output attention heatmap is then used as input to the Object detection module to weight the feature map extracted by MDETR's convolutional backbone [19]. The final output is the bounding box of the object the human is referring to.

4. Methods and Preliminary results
In order to evaluate the performance of the proposed system, a test set was collected in which different human participants looked at several objects in different cluttered scenarios. The same test set was then also used to compare our system with MDETR (used as baseline).

Data collection. A total of 4 participants were involved in the data collection (2 females, 2 males; age: mean 27, sd 3.54). All participants had normal or corrected-to-normal vision and provided written informed consent. The data collection was conducted using the camera of the iCub robot [14] positioned on one side of a table, while the human participant stood on the other side. Up to 11 objects chosen from the YCB dataset [24] were placed on the table together with regular office objects, so as to increase the difficulty of the task. The participants were instructed to look at the requested object in a natural and spontaneous manner. For each session and for each trial, each object was gazed at for 5 seconds by the participant. Each participant completed three recording sessions, each one characterised by a specific arrangement of objects (Figure 2); note that in a single session, the same object can be present multiple times:
1. Heterogeneous cluttered scenario: coffee can, stapler, journals, mustard bottles, chips can, sugar boxes, cracker boxes;
2. Scenario with only boxes: baby food boxes, pudding boxes, cracker boxes, sugar boxes;
3. Scenario with only repeated objects: cracker boxes, mustard bottles.
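Before turning to the evaluation, the gaze-weighting step described in Section 3 can be summarised with a short sketch. This is an illustrative reconstruction under our own assumptions (the function name, the bilinear downsampling, and the per-image min-max rescaling used to map values into (0.5, 1) are ours), not the exact GazeMDETR implementation.

```python
import torch
import torch.nn.functional as F

def fuse_gaze_with_features(conv_features: torch.Tensor,
                            attention_heatmap: torch.Tensor) -> torch.Tensor:
    """Weight a backbone feature map with a gaze heatmap.

    conv_features:     (B, C, Hf, Wf) feature map from the detector backbone.
    attention_heatmap: (B, 1, H, W) image-sized heatmap with values in [0, 1].
    """
    # 1) Downsample the heatmap to the spatial size of the feature map.
    heat = F.interpolate(attention_heatmap, size=conv_features.shape[-2:],
                         mode="bilinear", align_corners=False)

    # 2) Rescale the values into (0.5, 1) so that non-attended regions are
    #    attenuated rather than suppressed entirely.
    h_min = heat.amin(dim=(-2, -1), keepdim=True)
    h_max = heat.amax(dim=(-2, -1), keepdim=True)
    heat = 0.5 + 0.5 * (heat - h_min) / (h_max - h_min + 1e-6)

    # 3) Multiply the heatmap with the convolutional feature map.
    return conv_features * heat
```

Presumably, the weighted feature map then takes the place of the original backbone output in the rest of the detection pipeline (Figure 1).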
Figure 2: Sample frame for each scenario. From the left, the participant was asked to gaze at the coffee can (session 1), at the big cracker box on the right (session 2), and at the mustard bottle on the left (session 3).

Evaluation on the cluttered test set. We evaluate and compare the performance of MDETR and GazeMDETR using Accuracy@1 (Acc@1). For each image, the bounding box of the predicted referred object is compared with the ground truth: if the bounding box overlaps the gazed object beyond a certain threshold, the prediction is counted as a true positive; otherwise, it is counted as a false positive. Note that if more than one bounding box is returned as output, only the bounding box with the highest confidence value is selected. The overlap between the bounding boxes was evaluated in terms of Intersection over Union (IoU), and the threshold was set at 0.5. The accuracy is reported as the average value across all participants and all objects within a session, using captions at different levels of detail. Specifically, we considered 4 different captions having a different number of attributes related to the referred object: 1) pose + color + name + placement, 2) pose + name + placement, 3) color + name, 4) name. "Pose" refers to the object orientation (e.g., vertical/horizontal), while "placement" refers to the object position (e.g., on the left/on the right). This degree of detail is useful to study the performance of the models with ambiguous or incomplete sentences. Table 1 reports the accuracy of MDETR and GazeMDETR for each caption and each session.

Table 1: Comparison between MDETR and GazeMDETR. The mean Acc@1 is reported for the most significant captions for all the sessions. Specifically: A1 = The + pose + color + name + placement, A2 = The + pose + name + placement, A3 = The + color + name, A4 = The + name. Highest performance for each caption is in bold.

Session  Caption  GazeMDETR [Acc@1]  MDETR [Acc@1]
1        A1       0.69               0.45
1        A2       0.32               0.32
1        A3       0.78               0.61
1        A4       0.54               0.43
2        A1       0.47               0.53
2        A2       0.43               0.46
2        A3       0.51               0.29
2        A4       0.34               0.14
3        A1       0.94               0.89
3        A2       0.79               0.99
3        A3       0.92               0.46
3        A4       0.86               0.46

5. Discussion and future directions
The results in Table 1 compare GazeMDETR and MDETR in terms of accuracy across the three sessions, with captions varying in complexity from detailed descriptions to simpler ones. While for captions A1 and A2 (the more detailed captions) GazeMDETR and MDETR can be considered comparable alternatives, GazeMDETR shows a major improvement for captions A3 and A4 (the less detailed captions) in all the sessions. For example, in session 3 GazeMDETR scores accuracy values of 0.92 and 0.86 for A3 and A4 respectively, while MDETR performance drops drastically to 0.46 in both cases. Given the promising results and the effect that the caption has on the object detection accuracy, ongoing work is focused on further analysing the capabilities of GazeMDETR with more natural input text, trying to simulate a human request during an interaction. Examples of input descriptions that can be considered, with different levels of detail, are: "Please, could you pass me the + object", "Look at the + object", "Point at the + object".
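For reference, the Acc@1 criterion behind the numbers discussed above can be sketched as follows. This is an illustrative reimplementation under our own assumptions about the box format (x1, y1, x2, y2) and naming, not the actual evaluation script.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def accuracy_at_1(predictions, ground_truth_boxes, iou_threshold=0.5):
    """Fraction of images whose top-confidence detection hits the gazed object.

    predictions:        one list of (box, confidence) pairs per image.
    ground_truth_boxes: one ground-truth box per image.
    """
    hits = 0
    for detections, gt_box in zip(predictions, ground_truth_boxes):
        if not detections:
            continue  # no detection counts as a miss
        best_box, _ = max(detections, key=lambda d: d[1])  # highest confidence only
        if iou(best_box, gt_box) >= iou_threshold:
            hits += 1
    return hits / len(ground_truth_boxes)
```

With the IoU threshold fixed at 0.5, only the highest-confidence detection is compared against the gazed object, matching the protocol described above.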
Having a perception system robust to the level of detail in object referring is crucial to enhance usability and user experience, especially for people with ASD and reduced verbal skills, resulting in smoother communication and greater engagement during the rehabilitation sessions. The next step will be the implementation of the GazeMDETR model on a robotic platform such as the iCub humanoid robot (in this work, the robot's camera has been used only for data collection). Endowing iCub with such an architecture will allow the robot to be aware of the surrounding environment and of the patient while performing the training trials. In order to obtain a socially assistive humanoid robot, the proposed perception system will also be combined with other learning algorithms implementing further social cues, such as action recognition and mutual gaze estimation.

Funding
This work received funding under the project Fit for Medical Robotics (Fit4MedRob) - PNRR MUR Cod. PNC0000007 - CUP: B53C22006960001.

References
[1] H. I. Krebs, J. J. Palazzolo, L. Dipietro, M. Ferraro, J. Krol, K. Rannekleiv, B. T. Volpe, N. Hogan, Rehabilitation robotics: Performance-based progressive robot-assisted therapy, Autonomous Robots 15 (2003) 7–20.
[2] S. Boucenna, A. Narzisi, E. Tilmont, F. Muratori, G. Pioggia, D. Cohen, M. Chetouani, Interactive technologies for autistic children: A review, Cognitive Computation 6 (2014) 722–740.
[3] J. Fan, L. C. Mion, L. Beuscher, A. Ullal, P. A. Newhouse, N. Sarkar, SAR-Connect: A socially assistive robotic system to support activity and social engagement of older adults, IEEE Transactions on Robotics 38 (2021) 1250–1269.
[4] X. Yang, X. Shi, X. Xue, Z. Deng, Efficacy of robot-assisted training on rehabilitation of upper limb function in patients with stroke: A systematic review and meta-analysis, Archives of Physical Medicine and Rehabilitation 104 (2023) 1498–1513.
[5] A. Khoreva, A. Rohrbach, B. Schiele, Video object segmentation with language referring expressions, in: Computer Vision – ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part IV 14, Springer, 2019, pp. 123–141.
[6] J. Chen, X. Zhang, Y. Wu, S. Ghosh, P. Natarajan, S.-F. Chang, J. Allebach, One-stage object referring with gaze estimation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5021–5030.
[7] D. Wu, W. Han, T. Wang, X. Dong, X. Zhang, J. Shen, Referring multi-object tracking, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14633–14642.
[8] S. Baron-Cohen, Empathizing, systemizing, and the extreme male brain theory of autism, Progress in Brain Research 186 (2010) 167–175.
[9] M. Hart, Autism/Excel study, in: Proceedings of the 7th International ACM SIGACCESS Conference on Computers and Accessibility, 2005, pp. 136–141.
[10] J. Lee, H. Takehashi, C. Nagai, G. Obinata, D. Stefanov, Which robot features can stimulate better responses from children with autism in robot-assisted therapy?, International Journal of Advanced Robotic Systems 9 (2012) 72.
[11] L. V. Calderita, L. J. Manso, P. Bustos, C. Suárez-Mejías, F. Fernández, A. Bandera, Therapist: Towards an autonomous socially interactive robot for motor and neurorehabilitation therapies for children, JMIR Rehabilitation and Assistive Technologies 1 (2014) e3151.
[12] A. Di Nuovo, D. Conti, G. Trubia, S. Buono, S. Di Nuovo, Deep learning systems for estimating visual attention in robot-assisted therapy of children with autism and intellectual disability, Robotics 7 (2018) 25.
[13] A. Alabdulkareem, N. Alhakbani, A. Al-Nafjan, A systematic review of research on robot-assisted therapy for children with autism, Sensors 22 (2022) 944.
[14] G. Metta, L. Natale, F. Nori, G. Sandini, The iCub project: An open source platform for research in embodied cognition, in: Advanced Robotics and its Social Impacts, 2011, pp. 24–26.
[15] M. Mori, K. F. MacDorman, N. Kageki, The uncanny valley [from the field], IEEE Robotics & Automation Magazine 19 (2012) 98–100.
[16] M. Mara, M. Appel, T. Gnambs, Human-like robots and the uncanny valley, Zeitschrift für Psychologie (2022).
[17] D. Ghiglino, F. Floris, D. De Tommaso, K. Kompatsiari, P. Chevalier, T. Priolo, A. Wykowska, Artificial scaffolding: Augmenting social cognition by means of robot technology, Autism Research 16 (2023) 997–1008.
[18] A. B. Vasudevan, D. Dai, L. Van Gool, Object referring in videos with language and human gaze, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[19] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, N. Carion, MDETR - Modulated detection for end-to-end multi-modal understanding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1780–1790.
[20] S. Hanifi, E. Maiettini, M. Lombardi, L. Natale, iCub detecting gazed objects: A pipeline estimating human attention, 2024. URL: https://arxiv.org/abs/2308.13318. arXiv:2308.13318.
[21] E. Chong, Y. Wang, N. Ruiz, J. M. Rehg, Detecting attended visual targets in video, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5396–5406.
[22] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, Y. Sheikh, OpenPose: Realtime multi-person 2D pose estimation using part affinity fields, 2019. URL: https://arxiv.org/abs/1812.08008. arXiv:1812.08008.
[23] M. Lombardi, E. Maiettini, V. Tikhanoff, L. Natale, iCub knows where you look: Exploiting social cues for interactive object detection learning, in: 2022 IEEE-RAS 21st International Conference on Humanoid Robots (Humanoids), 2022, pp. 480–487. doi:10.1109/Humanoids53995.2022.10000163.
[24] B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, A. M. Dollar, Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set, IEEE Robotics & Automation Magazine 22 (2015) 36–52.