<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title></journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multi-modal robotic architecture for object referring tasks aimed at designing new rehabilitation strategies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chiara Falagario</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shiva Hanifi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Lombardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Natale</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Humanoid Sensing and Perception Group, Istituto Italiano di Tecnologia</institution>
          ,
          <addr-line>Genoa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg</institution>
          ,
          <country country="LU">Luxembourg</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
<p>The integration of robotics and Artificial Intelligence (AI) in healthcare applications holds significant potential for the development of innovative rehabilitation strategies. A great advantage of these emerging technologies is the possibility to offer a rehabilitation plan that is personalised to each patient, especially in aiding individuals with neurodevelopmental disorders, such as Autism Spectrum Disorder (ASD). In this context, a significant challenge is to endow robots with the ability to understand and replicate human social skills during interactions, while concurrently adapting to environmental stimuli. This extended abstract proposes a preliminary robotic architecture capable of estimating the human partner's attention and recognizing the object to which the human is referring. Our work demonstrates how the robot's ability to interpret human social cues, such as gaze, enhances system usability during object referring tasks.</p>
      </abstract>
      <kwd-group>
<kwd>attentive learning architecture</kwd>
        <kwd>visual-language model</kwd>
        <kwd>object referring</kwd>
        <kwd>social assistive robotics</kwd>
        <kwd>rehabilitation training</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The use of social assistive robots in healthcare is rapidly expanding due to their potential to support
individuals with special needs and enhance engagement during rehabilitation sessions, leading to
improved therapy outcomes [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. In this context, the ability of robots to understand human
mental states plays a pivotal role in designing new rehabilitation strategies to assist frail
people and patients. Designing and implementing a robust robotic visual system capable of perceiving
and interpreting typical human social cues is essential for enabling natural and effective interactions
between humans and robots. Visual perception enables the robot to understand the surrounding
environment, anticipate human intentions, and assist people appropriately even in simple tasks (for
example, reaching and grasping an object). The availability of such technologies will open the possibility to
offer rehabilitation plans that are personalised to each patient and that can best fit individual needs.
      </p>
      <p>
        Among the multitude of social cues characterising human-human interactions that can be embedded
in an assistive robot, attention and referring understanding are crucial abilities for any task-oriented
interaction, attracting great attention in the computer vision community [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ]. Referring understanding
tasks aim at localising objects (or regions of interest) in images or videos by using natural language
descriptions provided by humans as input. However, in a real-world scenario, the referring expression could be
ambiguous or incomplete. For example, the referring expression “Could you pass
me that cracker box, please?” is ambiguous if there is more than one cracker box in the scene. In this case, in
order to improve the referring accuracy, the gaze signal can be used together with natural language as a
complementary cue (people often utilise gaze to confirm the referred target while interacting).
      </p>
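      <p>
        The disambiguation idea above can be made concrete: among candidate boxes matching an ambiguous expression, prefer the one receiving the most gaze attention. Below is a minimal NumPy sketch; the helper name and the centre-of-box scoring rule are illustrative choices of ours, not part of any cited system.
      </p>

```python
import numpy as np

def disambiguate_with_gaze(candidate_boxes, heatmap):
    """Pick, among boxes matching an ambiguous expression, the one whose
    centre receives the most gaze attention. Boxes are (x1, y1, x2, y2)
    in pixel coordinates; heatmap is an image-sized array in [0, 1]."""
    scores = []
    for x1, y1, x2, y2 in candidate_boxes:
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
        scores.append(heatmap[cy, cx])  # attention value at the box centre
    return candidate_boxes[int(np.argmax(scores))]

# Two "cracker box" candidates; the gaze heatmap peaks near the second one.
heatmap = np.zeros((100, 100))
heatmap[50, 80] = 1.0
boxes = [(10, 40, 30, 60), (70, 40, 90, 60)]
print(disambiguate_with_gaze(boxes, heatmap))  # -> (70, 40, 90, 60)
```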
      <p>
        Having a multi-modal attentive robotic system able to integrate natural language with the social cue
of gaze can be a valuable tool, especially in rehabilitation for social disorders like Autism Spectrum
Disorder (ASD). Studies suggest that children with ASD prefer interacting with robots and exhibit
increased engagement, particularly with human-like robots featuring verbal abilities, since robots are
more predictable and present more controlled visual stimuli [
        <xref ref-type="bibr" rid="ref10 ref11 ref8 ref9">8, 9, 10, 11</xref>
        ]. This suggests that robots can be effective tools for
assessing and potentially improving social interaction and communication abilities in children with
ASD. Children with ASD may experience challenges with both verbal and nonverbal skills. For example,
some children may be very limited in communicating using speech or language, and some may have
difficulties in establishing the correct visual focus of attention [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ].
      </p>
      <p>The work presented in this extended abstract is part of a broader project aiming at developing
new robot-assisted rehabilitation strategies for children with neurodevelopmental disorders based on
face-to-face human-robot interactions involving the manipulation of physical objects. Within the scope of
the project, the considered training protocol consists in the child and the robot collaborating to fulfil a
shared task, such as picking and placing objects or handing a series of different objects to each other. In
order to make the robot aware of the object of interest while interacting also with children with reduced
communication skills, the proposed robotic perception system has been designed to address object
referring tasks by integrating language descriptions with human attention estimation. Specifically,
the system takes as input an image with a caption in natural language and returns as output the object
the human is referring to. By combining verbal and non-verbal cues in one multi-modal architecture, the
robot can understand the object referred to by the human even with an incomplete or ambiguous description,
increasing its usability and helping to perform the task in a more efficient way.</p>
      <p>
        In our study, we chose to use the humanoid robot iCub [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Its design strikes a balance between
being sufficiently human-like and avoiding the uncanny valley effect (see [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]), which can occur with
overly human-like android robots [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Studies presented in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] have shown that children with ASD
respond well to the iCub robot, making it an ideal choice for our research.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        Very few learning architectures exist in the current literature addressing the problem of object referring
by combining natural language with additional inputs. Among them, Vasudevan et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] proposed
a multi-modal architecture combining the text description with different input sources such as gaze
estimation, optical flow for motion features, and a depth map. However, not all the aforementioned input
sources are always available when considering different application scenarios. For example, in the considered
rehabilitation scenario, the iCub humanoid robot is equipped with low-resolution RGB-only cameras,
making depth estimation from the image a challenging task. The work proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] overcame the
problem in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] by combining the text description only with the gaze signal, reaching even higher object
referring accuracy. However, the proposed pipeline was designed to detect human attention targets
while looking at images on screen-based devices, like tablets and smartphones. This scenario
does not align with the conditions of a rehabilitation session, where the child and the humanoid robot
are required to interact online on a collaborative task. To overcome the aforementioned limitations
and meet the needs of a rehabilitation setting, the framework proposed in this extended abstract is
specifically designed to run online on a robotic platform like iCub while using only RGB information
coming from the cameras.
      </p>
    </sec>
    <sec id="sec-2b">
      <title>3. Attentive robotic architecture for object referring tasks</title>
      <p>
        The proposed system is composed of two main blocks, each based on a different computer vision
architecture: a human attention model, designed to estimate the human target of attention (Gaze), and
an object detection model (MDETR - Modulated Detection Transformer [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]), responsible for detecting
and recognizing objects in the scene. For this reason, we refer to our system as GazeMDETR.
      </p>
      <p>
        Human attention estimation. The human attention model is responsible for estimating the focus of
a human's gaze in a given scene. In this study, we use the fine-tuned VTD (Visual Target Detection) model
proposed in [20], which provides a more comprehensive gaze target distribution within the scene. This
refinement is particularly suited to tabletop scenarios, a common setting in healthcare applications.
The work was based on the Visual Target Detection (VTD) system [21], which uses a spatio-temporal
architecture to predict gaze targets in real-time video streams. VTD combines both head orientation and
scene features by leveraging an EfficientNetB5 convolutional network as a feature extractor, enhanced
with an attention mechanism. Specifically, the module takes as input the image and the human face
bounding box (extracted by using [22, 23]) and provides as output an attention heatmap representing
the image area that most likely contains the target of human attention. The returned heatmap is an
image-sized matrix, where each cell corresponds to an image pixel. The value of each cell ranges from
0 to 1 (respectively, the lowest and the highest probability of being, or being close to, the target of human
attention).
      </p>
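      <p>
        As a toy illustration of the heatmap format described above (an image-sized matrix of values in [0, 1]), the following NumPy sketch min-max normalises a raw attention map and upsamples it to image size. The function name and the nearest-neighbour resampling are illustrative choices of ours, not the VTD implementation.
      </p>

```python
import numpy as np

def to_probability_map(raw, image_shape):
    """Sketch: turn a raw attention map into an image-sized heatmap
    with values in [0, 1] (0 = least likely, 1 = most likely target)."""
    # Min-max normalise so values span [0, 1].
    raw = (raw - raw.min()) / (raw.max() - raw.min() + 1e-8)
    # Nearest-neighbour upsample to the image resolution (no extra deps).
    rows = np.linspace(0, raw.shape[0] - 1, image_shape[0]).astype(int)
    cols = np.linspace(0, raw.shape[1] - 1, image_shape[1]).astype(int)
    return raw[np.ix_(rows, cols)]
```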
      <p>
        Object detection. The object detection model is based on MDETR [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], an end-to-end framework
that detects objects within images conditioned on natural language text given as input, such as captions or
questions. Briefly, MDETR uses a combination of convolutional neural networks (CNNs) and transformer-based
encoders to fuse visual and textual data, allowing the model to align objects with free-form text
descriptions. MDETR is able to detect nuanced concepts from free-form text and generalizes to unseen
combinations of categories and attributes. MDETR has been pre-trained on large multi-modal datasets
and then fine-tuned to solve different downstream tasks, such as phrase grounding, visual
question answering, and referring expression detection and segmentation. In this work, MDETR is used
for the referring expression detection task (i.e., given an image and a referring expression
in plain text, the system returns the bounding box around the referred object).
      </p>
      <p>Combining attention with object detection. GazeMDETR integrates the Human attention
estimation module and the Object detection module in one multi-modal architecture, as shown in Figure
1. Specifically, in order to merge the gaze information into the object detection, the attention heatmap
produced by the Human attention module was first downsampled to match the dimensions of the feature
map produced by the MDETR backbone, and then normalised in the range of (0.5, 1). The resulting
heatmap was finally multiplied with the convolutional feature map (Figure 1). By integrating the gaze
information from the VTD module with the object detection capabilities of MDETR, GazeMDETR
provides a more context-aware detection framework. The fusion of these two systems enables GazeMDETR
to detect objects within complex scenes while also inferring the primary focus of human attention.
This means that the model is able to prioritize relevant objects based on the social cue of gaze (also in
cluttered scenarios), offering enhanced accuracy in object detection tasks.</p>
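      <p>
        The fusion step (downsample the heatmap to the feature-map resolution, rescale to (0.5, 1), multiply into the backbone features) can be sketched in NumPy as follows. This is an illustrative approximation of the pipeline: it uses nearest-neighbour downsampling, whereas the actual implementation may interpolate differently.
      </p>

```python
import numpy as np

def fuse_gaze_with_features(heatmap, features):
    """Sketch of the fusion step. heatmap: image-sized array in [0, 1];
    features: (C, Hf, Wf) convolutional feature map from the backbone."""
    c, hf, wf = features.shape
    # 1) Downsample the heatmap to the feature-map resolution.
    rows = np.linspace(0, heatmap.shape[0] - 1, hf).astype(int)
    cols = np.linspace(0, heatmap.shape[1] - 1, wf).astype(int)
    small = heatmap[np.ix_(rows, cols)]
    # 2) Rescale to (0.5, 1) so non-attended regions are damped, not erased.
    small = 0.5 + 0.5 * (small - small.min()) / (small.max() - small.min() + 1e-8)
    # 3) Multiply the rescaled heatmap into every feature channel.
    return features * small[None, :, :]
```

Rescaling to (0.5, 1) rather than (0, 1) keeps features outside the gaze region available to the detector, so gaze biases rather than gates the detection.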
    </sec>
    <sec id="sec-3">
      <title>4. Methods and Preliminary results</title>
      <p>
        In order to evaluate the performance of the proposed system, a test set was collected in which different
human participants looked at several objects in different cluttered scenarios. The same test set was
then also used to compare our system against MDETR (used as baseline).
      </p>
      <p>
        Data collection. A total of 4 participants were involved in the data collection (2 females, 2 males,
age: mean 27, sd 3.54). All participants had normal or corrected-to-normal vision and provided written
informed consent. The data collection was conducted using the camera of the iCub robot [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] positioned
on one side of a table, while the human participant stood on the other side. On the table were placed
up to 11 objects chosen from the YCB dataset [24] together with regular office objects, so as to increase
the difficulty of the task. The participants were instructed to look at the requested object in a natural
and spontaneous manner. For each session and for each trial, each object was gazed at for 5 seconds
by the participant. Each participant completed three recording sessions, each one characterised by a
specific arrangement of objects (Figure 2) - note that in a single session, the same object can be present
multiple times:
1. Heterogeneous cluttered scenario: coffee can, stapler, journals, mustard bottles, chips can, sugar
boxes, crackers boxes;
2. Scenario with only boxes: baby food boxes, pudding boxes, crackers boxes, sugar boxes;
3. Scenario with only repeated objects: crackers boxes, mustard bottles.
      </p>
      <p>
        Evaluation on the cluttered test set. We evaluate and compare the performance of MDETR
and GazeMDETR using Accuracy@1 (Acc@1). For each image, the bounding box of the predicted
referred object is compared with the ground truth: if the bounding box overlaps the gazed object
over a certain threshold, the prediction is counted as a true positive; otherwise, it is counted as a false
positive. Note that if more than one bounding box is returned as output, only the bounding box with
the highest confidence value is selected. The overlap between the bounding boxes was evaluated
in terms of Intersection over Union (IoU) and the threshold was set at 0.5.
      </p>
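      <p>
        The evaluation criterion can be made concrete with a short plain-Python sketch of IoU and Acc@1, using the 0.5 threshold described above (predictions here are the single highest-confidence box already selected per image):
      </p>

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def acc_at_1(predictions, ground_truths, threshold=0.5):
    """Acc@1: fraction of images whose highest-confidence predicted box
    overlaps the gazed object's ground truth with IoU over the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```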
      <p>The accuracy is reported as the average value evaluated across all participants and all objects within a
session, using captions at different levels of detail. Specifically, we considered 4 different captions having
a different number of attributes related to the referred object: 1) pose + color + name + placement,
2) pose + name + placement, 3) color + name, 4) name. “Pose” refers to the object orientation (e.g.,
vertical/horizontal), while “placement” refers to the object position (e.g., on the left/on the right).
This degree of detail is useful to study the performance of the models with ambiguous or incomplete
sentences. Table 1 reports the accuracy of MDETR and GazeMDETR for each caption and each session.</p>
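      <p>
        The four caption levels can be illustrated with a small helper; the wording of the templates is ours and the exact captions used in the study may differ.
      </p>

```python
def build_captions(name, color=None, pose=None, placement=None):
    """Illustrative construction of the four caption levels (A1..A4)."""
    return {
        "A1": f"the {pose} {color} {name} {placement}",  # pose + color + name + placement
        "A2": f"the {pose} {name} {placement}",          # pose + name + placement
        "A3": f"the {color} {name}",                     # color + name
        "A4": f"the {name}",                             # name only
    }

# e.g. build_captions("crackers box", color="red", pose="vertical",
#                     placement="on the left")["A3"] -> "the red crackers box"
```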
      <p>Table 1. Acc@1 of MDETR and GazeMDETR for each session (1, 2, 3) and each caption level (A1, A2, A3, A4).</p>
    </sec>
    <sec id="sec-4">
      <title>5. Discussion and future directions</title>
      <p>The results in Table 1 compare GazeMDETR and MDETR in terms of accuracy across the three sessions,
with captions varying in complexity from detailed descriptions to simpler ones. While for captions
A1 and A2 (the more detailed captions) GazeMDETR and MDETR can be considered comparable alternatives,
GazeMDETR shows a major improvement for captions A3 and A4 (the less detailed captions) across all
sessions. For example, in session 3 GazeMDETR scores accuracy values of 0.92 and
0.86 for A3 and A4 respectively, while MDETR's performance drastically drops to 0.46 in both cases.</p>
      <p>Given the promising results and the effect that the caption has on the object detection accuracy,
ongoing work is focused on further analysing the capabilities of GazeMDETR with more natural
input text, trying to simulate a human request in an interaction. Examples of input descriptions that
can be considered with different levels of detail are: “Please, could you pass me the + object”, “Look at
the + object”, “Point at the + object”. Having a perception system robust to the level of detail in object
referring is crucial to enhance usability and the user experience, especially for people suffering
from ASD with reduced verbal skills, resulting in smoother communication and greater engagement
during the rehabilitation sessions.</p>
      <p>The next step will be the implementation of the GazeMDETR model on a robotic platform like the iCub
humanoid robot (in this work the robot's camera has been used only for data collection). Having such
an architecture embedded in iCub will allow the robot to be aware of the surrounding environment and
of the patient while performing the training trials. In order to have a socially assistive humanoid robot,
the proposed perception system will also be combined with other learning algorithms implementing
further social cues, such as action recognition, mutual gaze estimation, and so on.</p>
    </sec>
    <sec id="sec-5">
      <title>Funding</title>
      <p>This work received funding under the project Fit for Medical Robotics (Fit4MedRob) - PNRR MUR Cod.
PNC0000007 - CUP: B53C22006960001.</p>
      <p>[20] S. Hanifi, E. Maiettini, M. Lombardi, L. Natale, icub detecting gazed objects: A pipeline estimating
human attention, 2024. URL: https://arxiv.org/abs/2308.13318. arXiv:2308.13318.
[21] E. Chong, Y. Wang, N. Ruiz, J. M. Rehg, Detecting attended visual targets in video, in: Proceedings
of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 5396–5406.
[22] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, Y. Sheikh, Openpose: Realtime multi-person 2d pose
estimation using part affinity fields, 2019. URL: https://arxiv.org/abs/1812.08008. arXiv:1812.08008.
[23] M. Lombardi, E. Maiettini, V. Tikhanoff, L. Natale, icub knows where you look: Exploiting social
cues for interactive object detection learning, in: 2022 IEEE-RAS 21st International Conference
on Humanoid Robots (Humanoids), 2022, pp. 480–487. doi:10.1109/Humanoids53995.2022.10000163.
[24] B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, A. M. Dollar, Benchmarking in manipulation
research: Using the yale-cmu-berkeley object and model set, IEEE Robotics &amp; Automation Magazine
22 (2015) 36–52.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H. I.</given-names>
            <surname>Krebs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Palazzolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dipietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferraro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rannekleiv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. T.</given-names>
            <surname>Volpe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <article-title>Rehabilitation robotics: Performance-based progressive robot-assisted therapy</article-title>
          ,
          <source>Autonomous robots 15</source>
          (
          <year>2003</year>
          )
          <fpage>7</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Boucenna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Narzisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tilmont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Muratori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pioggia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chetouani</surname>
          </string-name>
          ,
          <article-title>Interactive technologies for autistic children: A review</article-title>
          ,
          <source>Cognitive Computation 6</source>
          (
          <year>2014</year>
          )
          <fpage>722</fpage>
          -
          <lpage>740</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Mion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beuscher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ullal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Newhouse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <article-title>Sar-connect: a socially assistive robotic system to support activity and social engagement of older adults</article-title>
          ,
          <source>IEEE Transactions on Robotics</source>
          <volume>38</volume>
          (
          <year>2021</year>
          )
          <fpage>1250</fpage>
          -
          <lpage>1269</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>Efficacy of robot-assisted training on rehabilitation of upper limb function in patients with stroke: a systematic review and meta-analysis</article-title>
          ,
          <source>Archives of Physical Medicine and Rehabilitation</source>
          <volume>104</volume>
          (
          <year>2023</year>
          )
          <fpage>1498</fpage>
          -
          <lpage>1513</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Khoreva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          ,
          <article-title>Video object segmentation with language referring expressions</article-title>
          ,
          <source>in: Computer Vision-ACCV 2018: 14th Asian Conference on Computer Vision</source>
          , Perth, Australia, December 2-6,
          <year>2018</year>
          , Revised Selected Papers,
          <source>Part IV 14</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Allebach</surname>
          </string-name>
          ,
          <article-title>One-stage object referring with gaze estimation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>5021</fpage>
          -
          <lpage>5030</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          , W. Han,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Shen,
          <article-title>Referring multi-object tracking</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>14633</fpage>
          -
          <lpage>14642</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Baron-Cohen</surname>
          </string-name>
          ,
          <article-title>Empathizing, systemizing, and the extreme male brain theory of autism</article-title>
          ,
          <source>Progress in brain research 186</source>
          (
          <year>2010</year>
          )
          <fpage>167</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hart</surname>
          </string-name>
          , Autism/excel study,
          <source>in: Proceedings of the 7th International ACM SIGACCESS Conference on Computers and Accessibility</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Takehashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nagai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Obinata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stefanov</surname>
          </string-name>
          ,
          <article-title>Which robot features can stimulate better responses from children with autism in robot-assisted therapy?</article-title>
          ,
          <source>International Journal of Advanced Robotic Systems</source>
          <volume>9</volume>
          (
          <year>2012</year>
          )
          <fpage>72</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Calderita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Manso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bustos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Suárez-Mejías</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bandera</surname>
          </string-name>
          ,
          <article-title>Therapist: towards an autonomous socially interactive robot for motor and neurorehabilitation therapies for children</article-title>
          ,
          <source>JMIR Rehabilitation and Assistive Technologies</source>
          <volume>1</volume>
          (
          <year>2014</year>
          )
          <fpage>e3151</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Di Nuovo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Conti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trubia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Buono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Di</given-names>
            <surname>Nuovo</surname>
          </string-name>
          ,
          <article-title>Deep learning systems for estimating visual attention in robot-assisted therapy of children with autism and intellectual disability</article-title>
          ,
          <source>Robotics</source>
          <volume>7</volume>
          (
          <year>2018</year>
          )
          <fpage>25</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alabdulkareem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Alhakbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Nafjan</surname>
          </string-name>
          ,
          <article-title>A systematic review of research on robot-assisted therapy for children with autism</article-title>
          ,
          <source>Sensors</source>
          <volume>22</volume>
          (
          <year>2022</year>
          )
          <fpage>944</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Metta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Natale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sandini</surname>
          </string-name>
          ,
          <article-title>The iCub project: An open source platform for research in embodied cognition</article-title>
          ,
          <source>in: Advanced Robotics and its Social Impacts</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. F.</given-names>
            <surname>MacDorman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kageki</surname>
          </string-name>
          ,
          <article-title>The uncanny valley [from the field]</article-title>
          ,
          <source>IEEE Robotics &amp; Automation Magazine</source>
          <volume>19</volume>
          (
          <year>2012</year>
          )
          <fpage>98</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Appel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gnambs</surname>
          </string-name>
          ,
          <article-title>Human-like robots and the uncanny valley</article-title>
          ,
          <source>Zeitschrift für Psychologie</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ghiglino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Floris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>De Tommaso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kompatsiari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chevalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Priolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wykowska</surname>
          </string-name>
          ,
          <article-title>Artificial scaffolding: Augmenting social cognition by means of robot technology</article-title>
          ,
          <source>Autism Research</source>
          <volume>16</volume>
          (
          <year>2023</year>
          )
          <fpage>997</fpage>
          -
          <lpage>1008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Van Gool</surname>
          </string-name>
          ,
          <article-title>Object referring in videos with language and human gaze</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Synnaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Carion</surname>
          </string-name>
          ,
          <article-title>MDETR - modulated detection for end-to-end multi-modal understanding</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1780</fpage>
          -
          <lpage>1790</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>