<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Artificial Intelligence: Exploring Multimodal Social Robotics in the Classroom</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dr. Aaron Elkins</string-name>
          <email>aelkins@sdsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Uyiosa Philip Amadasun</string-name>
          <email>uamadasun@sdsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dr. Sabine Matook</string-name>
          <email>s.matook@business.uq.edu.au</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>James Silberrad Brown Center for Artificial Intelligence</institution>
          ,
          <addr-line>5250 Campanile Dr, San Diego, California, 92182</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>San Diego State University</institution>
          ,
          <addr-line>5250 Campanile Dr, San Diego, California, 92182</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The University of Queensland, School of Business</institution>
          ,
          <addr-line>Colin Clark, 39 Blair Dr, St Lucia QLD 4067</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Social robots, equipped with advanced communication abilities, can now serve as versatile classroom assistants. These robots understand and respond to speech, visual cues, and nonverbal signals, creating dynamic learning environments. Drawing on communication theory, they adapt to various learning styles and work alongside teachers to enhance education. We explore personalized instruction and improved student engagement through multimodal interactions with robot teaching assistants. Our research examines potential benefits, including individualized learning, increased motivation, and support for multilingual and special needs education. We address challenges in visual processing, communication, and cognitive load management. This work advances AI in Education by showing how AI-enhanced robots can create more effective and inclusive learning experiences.</p>
      </abstract>
      <kwd-group>
        <kwd>Social Robotics</kwd>
        <kwd>Humanoid Robotics</kwd>
        <kwd>Education</kwd>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Educational Robotics</kwd>
        <kwd>Communication</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The field of Artificial Intelligence in Education (AIED) has evolved significantly with the emergence
of Generative AI and advanced robotics, moving beyond basic service applications to focus on social
and human-centric educational settings. The commercial AIED market is projected to reach 4.5 billion
pounds in 2024, attracting substantial investments from major tech companies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The integration of Large Language Models (LLMs) with humanoid robots presents promising
opportunities for education. Studies demonstrate that advanced Natural Language Processing technologies
enhance student engagement and learning [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Humanoid robots offer unique advantages over
virtual assistants due to their physical embodiment, which leverages the psychological benefits of
face-to-face communication [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Recent research has extensively explored these educational applications [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8 ref9">5, 6, 7, 8, 9</xref>
        ].
      </p>
      <p>While LLMs enable more natural, adaptive social interactions in classroom settings, current
limitations exist, particularly in processing visual context [10]. This raises critical questions about the
minimum capabilities required for effective educational robots. This paper examines how humanoid
robot attributes map to goal-directed communication and robot behavior during student interactions,
advancing both AIED and potential applications in other social domains like healthcare.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Early social interaction research utilized humanoid avatars before physical robot platforms became
widely available [11, 12]. Buller and Burgoon [13] characterized communication as a dynamic process
of mutual influence requiring information management, behavioral control, and image maintenance.
Effective communication demands managing cognitive load while coordinating verbal and nonverbal
behaviors [14], adapting to feedback, and responding to expectation violations [15]. These
communication principles must inform humanoid robot design for educational settings. Robots can support
teachers by managing classroom interactions and providing individualized tutoring, enabling educators
to focus on higher-level instructional objectives while addressing cognitive load and knowledge transfer
challenges in the learning environment.</p>
      <sec id="sec-2-1">
        <title>2.1. Classroom Support</title>
        <p>Cognitive load occurs when mental resources are consumed by multiple simultaneous tasks [14]. In
social interactions, this load can impair an individual’s ability to process contextual information and
adjust their initial trait attributions based on situational factors. This is particularly relevant as people
must process various social cues while regulating their own thoughts and behaviors during interactions.</p>
        <p>The concept of cognitive load has important implications for Human-Robot Interaction (HRI) in
education. Robots must adapt to diverse learning styles and preferences, which vary dynamically
throughout the learning process. The Kolb Experiential Learning Cycle provides a framework for
adapting robot communication strategies to student needs [16, 17], consisting of four stages:
1. Concrete Experience (“Why?”): Having a specific experience and understanding its relevance
2. Reflective Observation (“What?”): Reflecting on observations
3. Abstract Conceptualization (“How?”): Theorizing about observations
4. Active Experimentation (“What if?”): Applying understanding to new situations
Teachers face significant cognitive load when accommodating multiple learning styles
simultaneously, which can impair their ability to adapt teaching methods and maintain effectiveness. Social
humanoid robots, enhanced by AI technologies including computer vision, speech recognition, and
emotion recognition, can provide cognitive support in the classroom [18]. These robots can assist
with personalized learning while teachers focus on classroom management and complex instruction.
Through continuous monitoring of student engagement and performance, robots provide actionable
insights for teaching adaptation [19], creating an effective feedback loop in robot-student interactions.</p>
        <p><bold>2.1.1. Robot-Student Interaction</bold></p>
        <p>An important metric the robot must be able to accurately measure, and report to the teacher,
is the actual success of knowledge transfer from the teacher to the student. Such a metric is itself
multi-faceted. The robot must have a robust and adaptive content delivery system that includes an
inference mechanism to detect effective and ineffective knowledge transfer. To develop such a system,
we analyzed Buller and Burgoon’s schematic model (p. 211) of face-to-face communication to derive
what the stages of effective robot-student interaction might look like in an educational context.</p>
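        <p>To make the Kolb-based adaptation above concrete, the mapping from an inferred learning-cycle stage to a content delivery strategy can be sketched in a few lines. This is a minimal illustration: the stage names follow Kolb [16, 17], but the strategy descriptions and the function and dictionary names are our own illustrative assumptions, not components of a deployed system.</p>

```python
from enum import Enum

class KolbStage(Enum):
    """The four stages of Kolb's Experiential Learning Cycle [16, 17]."""
    CONCRETE_EXPERIENCE = "Why?"
    REFLECTIVE_OBSERVATION = "What?"
    ABSTRACT_CONCEPTUALIZATION = "How?"
    ACTIVE_EXPERIMENTATION = "What if?"

# Hypothetical mapping from an inferred stage to a content delivery strategy;
# the strategy descriptions are illustrative placeholders.
DELIVERY_STRATEGY = {
    KolbStage.CONCRETE_EXPERIENCE: "relate the concept to a specific, relevant experience",
    KolbStage.REFLECTIVE_OBSERVATION: "prompt the student to reflect on what they observed",
    KolbStage.ABSTRACT_CONCEPTUALIZATION: "present the underlying theory behind the observations",
    KolbStage.ACTIVE_EXPERIMENTATION: "pose a new situation that applies the understanding",
}

def select_strategy(stage: KolbStage) -> str:
    """Choose a delivery strategy for the stage the robot currently infers."""
    return DELIVERY_STRATEGY[stage]
```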
        <p>Figure 1, which was inspired by the schematic model, illustrates a schema of such an interaction. We
must state that our derived model is simplified, as it cannot encompass all areas contained within this
relatively recent phenomenon of humanoid robot-human communication in an educational context. Both
the robot and student begin with a pre-interaction mental state shaped by their previous interactions.
In this case, a previous interaction might encompass: (1) not just past relations with the current robot,
but any other robots with which the student might have interacted, and (2) any priming the student
may have undergone to view the robot as an objective authority on knowledge. This will of course
affect the student’s expectations (positive or negative) of the interaction. Studies have found that people
perceive information interlocutors such as robots or humanoid avatars as more objective and unbiased,
which leads people to open up to them [11]. Such factors affect the student’s initial receptiveness to
receiving knowledge from the robot. The more common factors are of course their preferred learning
style and familiarity with the class/course material.</p>
        <p>Another factor we take special interest in is the goals the student may have for the interaction,
which we will discuss in our conclusions (see section 4). The robot’s pre-interaction cognition can be
partly categorized as the directives given to the robot - in this case, to relay information and to provide the
teacher accurate measurements of the effectiveness or ineffectiveness of that relay. Another category
can be nominally identified as the robot’s knowledge. This includes the knowledge stored in any
pre-trained and/or fine-tuned models (for example, pre-trained LLMs), as well as the knowledge
accessible to LLMs via extensions such as Retrieval Augmented Generation or knowledge
graphs (both discussed in section 3), which might include prior interactions with the student, specialized
course-specific information (such as the curriculum), and information about the student’s learning style.</p>
        <p>These pre-interaction cognition states affect the initial behaviors displayed by both parties. The
initial receptiveness of the student may manifest itself in body language, while the directives and initial
knowledge mirror the same mechanism in the robot. It is worth noting that these pre-interactional
factors may carry weight across the entire interaction. The stages between the pre- and post-interactional
stages can be described as iterative. Just as in Buller and Burgoon’s model, these iterative stages can
occur simultaneously and in a feedback loop. For simplicity, we describe these stages as a series of
credibility checks and behavioral adjustments. The student might voluntarily or involuntarily change
displayed behavior based on the perceived credibility of the robot’s ability to transfer knowledge (or
other factors such as comfort). The robot, equipped to sense these changes, adjusts its behavior accordingly.
This might look like an emotionally intelligent robot detecting a change in the student’s emotional
state, which can be inferred from any number of sources (body language, voice modulation, facial
expression, etc.), and offering encouragement and motivation. Within these iterative steps the robot
continuously measures student engagement. While student engagement has some correlation with
effective knowledge transfer, it cannot be the only basis of measurement.</p>
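        <p>A single pass through these credibility checks and behavioral adjustments might be sketched as follows. The signal fields and thresholds here are illustrative assumptions chosen for exposition, not tuned values from our system.</p>

```python
from dataclasses import dataclass

@dataclass
class StudentSignals:
    """Multimodal cues the robot might sense each turn (illustrative fields)."""
    engagement: float          # 0..1, e.g. inferred from gaze and posture
    valence: float             # -1..1, e.g. from facial expression and voice modulation
    asked_clarification: bool  # explicit signal that delivery is not landing

def adjust_behavior(signals: StudentSignals, current_pace: float) -> dict:
    """One iteration of the credibility-check / behavioral-adjustment loop.

    The thresholds below are illustrative, not empirically calibrated.
    """
    actions = {"pace": current_pace, "encourage": False, "rephrase": False}
    if signals.valence < -0.3:
        # negative affect detected: offer encouragement and motivation
        actions["encourage"] = True
    if signals.asked_clarification or signals.engagement < 0.4:
        # perceived credibility of the knowledge transfer is slipping:
        # re-deliver the content another way and slow down
        actions["rephrase"] = True
        actions["pace"] = max(0.5, current_pace - 0.1)
    return actions
```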
        <p>We discern that the robot must be able to draw from different factors - some specific to
the nature of the learning material - to make an accurate estimate of effective knowledge transfer.
The post-interaction stage requires the robot to relay relevant information to the teacher. This will
include some permutation of: 1) any major switch in the modality of the student’s learning style (as in Kolb’s
model), and 2) overall engagement together with some measurement of how much learning is actually taking
place (that is, the level of effective knowledge transfer). Now that an overview of the entire
feedback process has been defined, let us dive back into the iterative stages.</p>
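        <p>The post-interaction relay to the teacher could be summarized in a small report structure. The field names and the weighting below are illustrative assumptions; in particular, the weighted combination reflects the point above that engagement correlates with, but cannot solely determine, effective knowledge transfer, so it is weighted against a more direct probe such as a short quiz.</p>

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SessionReport:
    """Post-interaction summary relayed to the teacher (illustrative fields)."""
    learning_style_switch: Optional[str]  # e.g. "Reflective -> Active", per Kolb's model
    mean_engagement: float                # 0..1 across the session
    transfer_estimate: float              # 0..1, combined from several factors

def estimate_transfer(engagement: float, quiz_score: float, w: float = 0.3) -> float:
    # Engagement alone cannot measure learning, so weight it lightly (w is an
    # illustrative assumption) against a direct knowledge probe.
    return w * engagement + (1 - w) * quiz_score
```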
        <p>It is clear that the social robot must have real-time adaptive content delivery capabilities that can
sense any change in student receptiveness, accurately infer perceived credibility, and modulate the content
delivery method accordingly. As we describe what the robot must be capable of, we are in essence
describing a system that provides the robot a sophisticated level of emotional intelligence. Such a
system could be readily applied not just in education, but in mental health and other dyadic scenarios.
The robot handles repetitive but dynamic tasks during these iterative stages, including basic
question-answering. As a result, teachers can devote their cognitive resources to more complex responsibilities:
learner-centered instruction and personalized interventions. In diverse classrooms, these robots can also
provide multilingual support, translating or explaining concepts in multiple languages, thus reducing
the cognitive load on teachers in diverse educational settings (e.g., multi-language or special needs).
The next section discusses our efforts towards the development of an emotionally intelligent system
with multi-modal perception, communication features, and affordances.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Current Developments and Challenges in AIED with Social Robots</title>
      <p>We selected SoftBank’s Pepper and NAO humanoid robots [20] for their human-like form and scale, with
Pepper’s 1.20-meter height providing suitable presence in crowded environments. We developed APIs
to enhance their AI and language capabilities for real-time operation, as illustrated below in Figure 2.</p>
      <p>The architecture centralizes perceived information to generate robot responses. While utilizing
standard processes like voice activity detection (VAD), we enhanced the system with speaker diarization
[21] and recognition to distinguish multiple speakers and track long-term student engagement. Robot
responses combine spoken LLM output with nonverbal gestures via native APIs. Long-term interactions
are stored in vector databases and knowledge graphs, with context-relevant information retrieved
through Retrieval Augmented Generation (RAG) [22].</p>
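      <p>A single turn through this architecture can be sketched as the following skeleton. Every function here is a stand-in stub with a placeholder return value; in the real system these stages are wired to VAD, speaker diarization [21], an LLM, RAG retrieval [22], and the robots’ native gesture APIs, and none of the names below are actual API calls.</p>

```python
def identify_speaker(segment: bytes) -> str:
    # Diarization/recognition stub [21]: the real system embeds the audio
    # segment and matches it to a known student profile.
    return "student-01"

def retrieve_context(speaker_id: str, utterance: str) -> list:
    # RAG stub [22]: the real system queries a vector database and
    # knowledge graph for context relevant to this speaker and utterance.
    return [f"{speaker_id} prefers worked examples"]

def generate_response(utterance: str, context: list) -> tuple:
    # LLM stub: the real system prompts an LLM with the utterance plus
    # retrieved context, returning spoken text and a gesture tag.
    return (f"Let's work through: {utterance}", "explain_gesture")

def handle_turn(segment: bytes, transcribe) -> tuple:
    """Orchestrate one perception-to-response turn (transcribe is injected)."""
    speaker = identify_speaker(segment)
    utterance = transcribe(segment)
    context = retrieve_context(speaker, utterance)
    speech, gesture = generate_response(utterance, context)
    return speaker, speech, gesture  # speech -> TTS, gesture -> native robot API
```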
      <p>Unlike existing applications that provide static information [23], our system processes dynamic
student-robot relationships. Visual information is contextualized using image-to-text Vision Language
Models (VLMs) to incorporate environmental context into interactions.</p>
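      <p>The VLM contextualization step amounts to captioning the current camera frame and injecting the caption into the language model’s prompt. The caption text and prompt template below are illustrative assumptions, not the exact formats used in our implementation.</p>

```python
def caption_scene(frame) -> str:
    # VLM stub: a real image-to-text model would describe the camera frame;
    # this canned caption is an illustrative placeholder.
    return "a student at a desk raising a hand; the whiteboard shows a fraction"

def build_prompt(utterance: str, scene: str) -> str:
    """Ground the LLM reply in what the robot currently sees
    (the template is an illustrative assumption)."""
    return (
        "Visual context: " + scene + "\n"
        "Student says: " + utterance + "\n"
        "Respond as a patient teaching assistant."
    )
```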
      <sec id="sec-3-1">
        <title>3.1. Challenges</title>
        <p>Key technical challenges include hardware constraints, model intelligence for real-time processing, and
the evolution from reactive to proactive communication [24, 25]. While ongoing work addresses these
technical challenges through AI optimization [26], temporal reasoning capabilities [27], and multimodal
processing advancements [28], our current implementations demonstrate the viability of AI-enhanced
robots in educational settings.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This paper explored the potential of integrating AI technologies with humanoid robots to create
responsive and adaptive learning environments. Despite technical challenges in real-time processing
and proactive communication, our proposed system architecture, combining NLP, computer vision, audio
processing, and nonverbal behavior, offers a practical approach to enhancing educational accessibility
and support. By addressing challenges such as cognitive load management, visual context processing,
and multimodal communication, these AI-enhanced robots can serve as assistants for teachers, adapting
to diverse learning styles and individual student needs.</p>
      <p>The Robot-Student Interaction model recognizes that each student brings their own unique spectrum
of social skills, communication styles, and interpersonal sensitivities to the interaction [29]. Students
actively participate in shaping the educational dialogue, contributing valuable information through
their distinct communication patterns. While the nature of this interaction varies across educational
levels—from primary school to university settings—each student possesses their own sophisticated set of
social and communicative competencies. This opens interesting research possibilities for understanding
how robot-student interactions might enhance or adapt to different communication styles, and how
robotic intervention could support students in developing their natural communication strengths
further.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly for grammar and spelling check. The
authors reviewed and edited the content as needed and take full responsibility for the publication’s
content.</p>
      <p>[10] D. Sobrín-Hidalgo, M. Á. González-Santamarta, Á. M. Guerrero-Higueras, F. J. Rodríguez-Lera,
V. Matellán-Olivera, Enhancing robot explanation capabilities through vision-language models:
a preliminary study by interpreting visual inputs for improved human-robot interaction, arXiv
preprint arXiv:2404.09705 (2024).
[11] A. C. Elkins, D. C. Derrick, The sound of trust: voice as a measurement of trust during interactions
with embodied conversational agents, Group Decision and Negotiation 22 (2013) 897–913.
[12] J. F. Nunamaker, D. C. Derrick, A. C. Elkins, J. K. Burgoon, M. W. Patton, Embodied conversational
agent-based kiosk for automated interviewing, Journal of Management Information Systems 28
(2011) 17–48.
[13] D. B. Buller, J. K. Burgoon, Interpersonal deception theory, Communication Theory 6 (1996)
203–242.
[14] D. T. Gilbert, R. E. Osborne, Thinking backward: Some curable and incurable consequences of
cognitive busyness, Journal of Personality and Social Psychology 57 (1989) 940.
[15] J. K. Burgoon, Expectancy violations theory, The International Encyclopedia of Interpersonal
Communication (2015) 1–9.
[16] D. A. Kolb, Experiential Learning: Experience as the Source of Learning and Development, FT Press,
2014.
[17] J. N. Harb, S. O. Durrant, R. E. Terry, Use of the Kolb learning cycle and the 4MAT system in
engineering education, Journal of Engineering Education 82 (1993) 70–77.
[18] H. Woo, G. K. LeTendre, T. Pham-Shouse, Y. Xiong, The use of social robots in classrooms: A
review of field-based studies, Educational Research Review 33 (2021) 100388.
[19] T. Belpaeme, J. Kennedy, A. Ramachandran, B. Scassellati, F. Tanaka, Social robots for education:
A review, Science Robotics 3 (2018) eaat5954.
[20] SoftBank Robotics, Pepper the humanoid and programmable robot, 2021.
[21] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, A. McCree, Speaker diarization using deep neural
network embeddings, in: 2017 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), IEEE, 2017, pp. 4930–4934.
[22] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances
in Neural Information Processing Systems 33 (2020) 9459–9474.
[23] G. Wilcock, K. Jokinen, Conversational AI and knowledge graphs for social robot interaction, in:
2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI), IEEE, 2022, pp.
1090–1094.
[24] Z. Wang, P. Reisert, E. Nichols, R. Gomez, Ain’t misbehavin’: using LLMs to generate expressive robot
behavior in conversations with the tabletop robot Haru, in: Companion of the 2024 ACM/IEEE
International Conference on Human-Robot Interaction, 2024, pp. 1105–1109.
[25] E. Nichols, D. Szapiro, Y. Vasylkiv, R. Gomez, I can’t believe that happened!: exploring expressivity
in collaborative storytelling with the tabletop robot Haru, in: 2022 31st IEEE International
Conference on Robot and Human Interactive Communication (RO-MAN), IEEE, 2022, pp. 59–59.
[26] G. Ioannides, A. Chadha, A. Elkins, Gaussian adaptive attention is all you need: Robust contextual
representations across multiple modalities, arXiv preprint arXiv:2401.11143 (2024).
[27] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong,
T. Yu, et al., PaLM-E: An embodied multimodal language model, arXiv preprint arXiv:2303.03378
(2023).
[28] C. Zhang, J. Chen, J. Li, Y. Peng, Z. Mao, Large language models for human-robot interaction: A
review, Biomimetic Intelligence and Robotics (2023) 100131.
[29] R. E. Riggio, Assessment of basic social skills, Journal of Personality and Social Psychology 51
(1986) 649.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zouhaier</surname>
          </string-name>
          ,
          <article-title>The impact of artificial intelligence on higher education: An empirical study</article-title>
          ,
          <source>European Journal of Educational Sciences</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>17</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wambsganss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Janson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Leimeister</surname>
          </string-name>
          ,
          <article-title>Enhancing argumentative writing with automated feedback and social comparison nudging</article-title>
          ,
          <source>Computers &amp; Education</source>
          <volume>191</volume>
          (
          <year>2022</year>
          )
          <fpage>104644</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Winkler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salovaara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Söllner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Leimeister</surname>
          </string-name>
          ,
          <article-title>Engaging learners in online video lectures with dynamically scaffolding conversational agents</article-title>
          ,
          <source>in: European Conference on Information Systems (ECIS)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Lewohl</surname>
          </string-name>
          ,
          <article-title>Exploring student perceptions and use of face-to-face classes, technology-enhanced active learning, and online resources</article-title>
          ,
          <source>International Journal of Educational Technology in Higher Education</source>
          <volume>20</volume>
          (
          <year>2023</year>
          )
          <fpage>48</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kurtz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kohen-Vacs</surname>
          </string-name>
          ,
          <article-title>Humanoid robot as a tutor in a team-based training activity</article-title>
          ,
          <source>Interactive Learning Environments</source>
          <volume>32</volume>
          (
          <year>2024</year>
          )
          <fpage>340</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Buchem</surname>
          </string-name>
          ,
          <article-title>Scaling-up social learning in small groups with robot supported collaborative learning (RSCL): Effects of learners' prior experience in the case study of planning poker with the robot NAO</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>4106</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <article-title>Social robots in education for long-term human-robot interaction: socially supportive behaviour of robotic tutor for creating robo-tangible learning environment in a guided discovery learning interaction</article-title>
          ,
          <source>ECS Transactions 107</source>
          (
          <year>2022</year>
          )
          <fpage>12389</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Buchem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bäcker</surname>
          </string-name>
          ,
          <article-title>Nao robot as scrum master: results from a scenario-based study on building rapport with a humanoid robot in hybrid higher education settings, Training, Education, and</article-title>
          <source>Learning Sciences</source>
          <volume>59</volume>
          (
          <year>2022</year>
          )
          <fpage>65</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Ravari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Law</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kulić</surname>
          </string-name>
          ,
          <article-title>Effects of an adaptive robot encouraging teamwork on students' learning</article-title>
          , in: 2021 30th IEEE International Conference on Robot &amp;
          <article-title>Human Interactive Communication (RO-MAN)</article-title>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>250</fpage>
          -
          <lpage>257</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>