<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>K. Holstein, B. M. McLaren, V. Aleven,
Co-designing a real-time classroom
orchestration tool to support teacher-ai
complementarity, Journal of Learning
Analytics</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.procs.2016.05.264</article-id>
      <title-group>
        <article-title>Sense the classroom: AI-supported education for a resilient new normal</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krist Shingjergji</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deniz Iren</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Corrie Urlings</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roland Klemke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Open University of the Netherlands 6419 AT Heerlen, The Netherlands Faculty of Cultural Sciences</institution>
          ,
          <addr-line>TH Köln, Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>6</volume>
      <issue>2019</issue>
      <fpage>767</fpage>
      <lpage>776</lpage>
      <abstract>
        <p>Following the COVID-19 pandemic, as the user-base of online synchronous communication systems skyrocketed, the shortcomings of synchronous online learning systems became more visible. Any attempt to overcome these shortcomings should be considered worthwhile due to the magnitude of potential impact. Improving the quality and addressing the shortcomings of online education is more important than ever. The goal of this multidisciplinary study that lies in the intersection of the fields of Education Science and Computer Science is to address a number of challenges of online education by incorporating AI. This study focuses on developing methods and means to ethically collect and use non-verbal cues of participants of online classrooms to assist teachers, students, and course coordinators by providing real-time and after-the-fact feedback of the students' learning-centered affective states.</p>
      </abstract>
      <kwd-group>
        <kwd>Technology enhanced learning</kwd>
        <kwd>learning-centered affective states</kwd>
        <kwd>affective computing</kwd>
        <kwd>synchronized learning</kwd>
        <kwd>artificial intelligence</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and motivation</title>
      <p>Online learning provides a means of
education to students who are unable, or find it
inconvenient, to participate in physical,
face-to-face classroom education. During the
COVID-19 pandemic, this limitation became
relevant for all students. Approximately 1.2
billion learners were affected by the closure of
schools at the time of the pandemic [1], and
educational institutions worldwide made a
mandatory transition to online/hybrid learning
[2]. As the utilization of online learning reached
unprecedented levels, the already-known
challenges of online education became
painfully visible for both students (e.g., feelings
of isolation [3]) and teachers (e.g., lack of
face-to-face interaction with the students [4]).</p>
      <p>Students’ learning experience and
performance are highly related to their
psychological, physiological, and emotional
states [5]. Teachers can notice when students
are distracted, confused, tired, etc., and have the
opportunity to adjust their teaching approach
accordingly [6], and choose appropriate
interventions to keep the learning experience of
the class optimal. However, many teachers who
have given an online lecture have experienced a
severe lack of insight into the
learning-centered affective states of students in the
classroom, thus missing opportunities to
improve the overall learning experience. This
also directly impacts individual students. In
online lectures, students are more prone to
distractions and use the Internet for purposes
unrelated to the educational activity [7]. The
lecturers cannot give timely feedback to guide
students’ attention since they do not observe the
students physically. As a result, the students are
left alone to manage their learning experience,
stay motivated, and keep up
during the educational activity. The underlying
cause of these challenges is the
limited set of communication modalities that video
conferencing technologies support.</p>
      <p>In this study, we build on Media Naturalness
Theory to examine the limitations of video
conferencing as a medium of communication
for online, synchronous education. Our
objective is to develop artificial intelligence
(AI) models that detect a multitude of
components of learning-centered affective
states (e.g., gestures, micro-expressions, and
macro-expressions) of the learners, to present
the aggregated information to the teacher and
the course coordinator in a privacy-protecting
manner, and to provide the individual information
to the students themselves.</p>
      <p>The remainder of this paper is structured as
follows. Section 2 sheds light on the
background and related work. Section 3 lays out
the details of the overall research methodology.
Section 4 highlights important discussion
points such as the theoretical and practical
implications, and ethical considerations.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and related work</title>
      <p>In this section we present the background
and related work that this study builds on.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1. Communication modalities and Media Naturalness Theory</title>
      <p>Human communication occurs in multiple
modalities such as voice, speech, facial
expressions, and body language [8]. One
important type of human communication is
non-verbal communication, which is the
conveying of information without the use of
words via non-verbal cues, i.e., facial
expressions and body language [9]. Video
conferencing platforms fall short in conveying
non-verbal cues among participants. The
shortcomings of video conferencing as a
communication medium can be analyzed and
improved based on the Media Naturalness
Theory (MNT). MNT describes the criteria to
assess the degree of naturalness of a
communication medium that partly relates to
the capability of transmitting body language,
facial expressions, and natural speech [10].
According to this theory, a reduction in a
medium’s naturalness may lead to a decrease in
learning effectiveness, and a potential increase
in ambiguity of the conveyed message [11].</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Learning-centered affective states</title>
      <p>Many studies that aim to detect the
relationship between online learning and
emotions by applying emotion recognition
techniques use the basic emotions, namely,
happiness, sadness, fear, disgust, anger, and
surprise [12]. A plethora of studies that report
an accurate mapping between facial expressions
and emotions exist in the literature [13], [14],
[15]. However, D’Mello [5] states that the
basic emotions are quite infrequent in the
context of learning with educational software,
which raises the need to focus on the
learning-centered affective states, such as
engagement, concentration, boredom, anxiety,
confusion, frustration, and happiness. In
contrast to emotion recognition, the mapping
between facial expressions and
learning-centered affective states has been severely
understudied [16].</p>
      <p>The observable non-verbal cues consist of
gestures and body postures (e.g., head-tilt, nod,
shake), micro-expressions (e.g., movement of
inner eyebrows and lips), other expressions
(e.g., smile, frown, confusion), and other
activities (e.g., note-taking, active-listening,
looking-away) [17]. The state-of-the-art
facial expression recognition (FER) and gesture
recognition (GR) models use Convolutional
Neural Network (CNN) and Recurrent Neural
Network (RNN) hybrid networks [18].
These models perform discrete/momentary
measurement (i.e., in short intervals), generally
on a single modality, and they are trained on
datasets in which the non-verbal cues are
mimicked by actors (i.e., not naturally
occurring). As previously mentioned, the
detection of constructs other than the six
universal emotions (sadness, happiness, fear,
anger, surprise, and disgust), such as learning-centered
affective states, does not have a rich literature.
However, the collection of high-quality data for
the recognition of learning-centered affective
states has been the subject of several studies,
which have certain important limitations, for
instance, focusing only on game-based
interfaces [19], being specific to certain ethnic
groups [20], and having a limited target
affective set such as the level of engagement on
a scale [21] or the lack of interest and
boredom [22].</p>
      <p>In this study, we will bridge this gap by
improving the CNN-RNN hybrids
architecturally by introducing attention layers,
formulating fitting objective functions, fusing
data from multiple modalities, and applying
transfer learning to train models with the
multimodal data collected from synchronous online
education settings.</p>
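      <p>As an illustration only, the sketch below shows one common way an attention layer can pool the per-frame features produced by a CNN-RNN hybrid into a single clip-level vector: softmax-normalized scores decide how much each frame contributes. The function names softmax and attention_pool, the dot-product scoring, the query vector, and the toy numbers are assumptions made for this example; they are not the architecture developed in this study.</p>
      <preformat preformat-type="code">
```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frame_features, query):
    """Pool per-frame feature vectors into one clip-level vector.

    frame_features: list of equal-length lists (hypothetical RNN outputs).
    query: hypothetical learned vector; dot-product scoring is an assumption.
    """
    scores = [sum(f * q for f, q in zip(frame, query))
              for frame in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    pooled = [sum(w * frame[i] for w, frame in zip(weights, frame_features))
              for i in range(dim)]
    return pooled, weights

# Toy values, not real data: three frames with two features each.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(frames, query=[1.0, 1.0])
```
</preformat>
      <p>In this toy input the third frame has the highest score and therefore receives the largest weight; in a trained model the query vector would be learned together with the rest of the network.</p>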
    </sec>
    <sec id="sec-5">
      <title>3. Methodology</title>
      <p>In this section we explain the overall
methodology that is proposed for this research.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1. Research model</title>
      <p>Active and engaged learning is an important
model in online higher education. Therefore,
this study aims at addressing the motivational
and emotional side of online education, by
providing information that can assist educators
to refine the educational activities that they
have devised. Our aspiration is to utilize
theories of learning, motivation, and emotion in
combination to (1) define the relationship
between learning-centered affective states of
students and observable non-verbal cues, (2)
develop specialized multi-modal AI algorithms
for the recognition of learning-centered
affective states, and (3) design tools to present
this information in an actionable way for
teachers and students to improve the learning
process while respecting the privacy of all
participants (Figure 1).</p>
      <p>Thus, the research questions of our studies
are as follows:
1. Which are the specific non-verbal feedback needs (e.g., facial expressions) of teachers and students in online lectures?
2. How can we automatically detect non-verbal cues and translate them to learning-centered affective states of multiple participants in online, synchronous, educational activities?
3. How can we present this information to teachers in real-time so that they can take actions to positively influence the learning-centered affective states of the students?
4. How can we provide students with this information so that learning-centered affective states are positively influenced?
5. How can we design a system that is ethically sound, that respects privacy concerns and keeps all collected data secure?</p>
      <p>In this study, we will employ Design-Based
Research (DBR) and experiments throughout
multiple iterations (Figure 2). Teachers and
students will be involved in focus groups and
co-designing of prototypes [23]. The DBR
iterations consist of literature study,
requirements elicitation, participatory design,
and evaluation of the interventions (i.e.,
integrated AI models) in pilot studies of online
learning. The AI models will be developed
through experimentation cycles which
comprise data collection, annotation, algorithm
development, and model training and
evaluation. We will collect data from
public-domain video repositories and online lecture
sessions recorded by us with the informed
consent of participants. The data will be
annotated by multiple experts in terms of
observed non-verbal cues. Subsequently, we
will develop algorithms and train and test the FER-GR
and learning-centered affective state
recognition AI models on multiple datasets to
ensure generalizability. We will rely on metrics
that are commonly used in machine learning,
i.e., precision, recall, and F1 measure, to
evaluate the performance of our models. Data
management will be conducted in line with the
FAIR data principles [24].</p>
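      <p>For reference, the three metrics named above can be computed for one affective-state label from expert annotations and model predictions as in the following minimal sketch. The function name precision_recall_f1, the label strings, and the toy annotations are hypothetical and serve only to illustrate the standard definitions.</p>
      <preformat preformat-type="code">
```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators when a label never occurs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy expert annotations and model outputs (invented for illustration).
y_true = ["confusion", "boredom", "confusion", "engagement", "confusion"]
y_pred = ["confusion", "confusion", "boredom", "engagement", "confusion"]
p, r, f1 = precision_recall_f1(y_true, y_pred, positive="confusion")
```
</preformat>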
      <p>We envision the system to provide certain
information in real-time and after-the-fact to
various stakeholders for different purposes
(Table 1). The information flow is targeted at
specific educational purposes for each party
involved, with short-term and long-term
educational benefits.</p>
    </sec>
    <sec id="sec-7">
      <title>4. Discussion</title>
      <p>In this section, we discuss several important
aspects of this study including the theoretical
and practical implications, the privacy concerns
as well as the limitations.</p>
    </sec>
    <sec id="sec-8">
      <title>4.1. Theoretical and practical implications</title>
      <p>The outcomes of our experiments will
potentially allow us to gain a deeper
understanding of how learning-centered
affective states are indicated by observable
non-verbal cues, and how these states can be
related to an effective learning experience. Our
results will also contribute to the Media
Naturalness Theory by extending it to cover
widely used video conferencing platforms and
tailoring it to education scenarios.</p>
      <p>The practical outcome of this study will be
an analytical platform that is integrated into the video
conferencing clients of the students and the
teachers. The platform will be able to provide
feedback both in real-time and after-the-fact
for all the involved actors in the course:
students, teachers, and course coordinators
(Table 1). In real-time, the platform will
provide the teachers with aggregated
information regarding the learning-centered
affective states of the students. This
information will give the teachers the
opportunity to respond in different ways, such
as changing the teaching style and/or intervening
in the course content flow. Students are going
to receive information regarding their own
learning-centered affective states, which they can use to
self-regulate and be active and engaged.
The real-time feedback will be designed so that
it can be used
while taking part in the educational activity. In
the longer term, this aggregated
information will be useful for the teachers and
the course coordinators, as it would play the role
of an evidence-based course evaluation which
can be used for future improvement of delivery
style and course design by the
teacher and the course coordinator, respectively.</p>
    </sec>
    <sec id="sec-9">
      <title>4.2. Privacy</title>
      <p>We acknowledge the privacy-sensitive
nature of this study. To protect the privacy of
students, and to prevent a possible misuse of the
technology, e.g., using the obtained information
to evaluate students, we design core
privacy-preserving measures and shape our research
around them. Firstly, all data collection and
experimentation will be voluntary and with the
informed consent of the participants. The
ethics board will be consulted prior to all data
collection phases. The training data will be
collected anonymously with no possibility to
link to individuals. Secondly, the designed
system will keep sensitive individual data (e.g.,
video) on individuals’ computers. We will use
a virtual webcam that implements AI models
and analyzes video data on client computers.
This feature will also allow students to keep
their camera off (use avatars or nothing at all)
while still benefiting from the system. Only the
processed and anonymized data (i.e., numerical
representations of non-verbal cues) will be
transferred, and the teacher will only be
provided with information that is aggregated at
classroom-level. Finally, this study solely aims
at developing a method for improving the
quality of online education, and not at providing a means of
individual assessment of the students or the
teachers. We are confident that this
privacy-preserving design will not allow any misuse of
the system.</p>
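      <p>To make the data flow described above more concrete, the following sketch shows one way per-student state estimates could be aggregated so that the teacher only sees classroom-level proportions. It is an illustration under stated assumptions, not the implementation of the system: the function aggregate_classroom, the min_group_size threshold, and the label names are hypothetical, and suppressing output for very small groups is an extra safeguard that the text above does not specify.</p>
      <preformat preformat-type="code">
```python
from collections import Counter

def aggregate_classroom(client_reports, min_group_size=5):
    """Return the share of each reported state across the classroom.

    client_reports: one label per participating student, computed on the
    student's own computer; no video or identifier is part of the input.
    min_group_size: hypothetical threshold below which nothing is reported,
    so that a very small group cannot be traced back to individuals.
    """
    total = len(client_reports)
    if min_group_size > total:
        return None  # too few participants to report an aggregate
    counts = Counter(client_reports)
    return {state: count / total for state, count in counts.items()}

# Toy labels sent by six clients (invented for illustration).
reports = ["engaged", "engaged", "confused", "engaged", "bored", "confused"]
summary = aggregate_classroom(reports)
```
</preformat>
      <p>With the toy input, half of the class is reported as engaged, while a call with only three reports returns nothing at all.</p>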
    </sec>
    <sec id="sec-10">
      <title>4.3. Limitations</title>
      <p>The source of the data in this study will be
the participants’ cameras, which results in two
important limitations. First, we cannot observe
the entire environment of a student, thus, it is
not possible to differentiate whether the
observed non-verbal cues of an individual
student are the result of an event in the
classroom or an off-task activity. Second, in
online classrooms, students are in control of
their cameras and may refuse to turn them on
even when the proposed privacy-preserving
methods are in place. In that case, the proposed
method is not applicable.</p>
    </sec>
    <sec id="sec-11">
      <title>5. References</title>
      <p>[1] Education: From disruption to recovery, 2021. URL: https://en.unesco.org/covid19/educationresponse.</p>
      <p>[2] S. Dhawan, Online Learning: A Panacea in the Time of COVID-19 Crisis, Journal of Educational Technology Systems 49 (2020) 5–22. doi: 10.1177/0047239520934018.</p>
      <p>[3] M. Alawamleh, L. M. Al-Twait, G. R. Al-Saht, The effect of online learning on communication between instructors and students during Covid-19 pandemic, Asian Education and Development Studies (2020). doi: 10.1108/AEDS-06-2020-0131.</p>
      <p>[4] S. Gurung, Challenges faced by teachers in online teaching during Covid-19 pandemic, The online journal of distance education and e-Learning 9 (2021).</p>
      <p>[5] S. D’Mello, A selective meta-analysis on the relative incidence of discrete affective states during learning with technology, Journal of Educational Psychology 105 (2013) 1082–1099. doi: 10.1037/a0032674.</p>
      <p>[6] K. Bahreini, R. Nadolski, W. Westera, Towards multimodal emotion recognition in e-learning environments, Interactive Learning Environments (2016) 590–605. doi: 10.1080/10494820.2014.908927.</p>
      <p>[7] A. Lepp, J. E. Barkley, A. C. Karpinski, S. Singh, College Students’ Multitasking Behavior in Online Versus Face-to-Face Courses, SAGE Open 9 (2019). doi: 10.1177/2158244018824505.</p>
      <p>[8] C. Jewitt, J. Bezemer, K. O’Halloran, Introducing multimodality, 1st ed., Routledge, London, 2016. doi: 10.4324/9781315638027.</p>
      <p>[9] APA Dictionary of Psychology, 2020. URL: https://dictionary.apa.org/nonverbal-communication.</p>
      <p>[10] N. Kock, Media naturalness theory: human evolution and behaviour towards electronic communication technologies, in: S. Craig Roberts (Ed.), Applied evolutionary psychology, Oxford University Press Inc., New York, 2012, pp. 381–398. doi: 10.1093/acprof:oso/9780199586073.001.0001.</p>
      <p>[11] O. Weiser, I. Blau, Y. Eshet-Alkalai, How do medium naturalness, teaching-learning interactions and Students’ personality traits affect participation in synchronous E-learning?, Internet and Higher Education 37 (2018) 40–51. doi: 10.1016/j.iheduc.2018.01.001.</p>
      <p>[12] P. Ekman, An Argument for Basic Emotions, Cognition and Emotion 6 (1992) 169–200. doi: 10.1080/02699939208411068.</p>
      <p>[13] R. Reisenzein, M. Studtmann, G. Horstmann, Coherence between emotion and facial expression: Evidence from laboratory experiments, Emotion Review 5 (2013) 16–23. doi: 10.1177/1754073912457228.</p>
      <p>[14] M. Wegrzyn, M. Vogt, B. Kireclioglu, J. Schneider, J. Kissler, Mapping the emotional face. How individual face parts contribute to successful emotion recognition, PLoS ONE 12 (2017). doi: 10.1371/journal.pone.0177239.</p>
      <p>[15] P. Tarnowski, M. Kołodziej, A. Majkowski, R. J. Rak, Emotion recognition using facial expressions, Procedia Computer Science 108 (2017) 1175–1184. doi: 10.1016/j.procs.2017.05.025.</p>
      <p>[16] M. A. A. Dewan, M. Murshed, F. Lin, Engagement detection in online learning: a review, Smart Learning Environments 6 (2019). doi: 10.1186/s40561-018-0080-z.</p>
      <p>[17] D. Umnia Soraya, K. Candra Kirana, S. Wibawanto, H. Wahyu Herwanto, C. Wijaya Kristanto, Non-Verbal Communication Behavior of Learners on Online-based Learning, in: Proceedings of the 2nd International Conference on Vocational Education and Training (ICOVET 2018), 2019, pp. 4–6. doi: 10.2991/icovet-18.2019.2.</p>
      <p>[18] M. Sharma, D. Ahmetovic, L. A. Jeni, K. M. Kitani, Recognizing Visual Signatures of Spontaneous Head Gestures, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 400–408. doi: 10.1109/WACV.2018.00050.</p>
      <p>[19] N. Bosch, S. D’Mello, R. Baker, J. Ocumpaugh, V. Shute, M. Ventura, L. Wang, W. Zhao, Automatic detection of learning-centered affective states in the wild, in: Proceedings of the 20th International Conference on Intelligent User Interfaces (IUI '15), Association for Computing Machinery, New York, 2015, pp. 379–388. doi: 10.1145/2678025.2701397.</p>
      <p>[20] T. S. Ashwin, R. M. R. Guddeti, Affective database for e-learning and classroom environments using Indian students’ faces, hand gestures and body postures, Future Generation Computer Systems 108 (2020) 334–348. doi: 10.1016/j.future.2020.02.075.</p>
      <p>[21] J. Whitehill, Z. Serpell, Y. C. Lin, A. Foster, J. R. Movellan, The faces of engagement: Automatic recognition of student engagement from facial expressions, IEEE Transactions on Affective Computing 5 (2014) 86–98. doi: 10.1109/TAFFC.2014.2316163.</p>
      <p>[22] L. B. Krithika, G. G. Lakshmi Priya, Student Emotion Recognition System (SERS) for e-learning Improvement Based on Learner Concentration Metric, in Procedia Computer Science 85</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>