<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Affect Recognition for Adaptive Intelligent Tutoring Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruth Janning</string-name>
          <email>janning@ismll.uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlotta Schatten</string-name>
          <email>schatten@ismll.uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lars Schmidt-Thieme</string-name>
          <email>schmidt-thieme@ismll.uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim</institution>
          ,
          <addr-line>Marienburger Platz 22, 31141 Hildesheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim</institution>
          ,
          <addr-line>Marienburger Platz 22, 31141 Hildesheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Performance prediction and task sequencing in traditional adaptive intelligent tutoring systems need information gained from expert and domain knowledge. In a former work a new efficient task sequencer based on a performance prediction system was presented, which only needs former performance information but not the expensive expert and domain knowledge. In this paper we aim to support this approach by automatically gained multimodal input, for instance speech input from the students. Our proposed approach extracts features from this multimodal input and applies an automatic affect recognition method to these features. The recognised affects shall finally be used to support the mentioned task sequencer and its performance prediction system. Consequently, in this paper we (1) propose a new approach for supporting task sequencing and performance prediction in adaptive intelligent tutoring systems by affect recognition applied to multimodal input, (2) present an analysis of appropriate features for affect recognition extracted from students' speech input and show the suitability of the proposed features for affect recognition for adaptive intelligent tutoring systems, and (3) present a tool for data collection and labelling which helps to construct an appropriate data set for training the desired affect recognition approach.</p>
      </abstract>
      <kwd-group>
        <kwd>multimodal input</kwd>
        <kwd>affect recognition</kwd>
        <kwd>feature analysis</kwd>
        <kwd>speech</kwd>
        <kwd>adaptive intelligent tutoring systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Intelligent tutoring systems support
students, for instance in learning fractional arithmetic. The
main advantages of intelligent tutoring systems are the
possibility for a student to practice at any time, as well as the
possibility of adaptivity and individualisation for a single
student. An adaptive intelligent tutoring system possesses
an internal model of the student and a task sequencer which
decides which tasks are shown to the student, and in which order.
Originally, the task sequencing in adaptive intelligent
tutoring systems is done using information gained from expert
and domain knowledge and logged information about the
performance of students in former exercises. In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] a new
efficient sequencer based on a performance prediction
system was presented, which only uses former performance
information from the students to sequence the tasks and does
not need the expensive expert and domain knowledge. This
approach applies the machine learning method matrix
factorization (see e.g. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) for performance prediction to former
performance information. Subsequently, it uses the output
of the performance prediction process to sequence the tasks
according to the theory of Vygotsky's Zone of Proximal
Development [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. That is, the sequencer chooses the next task
in order to neither bore nor frustrate the student; in other
words, the next task should be neither too easy nor too hard for
the student.
      </p>
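      <p>To illustrate the prediction step, the following minimal sketch shows how a matrix factorization model predicts a student's performance on a task as the dot product of latent factor vectors. It is an illustration only, not the implementation of [12]; the sizes, the factor values and the 0.5 target value are invented.</p>
      <preformat>
# Minimal sketch of matrix-factorization-based performance prediction
# (illustrative only; sizes, factor values and the target are invented).
import numpy as np

rng = np.random.default_rng(0)
n_students, n_tasks, k = 10, 20, 3     # k latent dimensions

W = rng.normal(size=(n_students, k))   # latent student factors
H = rng.normal(size=(n_tasks, k))      # latent task factors

def predict_performance(student, task):
    """Predicted performance = dot product of the latent factor vectors."""
    return float(W[student] @ H[task])

# A sequencer in the spirit of the Zone of Proximal Development would pick
# a next task whose predicted performance is neither too high nor too low.
candidates = [(t, predict_performance(0, t)) for t in range(n_tasks)]
print(min(candidates, key=lambda ts: abs(ts[1] - 0.5)))   # closest to target
      </preformat>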
      <p>
        In this paper we propose to support the task sequencer and
performance prediction system of the approach in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] in a
new way, by additionally gathering and processing
multimodal information automatically. One part of this multimodal
information, which is investigated in this paper, is the speech input
from the students interacting with the intelligent tutoring
system while solving tasks. A further part will be the typed
input or mouse click input from the students, which will be
reported in upcoming works. The approach proposed in this
paper extracts features from the mentioned multimodal
information and applies an automatic affect recognition
method to these features. The output of the affect recognition
method indicates whether the last task was too easy, too hard or
appropriate for the student. This information matches the
theory of Vygotsky's Zone of Proximal Development, hence
it is obviously suitable for supporting the performance
prediction system and task sequencer of the approach in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
However, for the proposed approach we need a large amount
of labelled data. For this reason we developed a tutoring tool
which (a) records students' speech input as well as typed
input and mouse click input and (b) allows the students to
label themselves how difficult they perceived the shown
tasks. This tool is presented in the second part of this
paper and will be used to conduct further studies to gain the
desired labelled data.
      </p>
      <p>The main contributions of this paper are: (1) presentation
of a new approach for supporting performance prediction
and task sequencing in adaptive intelligent tutoring systems
by affect recognition on multimodal input, (2) identification
and analysis of appropriate and statistically significant
features for the presented approach, and (3) presentation of a
new tutoring tool for multimodal data collection and
self-labelling to gain automatically labelled data for training
appropriate affect recognition methods.</p>
      <p>In the following, we first present some preliminary
considerations along with the state of the art in section 2.
Subsequently, we describe in section 3 the real data set used
for the feature analysis and investigate in section 4 the
correlation between students' affects and their
performance in this data set. In section 5 we propose and analyse
appropriate features for affect recognition, and in section 6 we
explain how to support performance prediction and task
sequencing in intelligent tutoring systems by affect
recognition applied to multimodal input. Before we conclude, we
describe in section 7 the mentioned tool for multimodal
data collection and self-labelling.</p>
    </sec>
    <sec id="sec-2">
      <title>2. PREPARATION AND RELATED WORK</title>
      <p>Before an automatic affect recognition approach can be
applied, one has to clarify three things: (1) what kind of
features shall be used, (2) what kind of classes shall be used and
(3) which instances shall be mapped to features and labelled
with the class labels. After deciding which features, classes
and instances shall be considered, one can apply affect
recognition methods to these input data. In the following
subsections we present possible features, classes, instances and
methods for affect recognition supporting performance
prediction and task sequencing in adaptive intelligent tutoring
systems, along with the state of the art.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Features</title>
      <p>
        The first step before applying automatic affect recognition is
to identify useful features for this process. For the purpose
of recognising affect in speech one can use two different kinds
of features ([
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]): acoustic and linguistic features. Further,
one can distinguish linguistics (like n-grams and bag-of-words)
and disfluencies (like pauses). If linguistic features are used,
a transcription or speech recognition process has to be
applied to the speech input before affect recognition can be
conducted. Subsequently, approaches from the field of
sentiment classification or opinion mining (see e.g. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) can be
applied to the output of this process. However, the methods
of this field have to be adjusted to be applicable to speech
instead of written statements.
      </p>
      <p>
        Another possibility for speech features is to use disfluency
features, as was done in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for expert
identification. The advantage of using such features is that,
instead of a full transcription or speech recognition approach,
only, for instance, a pause identification has to be applied
beforehand. That means that one does not inherit the error of
the full speech recognition approach. Furthermore, these
features do not depend on the students using
words related to affects. For using this kind of features one
has to investigate which particular features are suitable for
the special task of affect classification in adaptive intelligent
tutoring systems. Because of the mentioned advantage of
disfluency features, in this work we focus on features
extracted from information about speech pauses as one part
of the multimodal input for affect recognition.
      </p>
      <p>
        As mentioned in the introduction, the other part of the
multimodal input will be features gained from
information about typed input or mouse click input from the
students. This kind of features is similar to the keystroke
dynamics features used in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] emotional states were
identified by analysing the rhythm of the typing patterns of
persons on a keyboard.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Classes</title>
      <p>
        The second step before applying automatic affect
recognition is to define the classes corresponding to the emotions and
affective states which shall be recognised by the used
affect recognition approach. According to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] it is
possible to recognise in intelligent tutoring systems students'
affects like, for instance, confusion, frustration, boredom and
flow. As mentioned above, we want to use the students'
behaviour information gained from speech and from typed
input or mouse click input for supporting the performance
prediction system and task sequencer of the approach in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
which is based on the theory of Vygotsky's Zone of Proximal
Development [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. That means that the goal is to neither
bore the student with too easy tasks nor to frustrate him
with too hard tasks, but to keep him in the Zone of Proximal
Development. Accordingly, we want to use the output of the
automatic affect recognition to get an answer to the question
"Was this task too easy, too hard or appropriate for the
student?", or in other words we want to find out whether the student
felt under-challenged, over-challenged or in a flow.
However, the mapping between confusion, frustration,
boredom and under-challenged, over-challenged is not
unambiguous, as one can infer e.g. from the studies mentioned in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
Hence, instead of the above mentioned affect classes
we will use three other classes for supporting performance
prediction and task sequencing by automatic affect recognition:
under-challenged, over-challenged and flow. One could
summarise these classes as perceived task-difficulty classes, as we
aim to recognise the individually perceived task-difficulty from
the view of the student.
      </p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Instances</title>
      <p>
        The third step before applying automatic affect recognition
is deciding which instances shall be mapped to features and
labelled with the class labels. If the goal of the affect
recognition is to provide a student with motivation or hints according
to his affective state, like e.g. in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], then instances can be
utterances. For supporting performance prediction and task
sequencing by affect recognition, instead, one needs at the end
of a task the information whether the task overall was too easy,
too hard or appropriate for the student. The reason is that
this information shall help to choose the next task shown
to the student. Hence, an instance for supporting
performance prediction and task sequencing by affect recognition
has to be, instead of an utterance, the whole speech input of
a student for one task.
      </p>
    </sec>
    <sec id="sec-6">
      <title>2.4 Methods</title>
      <p>
        The possible methods for automatic affect recognition
depend on the kind of features used as input. As
mentioned above, for speech we distinguish two kinds of features:
linguistic features and disfluencies. Linguistic features are
gained by a preceding speech recognition process and can
be processed by methods coming from the areas of sentiment
analysis and opinion mining ([
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). Especially methods from
the field of opinion mining on microposts seem to be
appropriate if linguistic features are considered. State-of-the-art
approaches in opinion mining on microposts use methods
based, for instance, on optimisation approaches ([
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) or Naive
Bayes ([
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]).
      </p>
      <p>
        The process of gaining disfluencies like pauses differs
from the full speech recognition process. For extracting, for
instance, pauses, usually an energy threshold on the decibel
scale is used, as in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], or an SVM is applied for pause
classification on acoustic features, as in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Appropriate
state-of-the-art methods for automatic emotion and affect
recognition on disfluency features, as well as on features from
information about typed input or mouse click input, are,
as proposed e.g. in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], classification methods like
artificial neural networks, SVMs, decision trees or ensembles
of those.
      </p>
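      <p>As a concrete illustration of this method family, the following minimal sketch trains an SVM on disfluency features to predict the perceived task-difficulty classes of section 2.2. It is a sketch under invented data, not a reproduction of [13] or [6]; the feature values and labels are placeholders.</p>
      <preformat>
# Minimal sketch: SVM classification on disfluency features
# (feature values and labels are invented placeholders).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One row per (student, task) instance; columns are disfluency features,
# e.g. pause/speech ratio and maximal pause length in seconds.
X = np.array([[0.8, 4.2], [0.2, 1.1], [0.5, 2.0], [0.9, 5.0]])
y = np.array(["over-challenged", "under-challenged", "flow", "over-challenged"])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.predict([[0.7, 3.5]]))   # predicted perceived task-difficulty
      </preformat>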
    </sec>
    <sec id="sec-7">
      <title>3. REAL DATA SET</title>
      <p>After identifying features, classes, instances and methods
for affect recognition for supporting performance prediction
and task sequencing as above, one can collect data for a
concrete feature analysis and for training the chosen affect
classification method. We conducted a study in which the
speech and actions of ten German students aged 10 to 12
were recorded and the students' affective states as well as
the perceived task-difficulties were reported. The labelling
of these data was done on the one hand concurrently by
the tutor and on the other hand retrospectively by a second
reviewer. Furthermore, a labelling per exercise (consisting
of several subtasks) and an overall labelling per student as
an aggregation of the labels per exercise was done. During
the study a paper sheet with fraction tasks was shown to
the students and they were asked to paint (with the
software Paint) and explain their observations and answers. We
made a screen recording to capture the painting of the
students and an acoustic recording to capture the speech of the
students. The screen recordings were used for the
retrospective annotation. The speech recordings shall be used to gain
the input for affect recognition. The mentioned typed input
and mouse click input information we will collect and
investigate in further studies with the self-labelling and multimodal
data collection tutoring tool described in section 7.1. In this
paper we focus on speech features, and hence in section 5 we
will propose and analyse possible features extracted from
speech pauses. But first, in the following section 4, we will
investigate the correlation between perceived task-difficulty
labels and the performance of the students in the real data
set.</p>
    </sec>
    <sec id="sec-8">
      <title>4. CORRELATION OF PERCEIVED TASK</title>
    </sec>
    <sec id="sec-9">
      <title>DIFFICULTY LABELS AND SCORE</title>
      <p>Before we present speech features for recognising perceived
task-difficulty, we want to show that there is a correlation
between the proposed perceived task-difficulty labels and
the performance of the students, to underline the
suitability of supporting performance prediction and task
sequencing by the proposed affect recognition approach. Hence,
we mapped the overall perceived task-difficulty labels to
the overall score of the students (see figure 1). For this
mapping we encoded the different overall perceived
task-difficulty class labels as follows:</p>
      <list list-type="simple">
        <list-item><p>0 = over-challenged</p></list-item>
        <list-item><p>1 = over-challenged/flow</p></list-item>
        <list-item><p>2 = flow</p></list-item>
        <list-item><p>3 = flow/under-challenged</p></list-item>
        <list-item><p>4 = under-challenged</p></list-item>
      </list>
      <p>The overall score of a student i is computed as nci/nti (1),
where nci is the number of correctly solved tasks of student
i and nti is the number of tasks shown to student i. In figure
1 one can see that there is a clear correlation between
perceived task-difficulty labels and score. To substantiate this
observation we applied a statistical test by conducting a
linear regression and measuring the p-value, indicating the
statistical significance, as well as the R² and Adjusted R² values,
indicating how well the regression line approximates the
real data points. This approach delivers a p-value of 0.0027,
an R² value of 0.6966, and an Adjusted R² value of 0.6586.
The small p-value indicates strong statistical significance.
The significant correlation between perceived task-difficulty
labels and scores, which represent the performance,
indicates that it makes sense to support performance prediction
and task sequencing by perceived task-difficulty classification.</p>
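      <p>The reported statistics can be computed with a standard ordinary least squares fit; the following minimal sketch shows the procedure, with invented label/score pairs in place of the study data.</p>
      <preformat>
# Minimal sketch of the statistical test: linear regression of score on
# the encoded labels, reporting p-value, R^2 and Adjusted R^2
# (the data points are invented, not the study data).
import numpy as np
import statsmodels.api as sm

labels = np.array([0, 1, 1, 2, 2, 3, 3, 4, 4, 2])   # encoded labels
scores = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.9, 0.8, 0.5])

model = sm.OLS(scores, sm.add_constant(labels)).fit()   # OLS fit

print(model.f_pvalue)       # p-value of the regression
print(model.rsquared)       # R^2
print(model.rsquared_adj)   # Adjusted R^2
      </preformat>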
    </sec>
    <sec id="sec-10">
      <title>5. SPEECH FEATURE ANALYSIS</title>
      <p>
        The features we propose and analyse in this section are
gained from speech pauses. Hence, first one has to
identify pauses within the speech input data. The easiest
way is to define a threshold on the decibel scale, as done
e.g. in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For our preliminary study of the data we also
used such a threshold, which we adjusted by hand. More
explicitly, we extracted the amplitudes of the sound files and
computed the decibel values. Subsequently, we investigated
which decibel values belong to speech and which ones to
pauses (see figure 2). For larger data sets and in the application
phase later on, one has to learn the distinction between
speech and pauses automatically, by either learning the
threshold or training an SVM which classifies speech and pauses.
      </p>
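      <p>A minimal sketch of such a hand-tuned threshold is given below; the 50 ms frame length, the -35 dB threshold (relative to the peak) and the mono WAV input are our own illustrative assumptions, not the study's exact settings.</p>
      <preformat>
# Minimal sketch of threshold-based pause detection on the decibel scale
# (assumptions: mono WAV input; 50 ms frames; hand-tuned -35 dB threshold).
import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("recording.wav")   # hypothetical file name
samples = samples.astype(np.float64)

frame = int(0.05 * rate)                        # 50 ms analysis frames
n = len(samples) // frame
rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))

db = 20 * np.log10(rms / np.max(rms) + 1e-12)   # decibels relative to peak
is_pause = db &lt; -35.0                           # frames below the threshold
      </preformat>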
    </sec>
    <sec id="sec-11">
      <title>5.1 Single Feature Analysis</title>
      <p>Before we can introduce the features we want to investigate,
we have to define some measurements:</p>
      <list list-type="simple">
        <list-item><p>m: number of students</p></list-item>
        <list-item><p>pi: total length of pauses of student i</p></list-item>
        <list-item><p>si: total length of speech of student i</p></list-item>
        <list-item><p>npi: number of pause segments of student i</p></list-item>
        <list-item><p>nsi: number of speech segments of student i</p></list-item>
        <list-item><p>pi(x): xth pause segment of student i</p></list-item>
        <list-item><p>si(y): yth speech segment of student i</p></list-item>
        <list-item><p>nti: number of tasks shown to student i</p></list-item>
        <list-item><p>nci: number of correctly solved tasks by student i</p></list-item>
        <list-item><p>overall score of student i: nci/nti</p></list-item>
      </list>
      <p>Our data set consists of acoustic recordings from m students,
each of whom saw nti tasks and solved nci tasks correctly.
The overall score of a student i in this case is the number
of correctly solved tasks nci divided by the number of seen
tasks nti. After applying the above mentioned threshold to
the data, we get for each student i the total length of pauses
pi and the total length of speech si in his acoustic recording.
Furthermore, we can count connected pause and speech
segments to get the number of pause segments npi and speech
segments nsi of a student i. The xth pause segment is then
pi(x) and the yth speech segment si(y). By means of these
measurements and their combinations we can create a set of
features useful for affect recognition supporting performance
prediction and task sequencing:</p>
      <sec id="sec-11-1">
        <title>Ratio between pauses and speech ( psii )</title>
        <p>Frequency of speech pause changes ( maxnjp(ni+pjn+sinsj ) )
Percentage of pauses of input speech data ( (pip+isi) )</p>
        <sec id="sec-11-1-1">
          <title>Length of maximal pause segment (maxx(pi(x)))</title>
        </sec>
      </sec>
      <sec id="sec-11-2">
        <title>Length of average pause segment ( Px pi(x) )</title>
        <p>npi</p>
        <sec id="sec-11-2-1">
          <title>Length of maximal speech segment (maxy(si(y)))</title>
          <p>Length of average speech segment ( Pynssii(y) )
Average number of seconds needed per task ( (pi+si) )
nti
        The ratio between the total length of pauses and the total
length of speech indicates whether one of them is notably larger
than the other, i.e. whether the student paused much more
than he spoke or vice versa. The frequency
of speech and pause segment changes indicates whether there are
many short speech and pause segments or just a few long
ones; it is normalised by dividing by the maximal sum
of pause and speech segments over all students. From the
percentage of pauses one can see whether the total pause length
was much larger than the total speech length, i.e. whether the student
did not speak much but was rather thinking silently. The
length of the maximal pause or speech segment indicates whether there
was e.g. a very long pause segment in which the student was
thinking silently, or a very long speech segment in which the
student was in a speech flow. The length of the average pause
or speech segment gives us an idea of how much, on average,
the student was in a silent thinking phase or a speech flow.
The average number of seconds needed per task indicates
how long a student on average needed to solve a task.</p>
      <p>[Table 2: best feature combinations (number of features, features, p-value), with combinations of 6, 5, 4 and 3 features drawn from: frequency of changes, seconds per task, max. length of pause, average length of pause, max. length of speech and average length of speech.]</p>
      <p>To investigate whether these features are suitable to describe
perceived task-difficulty as well as performance in our real data
set, we mapped the values of each feature to the score as well
as to the perceived task-difficulty labels. Subsequently, we
applied a linear regression to measure the p-value as well as
the R² and Adjusted R² values. However, as expected, single
features are not very significant. The feature with the best
values for p-value, R² and Adjusted R², mapped to score as
well as to labels, is the length of the maximal pause segment.
The statistical values for this feature are shown in table 1.
These values are not very satisfactory, as one would desire
a p-value smaller than 0.05 and values for R² and Adjusted
R² which are closer to 1. A more reasonable approach is
to combine several features instead of considering just one
feature. Hence, in the following section we will investigate
different combinations of features.</p>
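      <p>The following minimal sketch computes most of the proposed features for one student from a boolean pause mask such as the one produced by the thresholding sketch at the beginning of this section; the mask values are invented, and the frequency-of-changes feature is left unnormalised since the maximum over all students is not available in a per-student computation.</p>
      <preformat>
# Minimal sketch: computing the proposed pause/speech features for one
# student from a boolean pause mask (one value per frame; invented values).
import numpy as np

def segment_lengths(mask):
    """Lengths of the runs of True values in a boolean mask."""
    runs, count = [], 0
    for v in mask:
        if v:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs

is_pause = np.array([True, True, False, False, False, True, False, True, True])
pauses = segment_lengths(is_pause)    # pause segments pi(x)
speech = segment_lengths(~is_pause)   # speech segments si(y)

p_i, s_i = sum(pauses), sum(speech)   # total pause and speech lengths
features = {
    "pause_speech_ratio": p_i / s_i,
    "pause_percentage": p_i / (p_i + s_i),
    "max_pause": max(pauses),
    "avg_pause": p_i / len(pauses),
    "max_speech": max(speech),
    "avg_speech": s_i / len(speech),
    "n_changes": len(pauses) + len(speech),   # normalise over students later
}
print(features)
      </preformat>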
    </sec>
    <sec id="sec-12">
      <title>5.2 Feature Combination Analysis</title>
      <p>We analysed different combinations of features by applying
a multivariate linear regression to them to gain the p-value,
R² and Adjusted R² for these combinations. The
investigated combinations are those in which no features are
strongly correlated, i.e. whenever we had two correlated
features we put just one of them into the feature set for that
combination. In further steps we removed feature by feature
from the considered feature sets. Furthermore, in the
multivariate linear regression we mapped the features on the
one hand to the score and on the other hand to the labels.
The results of the best combinations, i.e. those with a p-value
at least smaller than 0.05, are shown in tables 2 and 3. For
the score there were no combinations with only 2 features
with a p-value smaller than 0.05, hence in table 2 we only
listed the best combinations with 3 up to 6 features. For
the labels, instead, there were no such combinations with a
p-value smaller than 0.05 with 6 features, so that
in table 3 we only listed the best combinations of 2 up to 5
features. For both (score and labels) there are statistically
significant feature combinations. That means that our
proposed features are able to describe the score as well as the
labels.</p>
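      <p>A minimal sketch of this combination analysis is given below: one feature of each strongly correlated pair is dropped, and a multivariate ordinary least squares regression is fitted to the remaining set. The data, the feature names and the 0.8 correlation cut-off are invented for illustration.</p>
      <preformat>
# Minimal sketch of the feature combination analysis
# (invented data; the 0.8 correlation cut-off is an illustrative choice).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(10, 4)),
                 columns=["freq_changes", "sec_per_task",
                          "avg_pause", "avg_speech"])
y = rng.normal(size=10)   # score, or the encoded perceived difficulty label

# Keep only one feature of each strongly correlated pair.
corr = X.corr().abs()
keep = [c for i, c in enumerate(X.columns)
        if not (corr[c][X.columns[:i]] > 0.8).any()]

model = sm.OLS(y, sm.add_constant(X[keep])).fit()
print(model.f_pvalue, model.rsquared, model.rsquared_adj)
      </preformat>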
    </sec>
    <sec id="sec-13">
      <title>6. SUPPORTING PERFORMANCE PREDIC</title>
    </sec>
    <sec id="sec-14">
      <title>TION AND SEQUENCING</title>
      <p>
        As mentioned in the introduction, our goal is to support the
performance prediction system and task sequencer of the
approach in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] by affect recognition, or by multimodal input
respectively. Hence, in the following we propose how
to realise this support. In figure 3 a block diagram of the
approach of supporting performance prediction and task
sequencing by means of affect recognition is presented. The
approach in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is represented in figure 3 by the non-dotted
arrows: the performance prediction gets input from former
performances and computes, by means of the machine
learning method matrix factorization, predictions for future
performances, which are the input for the task sequencer. Based
on the performance prediction input and the theory of Vygotsky's
Zone of Proximal Development, the task sequencer decides
which task shall be shown next to the student.
This process can be supported by the multimodal input as
follows:
      </p>
      <p>(1) The additional input for the performance predictor can
be the output of the affect recognition, i.e. the
perceived task-difficulty labels. In this case the
performance predictor can take the perceived task-difficulty
of the last task T(t) and use the following rules for
deciding how difficult the next task T(t+1) should be
(a minimal sketch of these rules is given below):</p>
      <list list-type="bullet">
        <list-item><p>If T(t) was too easy (label under-challenged or
flow/under-challenged), then T(t+1) should be harder.</p></list-item>
        <list-item><p>If T(t) was appropriate (label flow), then T(t+1)
should be of similar difficulty.</p></list-item>
        <list-item><p>If T(t) was too hard (label over-challenged or
over-challenged/flow), then T(t+1) should be easier.</p></list-item>
      </list>
      <p>(2) The values of the features gained by feature
extraction from speech, typed input and mouse click input
can be fed directly into the performance prediction
without applying an affect recognition. That means
that the features are mapped to scores instead of
perceived task-difficulty classes. That this makes sense
was shown in sections 4 and 5. The performance
predictor can then compare e.g. the differences between
performances, expressed as scores, and the scores
computed by means of the features (denoted ŝcore). This
difference indicates outliers, e.g. if a student felt to be in
a flow or under-challenged but his actual score is worse, i.e.
ŝcore &gt; score. In this case the student may not fully
understand the principles of the considered task
although he thinks so. Hence, the system should next
show the student rather tasks which explain the
approach of solving such kind of tasks.</p>
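      <p>The rules of option (1) above can be summarised in a small decision function; the sketch below uses the label names of section 2.2, while the numeric difficulty scale and the step size are invented placeholders.</p>
      <preformat>
# Minimal sketch of rule (1): adjust the difficulty of the next task from
# the perceived task-difficulty label of the last task
# (difficulty scale and step size are invented placeholders).
def next_task_difficulty(current, label, step=1.0):
    if label in ("under-challenged", "flow/under-challenged"):
        return current + step   # last task too easy: harder next task
    if label in ("over-challenged", "over-challenged/flow"):
        return current - step   # last task too hard: easier next task
    return current              # label "flow": keep the difficulty level

print(next_task_difficulty(3.0, "flow/under-challenged"))   # 4.0
      </preformat>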
      <p>In our studies we observed the behaviour of students
described in (2), i.e. a student was labelled as being in a
flow or under-challenged although he performed worse, as
he only thought he understood how the tasks should be solved
but was wrong. In figure 4 this behaviour is indicated by
the outliers.</p>
    </sec>
    <sec id="sec-15">
      <title>7. LABELLING AND DATA COLLECTION</title>
      <p>
        As mentioned in section 3, the labels of our real data set come
from two sources: (a) a concurrent annotation by the tutor
and (b) a retrospective annotation by another external
reviewer on the basis of the task sheets, the sound files and the
screen recordings. However, in the literature one can find
further labelling strategies, like self-labelling by the students (see
e.g. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]). The advantage of self-labelling is that one
can automatically gain a labelled data set for a subsequent
training of an affect recognition method. Furthermore, as
we want to recognise the perceived task-difficulty from the
view of the student, a label from the student himself seems to
be more appropriate than labels from another person only
reviewing the behaviour of the student. Hence, for further
studies we developed a tool for collecting speech data and
typed input and mouse click input data, labelled
automatically with the task-difficulty perceived by the student. This
tool is further described in the following section.
      </p>
    </sec>
    <sec id="sec-16">
      <title>7.1 Self-Labelling Fractional Arithmetic Tutor for Multimodal Data Collection</title>
      <p>
        To be able to conduct studies in which the students
themselves label the task-difficulty which they perceived, we
developed a tutoring tool (self: a self-labelling fractional
arithmetic tutor for multimodal data collection) written in Java.
However, for little children it might be difficult to analyse
themselves (see e.g. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]). Hence, self-labelling is often
applied in experiments with at least college students, as for
instance in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Therefore, we will conduct the experiments
with this tool first with older students and more challenging
tasks. Later on we will investigate whether there is a way to adapt
the tool so that self-labelling is also possible with younger
students. Nevertheless, conducting experiments with older
students has several advantages besides the possibility of a
reasonable self-labelling: older students are able to focus on
the tasks longer than young students, and the privacy issues
are not as strong as for younger students. Both facts lead
to more data. Hence, besides investigating the possibility of
adapting self for younger students, we have to identify
differences and similarities of the data from older and younger
students to find out how to exploit older students' data to
recognise affects from multimodal input from younger
students.
      </p>
      <p>
        In figure 5 one can see the graphical user interface of our
self-labelling multimodal data collection tool self. To gain more
background information, at the beginning self asks the students
for some information, such as their course of studies, number
of terms, age and gender. Subsequently, an instruction with
hints on how to behave is shown to the students, which they can
also look at while interacting with the tool (button
"Anleitung", i.e. instructions). self speaks to the students to motivate them
to speak with the system, and records the speech input of the
students. The speech output of self is generated by means
of text-to-speech, realised by the library MARY developed
at the DFKI ([
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]). While interacting with the system, the
student can type in numbers, ask for a hint (button "Hilfe", i.e. help),
skip the task because it is too easy or because it is too hard
(left buttons) or submit the solution (button "Endergebnis
überprüfen", i.e. check the final result). Every action of the student, like asking for
a hint or submitting the answer, is written, together with
a time stamp, into a log file immediately after the action,
enabling also the extraction of typed input or mouse click
input features. Also, a score depending on the number of
requested hints hr and the number of incorrect inputs w is
computed according to the approach in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and written into
the log file. The formula for this score is
1 - (hr/ht + w · 0.1) (2),
where ht is the total number of available hints for the
considered task. The meaning behind the formula is that each
wrong input is punished with a factor of 0.1 and every
request of a hint is punished with a factor of 1/ht, so that
if every hint was seen the score will be 0 (a small sketch of
this computation follows at the end of this section). After the student
has submitted the correct answer, he is asked to evaluate whether this
task was too easy, too hard or appropriate for him (see the
popup window in figure 5). The tasks implemented in self for
older students cover the following areas:
      </p>
      <list list-type="bullet">
        <list-item><p>Reducing fractions, with numbers and variables</p></list-item>
        <list-item><p>Fraction addition, with and without intermediate steps and with numbers and variables</p></list-item>
        <list-item><p>Fraction subtraction, with and without intermediate steps and with numbers and variables</p></list-item>
        <list-item><p>Fraction multiplication, with and without intermediate steps and with numbers and variables</p></list-item>
        <list-item><p>Fraction division, with and without intermediate steps and with numbers and variables</p></list-item>
        <list-item><p>Distributivity law, with and without intermediate steps</p></list-item>
        <list-item><p>Finite sums of unit fractions</p></list-item>
        <list-item><p>Rule of three</p></list-item>
      </list>
      <p>After developing self, the next step will be to conduct
further studies with students to collect an adequate amount of
automatically labelled speech input, typed input and mouse
click input data for training an affect recognition method
and supporting performance prediction and task sequencing.
Furthermore, we will investigate whether there is a way to adapt
self so that younger students can also label themselves.</p>
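      <p>The score of equation (2) can be sketched as a small function; the variable names follow the text, and clamping the result at 0 for many wrong inputs is our own assumption.</p>
      <preformat>
# Minimal sketch of the score in equation (2): every requested hint is
# punished by 1/ht, every wrong input by 0.1
# (clamping at 0 is our assumption, not stated in the text).
def task_score(hr, w, ht):
    """hr: requested hints, w: wrong inputs, ht: available hints."""
    return max(1.0 - (hr / ht + w * 0.1), 0.0)

print(task_score(hr=2, w=1, ht=4))   # 0.4
      </preformat>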
    </sec>
    <sec id="sec-17">
      <title>8. CONCLUSIONS</title>
      <p>We proposed a new approach for supporting performance
prediction and task sequencing in adaptive intelligent
tutoring systems by affect recognition on features gained from
multimodal input like students' speech input. For this
approach we proposed and analysed appropriate speech
features and showed that there are statistically significant
feature combinations which are able to describe students' affect,
or perceived task-difficulty respectively, as well as the
performance of a student. Furthermore, we substantiated the possibility
of supporting performance prediction and task sequencing
by perceived task-difficulties by demonstrating that there is
a correlation between perceived task-difficulty and
performance. Next steps will be to conduct more studies with
students by means of the presented self-labelling and
multimodal data collection tool, to enable the training of an
appropriate affect recognition method for supporting performance
prediction and task sequencing in adaptive intelligent
tutoring systems.</p>
    </sec>
    <sec id="sec-18">
      <title>9. ACKNOWLEDGMENTS</title>
      <p>The research leading to the results reported here has
received funding from the European Union Seventh
Framework Programme (FP7/2007-2013) under grant agreement
No. 318051 (iTalk2Learn project, www.italk2learn.eu).
Furthermore, we thank our project partner Ruhr
University Bochum for realising the study and data collection, as
well as the IMAI of the University of Hildesheim for support
with the tutoring tool and the preparation of future studies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Cichocki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zdunek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Phan</surname>
            ,
            <given-names>A. H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Amari</surname>
            ,
            <given-names>S.I.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation</article-title>
          , Wiley.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Epp</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lippold</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandryk</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Identifying Emotional States Using Keystroke Dynamics</article-title>
          .
          <source>In Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems (CHI</source>
          <year>2011</year>
          ), Vancouver, BC, Canada, pp.
          <volume>715</volume>
          -
          <fpage>724</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Exploiting Social Relations for Sentiment Analysis in Microblogging</article-title>
          .
          <source>In Proceedings of the Sixth ACM WSDM Conference (WSDM '13).</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Luz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Automatic Identification of Experts and Performance Prediction in the Multimodal Math Data Corpus through Analysis of Speech Interaction</article-title>
          . Second International Workshop on Multimodal Learning Analytics, Sydney, Australia,
          <year>December 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D</given-names>
            <surname>'Mello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Picard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            and
            <surname>Graesser</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <year>2007</year>
          .
          <article-title>Towards An Affect-Sensitive AutoTutor</article-title>
          .
          <source>Intelligent Systems, IEEE</source>
          , Vol.
          <volume>22</volume>
          , Issue 4, pp.
          <volume>53</volume>
          -
          <fpage>61</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>D'Mello</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Craig</surname>
            ,
            <given-names>S.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witherspoon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDaniel</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Graesser</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Automatic detection of learner's affect from conversational cues</article-title>
          .
          <source>User Modeling and User-Adapted Interaction</source>
          , DOI 10.1007/s11257-007-9037-6.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Morency</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oviatt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weibel</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Worsley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>ICMI 2013 grand challenge workshop on multimodal learning analytics</article-title>
          .
          <source>In Proceedings of the 15th ACM on International conference on multimodal interaction (ICMI</source>
          <year>2013</year>
          ), pp.
          <volume>373</volume>
          -
          <fpage>378</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Porayska-Pomsta</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mavrikis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D'Mello</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conati</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Baker</surname>
          </string-name>
          , R.S.J.d.
          <year>2013</year>
          .
          <article-title>Knowledge Elicitation Methods for Affect Modelling in Education</article-title>
          .
          <source>International Journal of Artificial Intelligence in Education, ISSN 1560-4292.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>A novel two-step SVM classifier for voiced/unvoiced/silence classification of speech</article-title>
          .
          <source>International Symposium on Chinese Spoken Language Processing</source>
          , pp.
          <volume>77</volume>
          -
          <fpage>80</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Sadegh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ibrahim</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Othman</surname>
            ,
            <given-names>Z.A.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Opinion Mining and Sentiment Analysis: A Survey</article-title>
          .
          <source>International Journal of Computers &amp; Technology</source>
          , Vol.
          <volume>2</volume>
          , No.
          <volume>3</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Saif</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Alani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Semantic Sentiment Analysis of Twitter</article-title>
          .
          <source>In Proceedings of the 11th International Semantic Web Conference (ISWC</source>
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Schatten</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Schmidt-Thieme</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Adaptive Content Sequencing without Domain Information</article-title>
          .
          <source>In Proceedings of the Conference on computer supported education (CSEDU</source>
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batliner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steidl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Seppi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge</article-title>
          .
          <source>Speech Communication</source>
          , Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Vygotsky</surname>
            ,
            <given-names>L.S.</given-names>
          </string-name>
          <year>1978</year>
          .
          <article-title>Mind in society: The development of higher psychological processes</article-title>
          . Harvard university press.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Heffernan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Extending Knowledge Tracing to allow Partial Credit: Using Continuous versus Binary Nodes</article-title>
          .
          <source>Artificial Intelligence in Education, Lecture Notes in Computer Science</source>
          , Vol.
          <volume>7926</volume>
          , pp.
          <volume>181</volume>
          -
          <fpage>188</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Woolf</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burleson</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arroyo</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dragon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cooper</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Picard</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Affect-aware tutors: recognising and responding to student affect</article-title>
          .
          <source>Int. J. of Learning Technology</source>
          , Vol.
          <volume>4</volume>
          , No.
          <issue>3</issue>
          /4, pp.
          <volume>129</volume>
          -
          <fpage>164</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Worsley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Blikstein</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>What's an Expert? Using Learning Analytics to Identify Emergent Markers of Expertise through Automated Speech, Sentiment and Sketch Analysis</article-title>
          .
          <source>In Proceedings of the 4th International Conference on Educational Data Mining (EDM '11)</source>
          , pp.
          <volume>235</volume>
          -
          <fpage>240</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <article-title>The MARY Text-to-Speech System</article-title>
          , http://mary.dfki.de/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>