<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Affect Recognition for Adaptive Intelligent Tutoring Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruth Janning</string-name>
          <email>janning@ismll.uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlotta Schatten</string-name>
          <email>schatten@ismll.uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lars Schmidt-Thieme</string-name>
          <email>schmidt-thieme@ismll.uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim</institution>
          ,
          <addr-line>Marienburger Platz 22, 31141 Hildesheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim</institution>
          ,
          <addr-line>Marienburger Platz 22, 31141 Hildesheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Performance prediction and task sequencing in traditional adaptive intelligent tutoring systems need information gained from expert and domain knowledge. In a former work a new efficient task sequencer based on a performance prediction system was presented, which only needs former performance information but not the expensive expert and domain knowledge. In this paper we aim to support this approach by automatically gained multimodal input, for instance speech input from the students. Our proposed approach extracts features from this multimodal input and applies an automatic affect recognition method to these features. The recognised affects shall finally be used to support the mentioned task sequencer and its performance prediction system. Consequently, in this paper we (1) propose a new approach for supporting task sequencing and performance prediction in adaptive intelligent tutoring systems by affect recognition applied to multimodal input, (2) present an analysis of appropriate features for affect recognition extracted from students' speech input and show the suitability of the proposed features for affect recognition for adaptive intelligent tutoring systems, and (3) present a tool for data collection and labelling which helps to construct an appropriate data set for training the desired affect recognition approach.</p>
      </abstract>
      <kwd-group>
        <kwd>multimodal input</kwd>
        <kwd>affect recognition</kwd>
        <kwd>feature analysis</kwd>
        <kwd>speech</kwd>
        <kwd>adaptive intelligent tutoring systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Intelligent tutoring systems support
students, for instance in learning fractional arithmetic. The
main advantages of intelligent tutoring systems are the
possibility for a student to practice at any time, as well as the
possibility of adaptivity and individualisation for a single
student. An adaptive intelligent tutoring system possesses
an internal model of the student and a task sequencer which
decides which tasks are shown to the student, and in which order.
Originally, the task sequencing in adaptive intelligent
tutoring systems is done using information gained from expert
and domain knowledge and logged information about the
performance of students in former exercises. In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] a new
efficient sequencer based on a performance prediction
system was presented, which only uses former performance
information from the students to sequence the tasks and does
not need the expensive expert and domain knowledge. This
approach applies the machine learning method matrix
factorization (see e.g. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) for performance prediction to former
performance information. Subsequently, it uses the output
of the performance prediction process to sequence the tasks
according to the theory of Vygotsky's Zone of Proximal
Development [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. That is, the sequencer chooses the next task
in order to neither bore nor frustrate the student; in other
words, the next task should be neither too easy nor too hard for
the student.
      </p>
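      <p>To illustrate the prediction step, the following minimal sketch shows how a matrix factorization model predicts a student's performance on a task as the dot product of latent factor vectors. It is an illustration only, not the implementation of [12]; the sizes, the factor values and the 0.5 target value are invented.</p>
      <preformat>
# Minimal sketch of matrix-factorization-based performance prediction
# (illustrative only; sizes, factor values and the target are invented).
import numpy as np

rng = np.random.default_rng(0)
n_students, n_tasks, k = 10, 20, 3     # k latent dimensions

W = rng.normal(size=(n_students, k))   # latent student factors
H = rng.normal(size=(n_tasks, k))      # latent task factors

def predict_performance(student, task):
    """Predicted performance = dot product of the latent factor vectors."""
    return float(W[student] @ H[task])

# A sequencer in the spirit of the Zone of Proximal Development would pick
# a next task whose predicted performance is neither too high nor too low.
candidates = [(t, predict_performance(0, t)) for t in range(n_tasks)]
print(min(candidates, key=lambda ts: abs(ts[1] - 0.5)))   # closest to target
      </preformat>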
      <p>
        In this paper we propose to support the task sequencer and
performance prediction system of the approach in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] in a
new way, by additionally gathering and processing
multimodal information automatically. One part of this multimodal
information, which is investigated in this paper, is the speech input
from the students interacting with the intelligent tutoring
system while solving tasks. A further part will be the typed
input or mouse click input from the students, which will be
reported in upcoming works. The approach proposed in this
paper extracts features from the mentioned multimodal
information and applies an automatic affect recognition
method to these features. The output of the affect recognition
method indicates whether the last task was too easy, too hard or
appropriate for the student. This information matches the
theory of Vygotsky's Zone of Proximal Development, hence
it is obviously suitable for supporting the performance
prediction system and task sequencer of the approach in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
However, for the proposed approach we need a large amount
of labelled data. For this reason we developed a tutoring tool
which (a) records students' speech input as well as typed
input and mouse click input and (b) allows the students to
label themselves how difficult they perceived the shown
tasks. This tool is presented in the second part of this
paper and will be used to conduct further studies to gain the
desired labelled data.
      </p>
      <p>The main contributions of this paper are: (1) presentation
of a new approach for supporting performance prediction
and task sequencing in adaptive intelligent tutoring systems
by affect recognition on multimodal input, (2) identification
and analysis of appropriate and statistically significant
features for the presented approach, and (3) presentation of a
new tutoring tool for multimodal data collection and
self-labelling to gain automatically labelled data for training
appropriate affect recognition methods.</p>
      <p>In the following, we first present some preliminary
considerations along with the state of the art in section 2.
Subsequently, we describe in section 3 the real data set used
for the feature analysis and investigate in section 4 the
correlation between students' affects and their
performance in this data set. In section 5 we propose and analyse
appropriate features for affect recognition, and in section 6 we
explain how to support performance prediction and task
sequencing in intelligent tutoring systems by affect
recognition applied to multimodal input. Before we conclude, we
describe in section 7 the mentioned tool for multimodal
data collection and self-labelling.</p>
    </sec>
    <sec id="sec-2">
      <title>2. PREPARATION AND RELATED WORK</title>
      <p>Before an automatic affect recognition approach can be
applied, one has to clarify three things: (1) what kind of
features shall be used, (2) what kind of classes shall be used and
(3) which instances shall be mapped to features and labelled
with the class labels. After deciding which features, classes
and instances shall be considered, one can apply affect
recognition methods to these input data. In the following
subsections we present possible features, classes, instances and
methods for affect recognition supporting performance
prediction and task sequencing in adaptive intelligent tutoring
systems, along with the state of the art.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Features</title>
      <p>
        The first step before applying automatic affect recognition is
to identify useful features for this process. For the purpose
of recognising affect in speech one can use two different kinds
of features ([
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]): acoustic and linguistic features. Further,
one can distinguish linguistics (like n-grams and bag-of-words)
and disfluencies (like pauses). If linguistic features are used,
a transcription or speech recognition process has to be
applied to the speech input before affect recognition can be
conducted. Subsequently, approaches from the field of
sentiment classification or opinion mining (see e.g. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) can be
applied to the output of this process. However, the methods
of this field have to be adjusted to be applicable to speech
instead of written statements.
      </p>
      <p>
        Another possibility for speech features is to use disfluency
features, as was done in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for expert
identification. The advantage of using such features is that,
instead of a full transcription or speech recognition approach,
only, for instance, a pause identification has to be applied
beforehand. That means that one does not inherit the error of
the full speech recognition approach. Furthermore, these
features do not depend on the students using
words related to affects. For using this kind of features one
has to investigate which particular features are suitable for
the special task of affect classification in adaptive intelligent
tutoring systems. Because of the mentioned advantage of
disfluency features, in this work we focus on features
extracted from information about speech pauses as one part
of the multimodal input for affect recognition.
      </p>
      <p>
        As mentioned in the introduction, the other part of the
multimodal input will be features gained from
information about typed input or mouse click input from the
students. This kind of features is similar to the keystroke
dynamics features used in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] emotional states were
identified by analysing the rhythm of the typing patterns of
persons on a keyboard.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Classes</title>
      <p>
        The second step before applying automatic affect
recognition is to define the classes corresponding to the emotions and
affective states which shall be recognised by the used
affect recognition approach. According to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] it is
possible to recognise in intelligent tutoring systems students'
affects like, for instance, confusion, frustration, boredom and
flow. As mentioned above, we want to use the students'
behaviour information gained from speech and from typed
input or mouse click input for supporting the performance
prediction system and task sequencer of the approach in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
which is based on the theory of Vygotsky's Zone of Proximal
Development [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. That means that the goal is to neither
bore the student with too easy tasks nor to frustrate him
with too hard tasks, but to keep him in the Zone of Proximal
Development. Accordingly, we want to use the output of the
automatic affect recognition to get an answer to the question
"Was this task too easy, too hard or appropriate for the
student?", or in other words we want to find out whether the student
felt under-challenged, over-challenged or in a flow.
However, the mapping between confusion, frustration,
boredom and under-challenged, over-challenged is not
unambiguous, as one can infer e.g. from the studies mentioned in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
Hence, instead of the above mentioned affect classes
we will use three other classes for supporting performance
prediction and task sequencing by automatic affect recognition:
under-challenged, over-challenged and flow. One could
summarise these classes as perceived task-difficulty classes, as we
aim to recognise the individually perceived task-difficulty from
the view of the student.
      </p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Instances</title>
      <p>
        The third step before applying automatic affect recognition
is deciding which instances shall be mapped to features and
labelled with the class labels. If the goal of the affect
recognition is to provide a student with motivation or hints according
to his affective state, like e.g. in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], then instances can be
utterances. For supporting performance prediction and task
sequencing by affect recognition, instead, one needs at the end
of a task the information whether the task overall was too easy,
too hard or appropriate for the student. The reason is that
this information shall help to choose the next task shown
to the student. Hence, an instance for supporting
performance prediction and task sequencing by affect recognition
has to be, instead of an utterance, the whole speech input of
a student for one task.
      </p>
    </sec>
    <sec id="sec-6">
      <title>2.4 Methods</title>
      <p>
        The possible methods for automatic affect recognition
depend on the kind of features used as input. As
mentioned above, for speech we distinguish two kinds of features:
linguistic features and disfluencies. Linguistic features are
gained by a preceding speech recognition process and can
be processed by methods coming from the areas of sentiment
analysis and opinion mining ([
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). Especially methods from
the field of opinion mining on microposts seem to be
appropriate if linguistic features are considered. State-of-the-art
approaches in opinion mining on microposts use methods
based, for instance, on optimisation approaches ([
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) or Naive
Bayes ([
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]).
      </p>
      <p>
        The process of gaining disfluencies like pauses differs
from the full speech recognition process. For extracting, for
instance, pauses, usually an energy threshold on the decibel
scale is used, as in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], or an SVM is applied for pause
classification on acoustic features, as in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Appropriate
state-of-the-art methods for automatic emotion and affect
recognition on disfluency features, as well as on features from
information about typed input or mouse click input, are,
as proposed e.g. in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], classification methods like
artificial neural networks, SVMs, decision trees or ensembles
of those.
      </p>
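      <p>As a concrete illustration of this method family, the following minimal sketch trains an SVM on disfluency features to predict the perceived task-difficulty classes of section 2.2. It is a sketch under invented data, not a reproduction of [13] or [6]; the feature values and labels are placeholders.</p>
      <preformat>
# Minimal sketch: SVM classification on disfluency features
# (feature values and labels are invented placeholders).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One row per (student, task) instance; columns are disfluency features,
# e.g. pause/speech ratio and maximal pause length in seconds.
X = np.array([[0.8, 4.2], [0.2, 1.1], [0.5, 2.0], [0.9, 5.0]])
y = np.array(["over-challenged", "under-challenged", "flow", "over-challenged"])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.predict([[0.7, 3.5]]))   # predicted perceived task-difficulty
      </preformat>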
    </sec>
    <sec id="sec-7">
      <title>3. REAL DATA SET</title>
      <p>After identifying features, classes, instances and methods
for affect recognition for supporting performance prediction
and task sequencing as above, one can collect data for a
concrete feature analysis and for training the chosen affect
classification method. We conducted a study in which the
speech and actions of ten German students aged 10 to 12
were recorded and the students' affective states as well as
the perceived task-difficulties were reported. The labelling
of these data was done on the one hand concurrently by
the tutor and on the other hand retrospectively by a second
reviewer. Furthermore, a labelling per exercise (consisting
of several subtasks) and an overall labelling per student as
an aggregation of the labels per exercise was done. During
the study a paper sheet with fraction tasks was shown to
the students and they were asked to paint (with the
software Paint) and explain their observations and answers. We
made a screen recording to capture the painting of the
students and an acoustic recording to capture the speech of the
students. The screen recordings were used for the
retrospective annotation. The speech recordings shall be used to gain
the input for affect recognition. The mentioned typed input
and mouse click input information we will collect and
investigate in further studies with the self-labelling and multimodal
data collection tutoring tool described in section 7.1. In this
paper we focus on speech features, and hence in section 5 we
will propose and analyse possible features extracted from
speech pauses. But first, in the following section 4, we will
investigate the correlation between perceived task-difficulty
labels and the performance of the students in the real data
set.</p>
    </sec>
    <sec id="sec-8">
      <title>4. CORRELATION OF PERCEIVED TASK</title>
    </sec>
    <sec id="sec-9">
      <title>DIFFICULTY LABELS AND SCORE</title>
      <p>Before we present speech features for recognising perceived
task-difficulty, we want to show that there is a correlation
between the proposed perceived task-difficulty labels and
the performance of the students, to underline the
suitability of supporting performance prediction and task
sequencing by the proposed affect recognition approach. Hence,
we mapped the overall perceived task-difficulty labels to
the overall score of the students (see figure 1). For this
mapping we encoded the different overall perceived
task-difficulty class labels as follows:</p>
      <list list-type="simple">
        <list-item><p>0 = over-challenged</p></list-item>
        <list-item><p>1 = over-challenged/flow</p></list-item>
        <list-item><p>2 = flow</p></list-item>
        <list-item><p>3 = flow/under-challenged</p></list-item>
        <list-item><p>4 = under-challenged</p></list-item>
      </list>
      <p>The overall score of a student i is computed as nci/nti (1),
where nci is the number of correctly solved tasks of student
i and nti is the number of tasks shown to student i. In figure
1 one can see that there is a clear correlation between
perceived task-difficulty labels and score. To substantiate this
observation we applied a statistical test by conducting a
linear regression and measuring the p-value, indicating the
statistical significance, as well as the R² and Adjusted R² values,
indicating how well the regression line approximates the
real data points. This approach delivers a p-value of 0.0027,
an R² value of 0.6966, and an Adjusted R² value of 0.6586.
The small p-value indicates strong statistical significance.
The significant correlation between perceived task-difficulty
labels and scores, which represent the performance,
indicates that it makes sense to support performance prediction
and task sequencing by perceived task-difficulty classification.</p>
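      <p>The reported statistics can be computed with a standard ordinary least squares fit; the following minimal sketch shows the procedure, with invented label/score pairs in place of the study data.</p>
      <preformat>
# Minimal sketch of the statistical test: linear regression of score on
# the encoded labels, reporting p-value, R^2 and Adjusted R^2
# (the data points are invented, not the study data).
import numpy as np
import statsmodels.api as sm

labels = np.array([0, 1, 1, 2, 2, 3, 3, 4, 4, 2])   # encoded labels
scores = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.9, 0.8, 0.5])

model = sm.OLS(scores, sm.add_constant(labels)).fit()   # OLS fit

print(model.f_pvalue)       # p-value of the regression
print(model.rsquared)       # R^2
print(model.rsquared_adj)   # Adjusted R^2
      </preformat>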
    </sec>
    <sec id="sec-10">
      <title>5. SPEECH FEATURE ANALYSIS</title>
      <p>
        The features we propose and analyse in this section are
gained from speech pauses. Hence, first one has to
identify pauses within the speech input data. The easiest
way is to define a threshold on the decibel scale, as done
e.g. in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For our preliminary study of the data we also
used such a threshold, which we adjusted by hand. More
explicitly, we extracted the amplitudes of the sound files and
computed the decibel values. Subsequently, we investigated
which decibel values belong to speech and which ones to
pauses (see figure 2). For larger data sets and in the application
phase later on, one has to learn the distinction between
speech and pauses automatically, by either learning the
threshold or training an SVM which classifies speech and pauses.
      </p>
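      <p>A minimal sketch of such a hand-tuned threshold is given below; the 50 ms frame length, the -35 dB threshold (relative to the peak) and the mono WAV input are our own illustrative assumptions, not the study's exact settings.</p>
      <preformat>
# Minimal sketch of threshold-based pause detection on the decibel scale
# (assumptions: mono WAV input; 50 ms frames; hand-tuned -35 dB threshold).
import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("recording.wav")   # hypothetical file name
samples = samples.astype(np.float64)

frame = int(0.05 * rate)                        # 50 ms analysis frames
n = len(samples) // frame
rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))

db = 20 * np.log10(rms / np.max(rms) + 1e-12)   # decibels relative to peak
is_pause = db &lt; -35.0                           # frames below the threshold
      </preformat>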
    </sec>
    <sec id="sec-11">
      <title>5.1 Single Feature Analysis</title>
      <p>Before we can introduce the features we want to investigate,
we have to define some measurements:</p>
      <list list-type="simple">
        <list-item><p>m: number of students</p></list-item>
        <list-item><p>pi: total length of pauses of student i</p></list-item>
        <list-item><p>si: total length of speech of student i</p></list-item>
        <list-item><p>npi: number of pause segments of student i</p></list-item>
        <list-item><p>nsi: number of speech segments of student i</p></list-item>
        <list-item><p>pi(x): xth pause segment of student i</p></list-item>
        <list-item><p>si(y): yth speech segment of student i</p></list-item>
        <list-item><p>nti: number of tasks shown to student i</p></list-item>
        <list-item><p>nci: number of correctly solved tasks by student i</p></list-item>
        <list-item><p>overall score of student i: nci/nti</p></list-item>
      </list>
      <p>Our data set consists of acoustic recordings from m students,
each of whom saw nti tasks and solved nci tasks correctly.
The overall score of a student i in this case is the number
of correctly solved tasks nci divided by the number of seen
tasks nti. After applying the above mentioned threshold to
the data, we get for each student i the total length of pauses
pi and the total length of speech si in his acoustic recording.
Furthermore, we can count connected pause and speech
segments to get the number of pause segments npi and speech
segments nsi of a student i. The xth pause segment is then
pi(x) and the yth speech segment si(y). By means of these
measurements and their combinations we can create a set of
features useful for affect recognition supporting performance
prediction and task sequencing:</p>
      <sec id="sec-11-1">
        <title>Ratio between pauses and speech ( psii )</title>
        <p>Frequency of speech pause changes ( maxnjp(ni+pjn+sinsj ) )
Percentage of pauses of input speech data ( (pip+isi) )</p>
        <sec id="sec-11-1-1">
          <title>Length of maximal pause segment (maxx(pi(x)))</title>
        </sec>
      </sec>
      <sec id="sec-11-2">
        <title>Length of average pause segment ( Px pi(x) )</title>
        <p>npi</p>
        <sec id="sec-11-2-1">
          <title>Length of maximal speech segment (maxy(si(y)))</title>
          <p>Length of average speech segment ( Pynssii(y) )
Average number of seconds needed per task ( (pi+si) )
nti
        The ratio between the total length of pauses and the total
length of speech indicates whether one of them is notably larger
than the other, i.e. whether the student paused much more
than he spoke or vice versa. The frequency
of speech and pause segment changes indicates whether there are
many short speech and pause segments or just a few long
ones; it is normalised by dividing by the maximal sum
of pause and speech segments over all students. From the
percentage of pauses one can see whether the total pause length
was much larger than the total speech length, i.e. whether the student
did not speak much but was rather thinking silently. The
length of the maximal pause or speech segment indicates whether there
was e.g. a very long pause segment in which the student was
thinking silently, or a very long speech segment in which the
student was in a speech flow. The length of the average pause
or speech segment gives us an idea of how much, on average,
the student was in a silent thinking phase or a speech flow.
The average number of seconds needed per task indicates
how long a student on average needed to solve a task.</p>
      <p>[Table 2: best feature combinations (number of features, features, p-value), with combinations of 6, 5, 4 and 3 features drawn from: frequency of changes, seconds per task, max. length of pause, average length of pause, max. length of speech and average length of speech.]</p>
      <p>To investigate whether these features are suitable to describe
perceived task-difficulty as well as performance in our real data
set, we mapped the values of each feature to the score as well
as to the perceived task-difficulty labels. Subsequently, we
applied a linear regression to measure the p-value as well as
the R² and Adjusted R² values. However, as expected, single
features are not very significant. The feature with the best
values for p-value, R² and Adjusted R², mapped to score as
well as to labels, is the length of the maximal pause segment.
The statistical values for this feature are shown in table 1.
These values are not very satisfactory, as one would desire
a p-value smaller than 0.05 and values for R² and Adjusted
R² which are closer to 1. A more reasonable approach is
to combine several features instead of considering just one
feature. Hence, in the following section we will investigate
different combinations of features.</p>
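      <p>The following minimal sketch computes most of the proposed features for one student from a boolean pause mask such as the one produced by the thresholding sketch at the beginning of this section; the mask values are invented, and the frequency-of-changes feature is left unnormalised since the maximum over all students is not available in a per-student computation.</p>
      <preformat>
# Minimal sketch: computing the proposed pause/speech features for one
# student from a boolean pause mask (one value per frame; invented values).
import numpy as np

def segment_lengths(mask):
    """Lengths of the runs of True values in a boolean mask."""
    runs, count = [], 0
    for v in mask:
        if v:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs

is_pause = np.array([True, True, False, False, False, True, False, True, True])
pauses = segment_lengths(is_pause)    # pause segments pi(x)
speech = segment_lengths(~is_pause)   # speech segments si(y)

p_i, s_i = sum(pauses), sum(speech)   # total pause and speech lengths
features = {
    "pause_speech_ratio": p_i / s_i,
    "pause_percentage": p_i / (p_i + s_i),
    "max_pause": max(pauses),
    "avg_pause": p_i / len(pauses),
    "max_speech": max(speech),
    "avg_speech": s_i / len(speech),
    "n_changes": len(pauses) + len(speech),   # normalise over students later
}
print(features)
      </preformat>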
    </sec>
    <sec id="sec-12">
      <title>5.2 Feature Combination Analysis</title>
      <p>We analysed different combinations of features by applying
a multivariate linear regression to them to gain the p-value,
R² and Adjusted R² for these combinations. The
investigated combinations are those in which no features are
strongly correlated, i.e. whenever we had two correlated
features we put just one of them into the feature set for that
combination. In further steps we removed feature by feature
from the considered feature sets. Furthermore, in the
multivariate linear regression we mapped the features on the
one hand to the score and on the other hand to the labels.
The results of the best combinations, i.e. those with a p-value
at least smaller than 0.05, are shown in tables 2 and 3. For
the score there were no combinations with only 2 features
with a p-value smaller than 0.05, hence in table 2 we only
listed the best combinations with 3 up to 6 features. For
the labels, instead, there were no such combinations with a
p-value smaller than 0.05 with 6 features, so that
in table 3 we only listed the best combinations of 2 up to 5
features. For both (score and labels) there are statistically
significant feature combinations. That means that our
proposed features are able to describe the score as well as the
labels.</p>
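      <p>A minimal sketch of this combination analysis is given below: one feature of each strongly correlated pair is dropped, and a multivariate ordinary least squares regression is fitted to the remaining set. The data, the feature names and the 0.8 correlation cut-off are invented for illustration.</p>
      <preformat>
# Minimal sketch of the feature combination analysis
# (invented data; the 0.8 correlation cut-off is an illustrative choice).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(10, 4)),
                 columns=["freq_changes", "sec_per_task",
                          "avg_pause", "avg_speech"])
y = rng.normal(size=10)   # score, or the encoded perceived difficulty label

# Keep only one feature of each strongly correlated pair.
corr = X.corr().abs()
keep = [c for i, c in enumerate(X.columns)
        if not (corr[c][X.columns[:i]] > 0.8).any()]

model = sm.OLS(y, sm.add_constant(X[keep])).fit()
print(model.f_pvalue, model.rsquared, model.rsquared_adj)
      </preformat>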
    </sec>
    <sec id="sec-13">
      <title>6. SUPPORTING PERFORMANCE PREDIC</title>
    </sec>
    <sec id="sec-14">
      <title>TION AND SEQUENCING</title>
      <p>
        As mentioned in the introduction, our goal is to support the
performance prediction system and task sequencer of the
approach in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] by affect recognition, or by multimodal input
respectively. Hence, in the following we propose how
to realise this support. In figure 3 a block diagram of the
approach of supporting performance prediction and task
sequencing by means of affect recognition is presented. The
approach in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is represented in figure 3 by the non-dotted
arrows: the performance prediction gets input from former
performances and computes, by means of the machine
learning method matrix factorization, predictions for future
performances, which are the input for the task sequencer. Based
on the performance prediction input and the theory of Vygotsky's
Zone of Proximal Development, the task sequencer decides
which task shall be shown next to the student.
This process can be supported by the multimodal input as
follows:
      </p>
      <p>(1) The additional input for the performance predictor can
be the output of the affect recognition, i.e. the
perceived task-difficulty labels. In this case the
performance predictor can take the perceived task-difficulty
of the last task T(t) and use the following rules for
deciding how difficult the next task T(t+1) should be
(a minimal sketch of these rules is given below):</p>
      <list list-type="bullet">
        <list-item><p>If T(t) was too easy (label under-challenged or
flow/under-challenged), then T(t+1) should be harder.</p></list-item>
        <list-item><p>If T(t) was appropriate (label flow), then T(t+1)
should be of similar difficulty.</p></list-item>
        <list-item><p>If T(t) was too hard (label over-challenged or
over-challenged/flow), then T(t+1) should be easier.</p></list-item>
      </list>
      <p>(2) The values of the features gained by feature
extraction from speech, typed input and mouse click input
can be fed directly into the performance prediction
without applying an affect recognition. That means
that the features are mapped to scores instead of
perceived task-difficulty classes. That this makes sense
was shown in sections 4 and 5. The performance
predictor can then compare e.g. the differences between
performances, expressed as scores, and the scores
computed by means of the features (denoted ŝcore). This
difference indicates outliers, e.g. if a student felt to be in
a flow or under-challenged but his actual score is worse, i.e.
ŝcore &gt; score. In this case the student may not fully
understand the principles of the considered task
although he thinks so. Hence, the system should next
show the student rather tasks which explain the
approach of solving such kind of tasks.</p>
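      <p>The rules of option (1) above can be summarised in a small decision function; the sketch below uses the label names of section 2.2, while the numeric difficulty scale and the step size are invented placeholders.</p>
      <preformat>
# Minimal sketch of rule (1): adjust the difficulty of the next task from
# the perceived task-difficulty label of the last task
# (difficulty scale and step size are invented placeholders).
def next_task_difficulty(current, label, step=1.0):
    if label in ("under-challenged", "flow/under-challenged"):
        return current + step   # last task too easy: harder next task
    if label in ("over-challenged", "over-challenged/flow"):
        return current - step   # last task too hard: easier next task
    return current              # label "flow": keep the difficulty level

print(next_task_difficulty(3.0, "flow/under-challenged"))   # 4.0
      </preformat>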
      <p>In our studies we observed the behaviour of students
described in (2), i.e. a student was labelled as being in a
flow or under-challenged although he performed worse, as
he only thought he understood how the tasks should be solved
but was wrong. In figure 4 this behaviour is indicated by
the outliers.</p>
    </sec>
    <sec id="sec-15">
      <title>7. LABELLING AND DATA COLLECTION</title>
      <p>
        As mentioned in section 3, the labels of our real data set come
from two sources: (a) a concurrent annotation by the tutor
and (b) a retrospective annotation by another external
reviewer on the basis of the task sheets, the sound files and the
screen recordings. However, in the literature one can find
further labelling strategies, like self-labelling by the students (see
e.g. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]). The advantage of self-labelling is that one
can automatically gain a labelled data set for a subsequent
training of an affect recognition method. Furthermore, as
we want to recognise the perceived task-difficulty from the
view of the student, a label from the student himself seems to
be more appropriate than labels from another person only
reviewing the behaviour of the student. Hence, for further
studies we developed a tool for collecting speech data and
typed input and mouse click input data, labelled
automatically with the task-difficulty perceived by the student. This
tool is further described in the following section.
      </p>
    </sec>
    <sec id="sec-16">
      <title>7.1 Self-Labelling Fractional Arithmetic Tutor for Multimodal Data Collection</title>
      <p>
        To be able to conduct studies in which the students
themselves label the task-difficulty which they perceived, we
developed a tutoring tool (self: a self-labelling fractional
arithmetic tutor for multimodal data collection) written in Java.
However, for little children it might be difficult to analyse
themselves (see e.g. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]). Hence, self-labelling is often
applied in experiments with at least college students, as for
instance in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Therefore, we will conduct the experiments
with this tool first with older students and more challenging
tasks. Later on we will investigate whether there is a way to adapt
the tool so that self-labelling is also possible with younger
students. Nevertheless, conducting experiments with older
students has several advantages besides the possibility of a
reasonable self-labelling: older students are able to focus on
the tasks longer than young students, and the privacy issues
are not as strong as for younger students. Both facts lead
to more data. Hence, besides investigating the possibility of
adapting self for younger students, we have to identify
differences and similarities of the data from older and younger
students to find out how to exploit older students' data to
recognise affects from multimodal input from younger
students.
      </p>
      <p>
        In figure 5 one can see the graphical user interface of our
self-labelling multimodal data collection tool self. To gain more
background information, at the beginning self asks the students
for some information, such as their course of studies, number
of terms, age and gender. Subsequently, an instruction with
hints on how to behave is shown to the students, which they can
also look at while interacting with the tool (button
"Anleitung", i.e. instructions). self speaks to the students to motivate them
to speak with the system, and records the speech input of the
students. The speech output of self is generated by means
of text-to-speech, realised by the library MARY developed
at the DFKI ([
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]). While interacting with the system, the
student can type in numbers, ask for a hint (button "Hilfe", i.e. help),
skip the task because it is too easy or because it is too hard
(left buttons) or submit the solution (button "Endergebnis
überprüfen", i.e. check the final result). Every action of the student, like asking for
a hint or submitting the answer, is written, together with
a time stamp, into a log file immediately after the action,
enabling also the extraction of typed input or mouse click
input features. Also, a score depending on the number of
requested hints hr and the number of incorrect inputs w is
computed according to the approach in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and written into
the log file. The formula for this score is
1 - (hr/ht + w · 0.1) (2),
where ht is the total number of available hints for the
considered task. The meaning behind the formula is that each
wrong input is punished with a factor of 0.1 and every
request of a hint is punished with a factor of 1/ht, so that
if every hint was seen the score will be 0 (a small sketch of
this computation follows at the end of this section). After the student
has submitted the correct answer, he is asked to evaluate whether this
task was too easy, too hard or appropriate for him (see the
popup window in figure 5). The tasks implemented in self for
older students cover the following areas:
      </p>
      <list list-type="bullet">
        <list-item><p>Reducing fractions, with numbers and variables</p></list-item>
        <list-item><p>Fraction addition, with and without intermediate steps and with numbers and variables</p></list-item>
        <list-item><p>Fraction subtraction, with and without intermediate steps and with numbers and variables</p></list-item>
        <list-item><p>Fraction multiplication, with and without intermediate steps and with numbers and variables</p></list-item>
        <list-item><p>Fraction division, with and without intermediate steps and with numbers and variables</p></list-item>
        <list-item><p>Distributivity law, with and without intermediate steps</p></list-item>
        <list-item><p>Finite sums of unit fractions</p></list-item>
        <list-item><p>Rule of three</p></list-item>
      </list>
      <p>After developing self, the next step will be to conduct
further studies with students to collect an adequate amount of
automatically labelled speech input, typed input and mouse
click input data for training an affect recognition method
and supporting performance prediction and task sequencing.
Furthermore, we will investigate whether there is a way to adapt
self so that younger students can also label themselves.</p>
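      <p>The score of equation (2) can be sketched as a small function; the variable names follow the text, and clamping the result at 0 for many wrong inputs is our own assumption.</p>
      <preformat>
# Minimal sketch of the score in equation (2): every requested hint is
# punished by 1/ht, every wrong input by 0.1
# (clamping at 0 is our assumption, not stated in the text).
def task_score(hr, w, ht):
    """hr: requested hints, w: wrong inputs, ht: available hints."""
    return max(1.0 - (hr / ht + w * 0.1), 0.0)

print(task_score(hr=2, w=1, ht=4))   # 0.4
      </preformat>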
    </sec>
    <sec id="sec-17">
      <title>8. CONCLUSIONS</title>
      <p>We proposed a new approach for supporting performance
prediction and task sequencing in adaptive intelligent
tutoring systems by affect recognition on features gained from
multimodal input like students' speech input. For this
approach we proposed and analysed appropriate speech
features and showed that there are statistically significant
feature combinations which are able to describe students' affect,
or perceived task-difficulty respectively, as well as the
performance of a student. Furthermore, we substantiated the possibility
of supporting performance prediction and task sequencing
by perceived task-difficulties by demonstrating that there is
a correlation between perceived task-difficulty and
performance. Next steps will be to conduct more studies with
students by means of the presented self-labelling and
multimodal data collection tool, to enable the training of an
appropriate affect recognition method for supporting performance
prediction and task sequencing in adaptive intelligent
tutoring systems.</p>
    </sec>
    <sec id="sec-18">
      <title>9. ACKNOWLEDGMENTS</title>
      <p>The research leading to the results reported here has
received funding from the European Union Seventh
Framework Programme (FP7/2007-2013) under grant agreement
No. 318051 (iTalk2Learn project, www.italk2learn.eu).
Furthermore, we thank our project partner Ruhr
University Bochum for realising the study and data collection, as
well as the IMAI of the University of Hildesheim for support
with the tutoring tool and the preparation of future studies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Cichocki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zdunek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Phan</surname>
            ,
            <given-names>A. H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Amari</surname>
            ,
            <given-names>S.I.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation</article-title>
          , Wiley.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Epp</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lippold</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandryk</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Identifying Emotional States Using Keystroke Dynamics</article-title>
          .
          <source>In Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems (CHI</source>
          <year>2011</year>
          ), Vancouver, BC, Canada, pp.
          <volume>715</volume>
          -
          <fpage>724</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Exploiting Social Relations for Sentiment Analysis in Microblogging</article-title>
          .
          <source>In Proceedings of the Sixth ACM WSDM Conference (WSDM '13).</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Luz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Automatic Identification of Experts and Performance Prediction in the Multimodal Math Data Corpus through Analysis of Speech Interaction</article-title>
          . Second International Workshop on Multimodal Learning Analytics, Sydney, Australia,
          <year>December 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D</given-names>
            <surname>'Mello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Picard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            and
            <surname>Graesser</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <year>2007</year>
          .
          <article-title>Towards An Affect-Sensitive AutoTutor</article-title>
          .
          <source>Intelligent Systems, IEEE</source>
          , Vol.
          <volume>22</volume>
          , Issue 4, pp.
          <volume>53</volume>
          -
          <fpage>61</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>D'Mello</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Craig</surname>
            ,
            <given-names>S.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witherspoon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDaniel</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Graesser</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Automatic detection of learner's affect from conversational cues</article-title>
          .
          <source>User Modeling and User-Adapted Interaction</source>
          , DOI 10.1007/s11257-007-9037-6.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Morency</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oviatt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weibel</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Worsley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>ICMI 2013 grand challenge workshop on multimodal learning analytics</article-title>
          .
          <source>In Proceedings of the 15th ACM on International conference on multimodal interaction (ICMI</source>
          <year>2013</year>
          ), pp.
          <volume>373</volume>
          -
          <fpage>378</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Porayska-Pomsta</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mavrikis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D'Mello</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conati</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Baker</surname>
          </string-name>
          , R.S.J.d.
          <year>2013</year>
          .
          <article-title>Knowledge Elicitation Methods for Affect Modelling in Education</article-title>
          .
          <source>International Journal of Artificial Intelligence in Education, ISSN 1560-4292.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>A novel two-step SVM classifier for voiced/unvoiced/silence classification of speech</article-title>
          .
          <source>International Symposium on Chinese Spoken Language Processing</source>
          , pp.
          <volume>77</volume>
          -
          <fpage>80</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Sadegh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ibrahim</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Othman</surname>
            ,
            <given-names>Z.A.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Opinion Mining and Sentiment Analysis: A Survey</article-title>
          .
          <source>International Journal of Computers &amp; Technology</source>
          , Vol.
          <volume>2</volume>
          , No.
          <volume>3</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Saif</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Alani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Semantic Sentiment Analysis of Twitter</article-title>
          .
          <source>In Proceedings of the 11th International Semantic Web Conference (ISWC</source>
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Schatten</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Schmidt-Thieme</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Adaptive Content Sequencing without Domain Information</article-title>
          .
          <source>In Proceedings of the Conference on computer supported education (CSEDU</source>
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batliner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steidl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Seppi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge</article-title>
          .
          <source>Speech Communication</source>
          , Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Vygotsky</surname>
            ,
            <given-names>L.S.</given-names>
          </string-name>
          <year>1978</year>
          .
          <article-title>Mind in society: The development of higher psychological processes</article-title>
          . Harvard university press.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Heffernan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Extending Knowledge Tracing to allow Partial Credit: Using Continuous versus Binary Nodes</article-title>
          .
          <source>Artificial Intelligence in Education, Lecture Notes in Computer Science</source>
          , Vol.
          <volume>7926</volume>
          , pp.
          <volume>181</volume>
          -
          <fpage>188</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Woolf</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burleson</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arroyo</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dragon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cooper</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Picard</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Affect-aware tutors: recognising and responding to student affect</article-title>
          .
          <source>Int. J. of Learning Technology</source>
          , Vol.
          <volume>4</volume>
          , No.
          <issue>3</issue>
          /4, pp.
          <volume>129</volume>
          -
          <fpage>164</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Worsley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Blikstein</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>What's an Expert? Using Learning Analytics to Identify Emergent Markers of Expertise through Automated Speech, Sentiment and Sketch Analysis</article-title>
          .
          <source>In Proceedings of the 4th International Conference on Educational Data Mining (EDM '11)</source>
          , pp.
          <volume>235</volume>
          -
          <fpage>240</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <article-title>The MARY Text-to-Speech System</article-title>
          , http://mary.dfki.de/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>