    Recognising Perceived Task Difficulty from
         Speech and Pause Histograms

         Ruth Janning, Carlotta Schatten, and Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim
        {janning,schatten,schmidt-thieme}@ismll.uni-hildesheim.de



      Abstract. Currently, a lot of research in the field of intelligent tutoring
      systems is concerned with recognising students' emotions and affects.
      The recognition is done by extracting features from information sources
      like speech, typing and mouse clicking behaviour or physiological sensors.
      In former work we proposed some low-level speech features for perceived
      task difficulty recognition in intelligent tutoring systems. However,
      extracting these features discards some information hidden in the speech
      input. Hence, in this paper we propose and investigate speech and pause
      histograms as features, which preserve some of this otherwise lost information.
      The approach of using speech and pause histograms for perceived task
      difficulty recognition is evaluated by experiments on data collected in a
      study with German students solving mathematical tasks.

      Keywords: Intelligent tutoring systems, perceived task difficulty recog-
      nition, low-level speech features, speech and pause histograms


1   Introduction
Automatic cognition, affect and emotion recognition is a relatively young and
very important research field in the area of adaptive intelligent tutoring systems.
Some research has been done to identify useful information sources and appropriate
features able to describe students' cognition, emotions and affects. Those
information sources can be speech input, written input, typing and mouse click-
ing behaviour or input from physiological sensors. In former work ([5], [6], [7])
we proposed low-level speech features for perceived task difficulty recognition in
intelligent tutoring systems. These features are extracted from the amplitudes
of speech input of students interacting with the system and contain for instance
the maximal and average length of speech phases and pauses. However, extracting
those features loses some finer-grained information contained within the sequence
of speech and pause segments, and the question arises whether there is a way to
create features which preserve this lost information. Histograms contain much more
information than only the maximal, minimal and average value. Hence, in this work
we propose and investigate speech and pause histograms as features for perceived
task difficulty recognition, i.e. for recognising whether a student feels
over-challenged or appropriately challenged by a task. Speech and pause histograms
share the advantages of low-level speech features (they do not inherit the error of
a speech recognition step and do not require that students use words related to
emotions or affects, see also sec. 2) while avoiding the loss of information hidden
in the sequences of speech and pause segments.

2     Related Work
For the purpose of recognising emotion or affect in speech, one can distinguish
linguistic features, like n-grams and bag-of-words, from low-level features like
prosodic features, disfluencies, e.g. speech pauses ([5], [6]; see also [17]), or
articulation features ([7]). If linguistic features are extracted not from written
but from spoken input, a transcription or speech recognition process has to be
applied to the speech input before emotion or affect recognition can be conducted.
Linguistic
features for affect and emotion recognition from conversational cues were pre-
sented and investigated e.g. in [10] and [11]. Low-level features are used in the
literature for instance for expert identification, as in [18], [13] and [8], for emo-
tion and affect recognition as in [12] and [5], [6], [7] or for humour recognition as
in [15]. The advantage of using low-level features like disfluencies is that,
instead of a full transcription or speech recognition approach, only a simpler step
such as pause identification has to be applied before computing the features. That
means that one does not inherit the error of the full speech recognition approach.
Furthermore, these features do not require that students use words related to
emotions or affects. Another kind of feature with the same advantage is gained from
information about the actions of the students interacting with the system (see e.g.
[9]), such as features extracted from a log-file (see e.g. [2], [16], [14]).
In [9] such features are used to predict whether a student can correctly answer
questions in an intelligent learning environment without requesting help
and whether a student’s interaction is beneficial in terms of learning. Also the
keystroke dynamics features used in [4] belong to this kind of features. In [4]
emotional states were identified by analysing the rhythm of the typing patterns
of persons on a keyboard. A further possibility of gaining features is using the
information from physiological sensors as for instance in [1]. However, bringing
sensors into classrooms is time consuming and expensive and one has to cope
with students’ acceptance of the sensors.

3     Speech and Pause Histograms
As mentioned above, in this paper we investigate the suitability of speech and pause
histograms as features for perceived task difficulty recognition. How these speech
and pause histograms are created from students' speech input is described in
sec. 3.2, and the data we used for our experiments is described in the next section.

3.1   Data
We conducted a study in which the speech and actions of ten German students aged
10 to 12 were recorded and their perceived task difficulties were reported per task.
Fig. 1. Graphic of the decibel scale of an example sound file of a student. The two
straight horizontal lines indicate the threshold.
[Figure 2: four panels plotting normalized frequency against normalized pause
segment length, two panels labelled over-challenged, two labelled appropriately
challenged.]



Fig. 2. Normalised pause histograms for a task of four different students, where two
are labelled as over-challenged and the other two as appropriately challenged.
The labelling of these data was done on the one hand concurrently by a human tutor
and on the other hand retrospectively by a second reviewer (with a Cohen's kappa
for inter-rater reliability of 0.747, p < 0.001). Divergences between the two
labellings were resolved later by discussion between the reviewers. During the
study a paper sheet with fraction tasks was shown to the students; they were asked
to paint their solution by means of a painting software on the computer and were
prompted to explain their observations and answers aloud. The fraction tasks were
subdivided into similar subtasks and covered exercises like assigning fractions to
coloured parts of a circle or rectangle, reducing, adding or subtracting fractions
and fraction equivalence. Originally, there were 10 tasks with 1 to 10 subtasks
each, but not every task was seen by every student. We made a screen recording to
record the painting of the students and an acoustic recording to record their
speech. The screen recordings were used for the retrospective annotation. The
acoustic speech recordings, consisting of 10 wav files of 15 to 20 minutes each,
were used to gain the speech and pause histograms. The data collection resulted in
36 examples (tasks) labelled as over-challenged (12 examples) or appropriately
challenged (24 examples). After applying oversampling to the smaller set of
over-challenged examples to eliminate the class imbalance, we obtained 48 examples
(24 of class appropriately challenged, 24 of class over-challenged).
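
As an illustration of the class balancing described above, the following minimal
sketch randomly duplicates minority-class examples until both classes are equally
large; the function name and the use of Python's random module are illustrative
assumptions, not the exact procedure used in the study:

import random

def oversample_minority(examples, labels, minority_label="over-challenged", seed=0):
    """Duplicate randomly chosen minority-class examples until both classes
    contain the same number of examples (illustrative sketch only)."""
    rng = random.Random(seed)
    pairs = list(zip(examples, labels))
    minority = [p for p in pairs if p[1] == minority_label]
    majority = [p for p in pairs if p[1] != minority_label]
    balanced = pairs[:]
    while sum(1 for _, y in balanced if y == minority_label) < len(majority):
        balanced.append(rng.choice(minority))
    xs, ys = zip(*balanced)
    return list(xs), list(ys)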

3.2   Histograms for Classification
In the above mentioned study we observed that the children often exhibited longer
pauses of silence while thinking about the problem when they were over-challenged,
and fewer and shorter pauses while communicating when they were appropriately
challenged. Hence, in this paper we investigate information about pauses and speech
segments within the speech input of students in connection with the perceived task
difficulty. The first step to gain this information is to segment the acoustic
speech recordings into segments containing speech and segments corresponding to
pauses. The easiest way to do this is to define a threshold on the decibel scale,
as done e.g. in [8]. For our study of the data we also used a threshold, which was
estimated manually. The manual threshold estimation was done by extracting the
amplitudes of the sound files, computing the decibel values and generating a plot
of them like the one in fig. 1. Subsequently, we investigated which decibel values
belong to speech and which to pauses and derived an appropriate threshold from this
information. By means of this threshold the pause and speech segments can be
extracted. From the pause segments the pause histogram is generated by counting
how often each possible pause length occurs. This pause histogram is then
normalised to make the pause histograms of different speech inputs (of different
students, different tasks and different lengths) comparable. The normalisation is
done by dividing each occurring pause length by the length of the whole speech
input and dividing the frequency of each occurring pause length by the number of
all speech and pause segments, so that the resulting values lie in the interval
between 0 and 1. The same is done with the speech segments for generating the
speech histogram. Examples of normalised pause histograms and speech histograms
are shown in fig. 2 and fig. 3. The examples stem from the speech input for a task
of four different students, where two were labelled as over-challenged and the
other two as appropriately challenged.
[Figure 3: four panels plotting normalized frequency against normalized speech
segment length, two panels labelled over-challenged, two labelled appropriately
challenged.]



Fig. 3. Normalised speech histograms for a task of four different students, where two
are labelled as over-challenged and the other two as appropriately challenged.



One can see some differences between the histograms of the over-challenged students
and the appropriately challenged students as well as some similarities between the
examples with the same label. The pause histograms of the appropriately challenged
students show a lot of very small pauses within their speech, but no very large
pauses. The pause histograms of the over-challenged students, in contrast, show
long pauses and fewer small pauses than those of the appropriately challenged
students. In the speech histograms one can see that the over-challenged students
used a lot of very small speech segments of the same length, whereas for the
appropriately challenged students there is a large variance in the speech segment
length. In the following section we investigate how these histograms can be used
for classifying the speech input of a student for a task as either over-challenged
or appropriately challenged.
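
To make the procedure described in this section concrete, the following sketch (in
Python) segments a recording into speech and pause segments with a fixed decibel
threshold and builds a normalised pause histogram. The frame length, the threshold
value and the use of numpy and scipy are illustrative assumptions; in the study the
threshold was estimated manually per recording as described above.

import numpy as np
from collections import Counter
from scipy.io import wavfile

def segment_lengths(is_speech):
    """Return run lengths (in frames) of speech and pause segments."""
    speech_runs, pause_runs = [], []
    run_value, run_len = bool(is_speech[0]), 0
    for v in is_speech:
        if bool(v) == run_value:
            run_len += 1
        else:
            (speech_runs if run_value else pause_runs).append(run_len)
            run_value, run_len = bool(v), 1
    (speech_runs if run_value else pause_runs).append(run_len)
    return speech_runs, pause_runs

def normalised_pause_histogram(path, threshold_db=-35.0, frame_ms=10):
    """Build a normalised pause histogram: pause lengths are divided by the
    total recording length, frequencies by the number of all segments."""
    rate, signal = wavfile.read(path)
    if signal.ndim > 1:                                # mix down to mono
        signal = signal.mean(axis=1)
    signal = signal.astype(float)
    frame = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame
    frames = signal[:n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1)) + 1e-12
    db = 20 * np.log10(rms / np.abs(signal).max())     # dB relative to peak
    is_speech = db > threshold_db                      # True = speech frame
    speech_runs, pause_runs = segment_lengths(is_speech)
    n_segments = len(speech_runs) + len(pause_runs)
    # normalise: length / total length, frequency / number of all segments
    counts = Counter(pause_runs)
    return {length / n_frames: freq / n_segments
            for length, freq in counts.items()}

The speech histogram is obtained in the same way from the speech runs.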


4     Experiments
To investigate whether the above described speech and pause histograms are
applicable for distinguishing over-challenged and appropriately challenged students,
we conducted experiments with the preprocessing and settings described in the
following section. The experimental results are reported in sec. 4.2.

4.1   Preprocessing and Experimental Settings
To be computationally comparable, the normalised histograms still need to be
preprocessed, or more precisely generalised, as the set of possibly occurring
segment lengths is infinite (each length is a real value between 0 and 1). Hence,
we divide the x-axis (the different normalised lengths of pause or speech segments)
into a number of equally sized intervals, the buckets. Each occurring normalised
segment length is then put into the bucket to whose interval it belongs. The number
of buckets, or the bucket size respectively, is a hyperparameter, and in the
experiments we investigated different values for that parameter, i.e. we conducted
experiments with 2 up to 1,000,000 buckets (bucket sizes 0.5 down to 1.0E-6), where
the numbers of buckets are multiples of the divisors of 100. A comparison of two
different histograms can now be done
by comparing the content of each bucket in both histograms; that means that for
each bucket the normalised frequencies of segments belonging to that bucket are
compared. In our experiments we compute the difference between two histograms
by computing the differences between the frequencies in all buckets by means of
the root mean square error (RMSE):
                      RMSE = \sqrt{\frac{\sum_{i=1}^{b} \left( b_i(H_x) - b_i(H_y) \right)^2}{b}} ,            (1)

where H_x and H_y are the two histograms to compare, b_i(H_x) and b_i(H_y) are the
normalised frequency values belonging to bucket b_i of H_x and H_y, and b is the
number of buckets. For deciding to which class (over-challenged or appropriately
challenged) a histogram belongs, we applied the K-Nearest-Neighbour (KNN) approach.
KNN (see e.g. [3]) classifies an example by a majority vote of its neighbours, i.e.
the example is assigned to the class most common among its K nearest neighbours.
These K nearest neighbours are the K closest training examples in the feature space.
The closeness in our case is measured by means of the RMSE. That is, a histogram is
assigned to the class to which the majority of the K closest (in terms of RMSE)
histograms belongs. K is a further hyperparameter, and also for that parameter we
tried different values, i.e. we conducted experiments with 1 up to 35 neighbours,
where K is an odd number less than the number of unique examples. For the evaluation
we used leave-one-out cross-validation. The results of our experiments with pause
and speech histograms are discussed in the next section.
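
A minimal sketch, under the same illustrative assumptions as the sketch in sec. 3.2,
of the bucketing, the RMSE distance of eq. (1) and the leave-one-out
K-Nearest-Neighbour classification described above; the function names are
hypothetical.

import numpy as np
from collections import Counter

def bucketise(histogram, n_buckets):
    """Map a normalised histogram {length: frequency} onto n_buckets equally
    sized intervals on [0, 1]; return the total frequency per bucket."""
    buckets = np.zeros(n_buckets)
    for length, freq in histogram.items():
        idx = min(int(length * n_buckets), n_buckets - 1)
        buckets[idx] += freq
    return buckets

def rmse(bx, by):
    """Distance of eq. (1) between two bucketed histograms."""
    return np.sqrt(((bx - by) ** 2).mean())

def knn_predict(train_buckets, train_labels, query, k):
    """Majority vote among the k training histograms closest in RMSE."""
    distances = [rmse(query, b) for b in train_buckets]
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

def loocv_error(bucketed, labels, k):
    """Leave-one-out cross-validation classification error."""
    errors = 0
    for i in range(len(bucketed)):
        train_b = bucketed[:i] + bucketed[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_b, train_y, bucketed[i], k) != labels[i]:
            errors += 1
    return errors / len(bucketed)

For example, loocv_error([bucketise(h, 250) for h in pause_histograms], labels, k=3)
evaluates one of the hyperparameter configurations listed in tab. 1.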




[Figure 4: four panels (two for pause histograms, two for speech histograms)
plotting the minimal classification error (%) against the number of buckets and
against K, together with the corresponding best K or best number of buckets.]



Fig. 4. Different numbers of buckets and different numbers K of neighbours mapped
to the minimal classification error (%) and the corresponding best value for K (% of
the number of examples) and for the number of buckets (% of the max. number of
buckets) for pause and speech histograms.
4.2   Experiments with Speech and Pause Histograms

As mentioned above, we conducted experiments with different numbers of buckets and
different values for the number K of nearest neighbours. In fig. 4 we report the
minimal classification error and the corresponding best value of K for each bucket
number, as well as the minimal classification error and the corresponding best
number of buckets for each value of K, for the pause and the speech histograms.
The classification error is the number of incorrectly classified histograms divided
by the number of all histograms. The black dots in fig. 4 indicate the best
results, which are also reported in tab. 1 and 2.


Table 1. Number of buckets, bucket size, K, classification error and F-measures of
the classes over-challenged & appropriately challenged for the experiments with
pause histograms with the best results (classification error < 34%, black dots in
fig. 4).

  Number of buckets     2           2          2         50        250        500
  Bucket size          0.5        0.5        0.5        0.02      0.004      0.002
  K                     9          19         23          7         3          1
  Error (%)           31.25      33.33      25.00      33.33      33.33      31.25
  F-measure         0.57, 0.82 0.55, 0.80 0.67, 0.83 0.59, 0.63 0.59, 0.57 0.60, 0.71




Table 2. Number of buckets, bucket size, K, classification error and F-measures of
the classes over-challenged & appropriately challenged for the experiments with
speech histograms with the best results (classification error < 34%, black dots in
fig. 4).

   Number of buckets 20000 25000 50000 100000 200000 250000 500000 1000000
   Bucket size       5.0E-5 4.0E-5 2.0E-5 1.0E-5 5.0E-6 4.0E-6 2.0E-6 1.0E-6
   K                   11     11     11     11     11     11     11     11
   Error (%)         33.33 33.33 29.17 27.08 27.08 27.08 27.08 27.08
   F-measure         0.57, 0.73  0.57, 0.73  0.62, 0.78  0.64, 0.77  0.64, 0.77  0.64, 0.77  0.64, 0.77  0.64, 0.77



As one can see in fig. 4, for the pause histograms a smaller number of buckets
delivers the best results, whereas for the speech histograms the number of buckets
has to be large, i.e. a finer-grained division of the x-axis is needed for good
results. The reason might be that the pause histograms of over-challenged and
appropriately challenged students are easier to distinguish, as in the pause
histogram of an over-challenged student there are typically long pause segments
which usually do not occur in the speech of appropriately challenged students (see
also fig. 2). As fig. 3 shows, speech histograms of over-challenged and
appropriately challenged students are not so easy to distinguish. Tab. 1 and 2 show
the results of the best choices for the hyperparameters K and number of buckets and
report the classification error as well as the F-measures of both classes
(over-challenged and appropriately challenged). The F-measure is a value between 0
and 1, and the closer it is to 1 the better. It is the harmonic mean between the
ratio of examples of a class c which are correctly recognised as members of that
class (recall) and the ratio of examples classified as belonging to class c which
actually belong to class c (precision). In our experiments the F-measures of class
appropriately challenged are better than those of class over-challenged. The reason
could be that originally there were more examples of class appropriately challenged
and we just oversampled class over-challenged to obtain a balanced example set.
Nevertheless, the best classification errors of 25% and 27.08% and F-measures of
0.67, 0.83 and 0.64, 0.77 in tab. 1 and 2 indicate that speech and pause histograms
are applicable for perceived task difficulty recognition.
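
Written out, the F-measure of a class, as the harmonic mean of its precision and
recall, is

    F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} .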


5    Conclusions and Future Work
We proposed and investigated speech and pause histograms, built from the sequences
of speech and pause segments within the speech input of students, as features for
perceived task difficulty recognition. To evaluate the approach of using
the histograms for distinguishing over-challenged and appropriately challenged
students we applied a K-Nearest-Neighbour classification delivering a classifica-
tion error of 25% for pause histograms and 27.08% for speech histograms. Next
steps will be to try out other classification approaches, for instance from time se-
ries classification. Furthermore, the information from the speech histograms and
pause histograms could be combined to reach a better classification performance,
e.g. by ensemble methods.

Acknowledgements. This work is co-funded by the EU project iTalk2Learn
(www.italk2learn.eu) under grant agreement no. 318051.


References
1. Arroyo, I., Woolf, B.P., Burelson, W., Muldner, K. , Rai, D. and Tai, M.: A Multime-
   dia Adaptive Tutoring System for Mathematics that Addresses Cognition, Metacog-
   nition and Affect. In International Journal of Artificial Intelligence in Education,
   Springer, Vol. 24, pp. 387–426 (2014)
2. Baker, R.S.J.D., Gowda, S., Wixon, M., Kalka, J., Wagner, A., Salvi, A., Aleven,
   V., Kusbit, G., Ocumpaugh, J. and Rossi, L.: Towards Sensor-Free Affect Detection
   in Cognitive Tutor Algebra. In Proceedings of the 5th International Conference on
   Educational Data Mining (EDM 2012), pp. 126–133 (2012)
3. Cover, T. and Hart, P.: Nearest neighbor pattern classification. IEEE Transactions
   on Information Theory, Vol. 13(1), pp. 21–27, doi:10.1109/TIT.1967.1053964 (1967)
4. Epp, C., Lippold, M. and Mandryk, R.L.: Identifying Emotional States Using
   Keystroke Dynamics. In Proceedings of the 2011 Annual Conference on Human
   Factors in Computing Systems (CHI 2011), pp. 715–724 (2011)
5. Janning, R., Schatten, C., Schmidt-Thieme, L.: Multimodal Affect Recognition for
   Adaptive Intelligent Tutoring Systems. In Extended Proceedings of the 7th Inter-
   national Conference on Educational Data Mining (EDM 2014), pp. 171–178 (2014)
6. Janning, R., Schatten, C., Schmidt-Thieme, L.: Feature Analysis for Affect Recog-
   nition Supporting Task Sequencing in Adaptive Intelligent Tutoring Systems. In
   Proceedings of the European Conference on Technology Enhanced Learning (EC-
   TEL 2014), pp. 179–192 (2014)
7. Janning, R., Schatten, C., Schmidt-Thieme, L. and Backfried, G.: An SVM Plait
   for Improving Affect Recognition in Intelligent Tutoring Systems. In Proceedings
   of the IEEE International Conference on Tools with Artificial Intelligence (ICTAI)
   (2014)
8. Luz, S.: Automatic Identification of Experts and Performance Prediction in the
   Multimodal Math Data Corpus through Analysis of Speech Interaction. Second
   International Workshop on Multimodal Learning Analytics, Sydney Australia (2013)
9. Mavrikis, M.: Data-driven modelling of students interactions in an ILE. In Proceed-
   ings of the International Conference on Educational Data Mining (EDM 2008), pp.
   87–96 (2008)
10. D'Mello, S.K., Craig, S.D., Witherspoon, A., McDaniel, B. and Graesser, A.: Auto-
   matic detection of learners affect from conversational cues. User Model User-Adap
   Inter, DOI 10.1007/s11257-007-9037-6 (2008)
11. D’Mello, S.K. and Graesser, A.: Language and Discourse Are Powerful Signals of
   Student Emotions during Tutoring. IEEE Transactions on Learning Technologies,
   Vol. 5(4), pp. 304–317, IEEE Computer Society (2012)
12. Moore, J.D., Tian, L. and Lai, C.: Word-Level Emotion Recognition Using High-
   Level Features. Computational Linguistics and Intelligent Text Processing (CICLing
   2014), pp. 17–31 (2014)
13. Morency, L.P., Oviatt, S., Scherer, S., Weibel, N. and Worsley, M.: ICMI 2013
   grand challenge workshop on multimodal learning analytics. In Proceedings of the
   15th ACM on International conference on multimodal interaction (ICMI 2013), pp.
   373–378 (2013)
14. Pardos, Z.A., Baker, R.S.J.D, San Pedro, M., Gowda, S.M. and Gowda, S.M.:
   Affective States and State Tests: Investigating How Affect and Engagement dur-
   ing the School Year Predict End-of-Year Learning Outcomes. Journal of Learning
   Analytics, Vol. 1(1), Inaugural issue, pp. 107–128 (2014)
15. Purandare, A. and Litman, D.: Humor: Prosody Analysis and Automatic Recog-
   nition for F * R * I * E * N * D * S *. In Proceedings of the 2006 Conference on
   Empirical Methods in Natural Language Processing (EMNLP 2006), pp. 208–215
   (2006)
16. San Pedro, M.O.C., Baker, R.S.J.D., Bowers, A. and Heffernan, N.: Predicting
   College Enrollment from Student Interaction with an Intelligent Tutoring System
   in Middle School. In Proceedings of the 6th International Conference on Educational
   Data Mining (EDM 2013), pp. 177–184 (2013)
17. Schuller, B., Batliner, A., Steidl, S. and Seppi, D.: Recognising realistic emotions
   and affect in speech: State of the art and lessons learnt from the first challenge.
   Speech Communication, Elsevier (2011)
18. Worsley, M. and Blikstein, P.: What’s an Expert? Using Learning Analytics to
   Identify Emergent Markers of Expertise through Automated Speech, Sentiment and
   Sketch Analysis. In Proceedings of the 4th International Conference on Educational
   Data Mining (EDM ’11), pp. 235–240 (2011)