Correlation of perceived fluency with phonetic measures of speech rate and pausing

Peter Kleman
Department of English and American studies, Faculty of Philosophy, Constantine the Philosopher University, Štefánikova trieda 38/67, Nitra, 949 10, Slovakia
peter.kleman@ukf.sk

Štefan Beňuš
Department of English and American studies, Faculty of Philosophy, Constantine the Philosopher University, Štefánikova trieda 38/67, Nitra, 949 10, Slovakia
Institute of Informatics of the Slovak Academy of Sciences, Dúbravská cesta 9, 841 04 Bratislava, Slovakia
sbenus@ukf.sk

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. The paper studies the relationship between perceived fluency of L2 semi-spontaneous utterances and phonetic measures such as speech rate and the number of pauses. The data for the correlation analysis comes from a word guessing experiment conducted with Slovaks speaking English. Subjects provided cues for target words intended to facilitate the correct guessing of those words. In the second phase, speakers were asked to guess the words to which the interlocutors were providing cues. The guessers were also asked to evaluate the fluency of the interlocutors for each of the words that they were guessing. The data from the recordings is analysed through a correlation analysis of the phonetic measures extracted from the acoustic signal and the level of perceived fluency that was elicited for each target word. The study found that phonetic measures do correlate with the levels of perceived fluency. The findings may be used for improvements in automated computer-assisted fluency assessment.

1 Introduction

The study of the relationship between fluency and phonetic measures is useful for understanding how humans perceive the fluency of their peers, and it aids the development of automatic fluency measuring algorithms and programs. Such technological advances will be useful in the coming age of intelligent self-learning computers that will be able to understand, evaluate, and perhaps even study human languages.

De Jong and Wempe conducted a study in 2009 [1] using Praat to automatically detect syllable nuclei in order to measure speech rate. The data used in the study came from experiments performed by 8 participants with tasks such as reading aloud syllable lists and informal storytelling. They conducted a correlation analysis of the automatically predicted syllable counts in relation to human syllable counts made on the same data. The study concluded that automatic syllable counts could reliably assess and compare speech rates.

Kallio, Suni, Virkkunen, and Šimko conducted a study in 2018 [2] on whether prosodic prominence levels of syllables could be used to predict the prosodic competence of L2 speakers of Swedish. They used a continuous wavelet transformation analysis of syllable prominence with combinations of f0, energy, and duration features. The data for the test was gathered from a larger corpus created during a computer-aided oral test. They manually annotated the data to syllable level and measured f0 using Praat. This data was assessed using wavelet transformation analysis. The second set of assessments was gathered from expert raters. The results showed that the automatic assessments correlated with the assessments of expert raters. This provided strong support for future use of wavelet-based prominence estimation in automatic assessment of L2 proficiency.

Ramanarayanan, Lange, and Evanini studied the human and automated scoring of fluency, pronunciation, and intonation [3]. They collected interactions of L2 speakers of English and used both human raters and machine learning to create scores for each of the aspects. The study showed that trained scoring models were generally on par with human raters' scores.

Such automated assessments therefore need two separate sets of data. The first set consists of subjective data gathered from evaluations of fluency provided by subjects [4]. The second set consists of phonetic measures that were previously studied and had their importance assessed [5]. This approach to data gathering was also used in the present study. With an increased volume of such data available, the algorithms can be improved to incorporate more measures that aid the computer in better assessing various aspects of human speech.

The aim of the study was to search for a statistically significant correlation between perceived fluency and phonetic measures. This was first studied across the data from all the speakers in one group. Secondly, the speakers were divided into groups consisting of assessors of the same proficiency level. We expected that the correlation would be better with all subjects taken into account as assessors, as opposed to only assessors of the same proficiency group. The rationale behind this expectation is that the more varied points of view we have on assessment, the better the correlation results will be. This was also meant to avoid the extremes that were predicted to come up in the analyses.

1.1 Definitions

For a number of L2 speakers of English, fluency seems to be an elusive language feature that they can never quite master. Various disfluencies can have an impact on the speech of a person, both native and non-native, as previously demonstrated in research [4]. Previous research on fluency provides several definitions of what fluency actually is [3, 7, 8, 9, 10, 11, 12], but there does not seem to be a definition that is accepted by all. In general, fluency is considered to be the overall proficiency of a speaker who uses a language at a high level [13, 14, 15]. The same general definition can be used for L2 fluency as well. Fluency has also been used as an umbrella term divided into a broad sense and a narrow sense [16]. The broad sense shares a similar definition to the one mentioned above, while the narrow sense refers only to the speed and smoothness of delivery.

Perceived fluency is defined as "inferences listeners make about a speaker's cognitive fluency based on their perception of utterance fluency" [17]. This aspect of fluency was important for the creation of the experiment, since it provided an understanding of how fluency is subjectively perceived and what constitutes fluent speech in the narrow sense that can be used for analysis. The analysis of perceived fluency and phonetic measures is a new direction for the automated assessment of fluency.

2 Methodology

Two previously mentioned ideas [16, 17] were joined in the creation of the current study. Smoothness was represented by the frequency and length of pauses, and speed by words per second and the overall wordcount. Perceived fluency [17] was used as a subjective measure that was collected from subjects in the experiment.

The basis for the study was a semi-spontaneous word guessing experiment conducted on 13 L2 speakers of English with proficiency levels of C1, B2, and B1. The experiment was divided into two phases. In the first phase the subjects were tasked with creating cues for a set of provided words. These words were randomly chosen from the British National Corpus with the criteria of being at most three syllables long and being a noun, a verb, or an adjective. Each speaker was given a set of ten words and was asked to create two cues for each word. They were asked not to use the words that they were hinting at. The cues that they provided were recorded and concatenated into a single recording for each of the speakers. These recordings always consisted of the first cue for the word, a three second pause provided for the guessers as thinking space, then the second cue for the word, followed by another three second pause.

The recordings processed in this way were used in phase two, where the subjects were asked to guess the words to which the interlocutors were providing cues. Each subject listened to the recordings of all other subjects. The success of guesses was recorded for future use. After the subjects listened to both of the cues for each word, they were asked to evaluate the fluency level of the interlocutors on a scale of 1 to 7. Since all subjects were naïve assessors, they were mainly asked to focus on guessing the words from cues and to provide spontaneous assessments of fluency. The experimenter marked the perceived fluency assessments for each of the words. Each of the subjects provided 10 assessments for each of the 12 other speakers, resulting in a data set of 120 assessments for each subject.

2.1 Data processing

In the data processing, the recordings from phase one were labelled using the Praat speech analysis software. Each recording was annotated in three tiers. The first was the cue tier, in which the cues were labelled from their beginning to their end. The second was the word tier, where each of the words was labelled from its beginning to its end. The third was the pause tier, where each of the pauses was labelled from its beginning to its end.

A Praat script was then used to extract from these annotations the number of words in each cue and their length, and also the number of inside cue pauses and their length. The data was transferred into an Excel sheet, where the words per second were calculated as the sum of words in both cues divided by the sum of the word durations in both cues and the inside cue pause durations in both cues. The overall wordcount was calculated as the sum of words in both cues. The overall pause count was calculated as the sum of inside cue pauses in both cues. Lastly, the overall duration of pauses was calculated as the sum of inside cue pause durations in both cues. The levels of perceived fluency were also added to each word as evaluated by each of the subjects.

The first data set was created from the evaluations of fluency that were provided by subjects during the word guessing experiment. The second data set consisted of four different phonetic measures that were chosen for the correlation analysis in relation to the evaluated levels of fluency. These measures are words per second, wordcount, length of pauses, and the number of pauses. Such a pair of data is referred to as an objective-subjective pair or subjective-objective approach [18]. The measures were used as an objective means of assessing fluency in relation to the subjective evaluations of perceived fluency that were provided by the participants while listening to cues from their peers.

The research examined the correlation of perceived fluency and phonetic measures analysed in the recording data from phase one. The average level of perceived fluency was calculated for each of the words from the normalised fluency evaluations in the following way. The data was arranged as a table with the perceived fluency evaluations from each speaker as columns. Each of the cue pairs had an original evaluation value of one to seven and was represented as a row. In order to normalise the data, we took each of the evaluations and subtracted from it the minimum score that the speaker provided in their entire column. This number was divided by the difference between the maximum per column and the minimum per column. The result was a number between 0 and 1, where 0 represented the lowest score provided by the speaker and 1 the highest score.

2.2 Data analysis

The correlation of data was studied in four cases by calculating the Pearson correlation coefficient, and also with multiple linear regression. Each pair for the calculation of the Pearson correlation coefficient consisted of the perceived fluency evaluation and a phonetic measure. The first pair used words per second as the independent variable, the second used wordcount, the third used the number of pauses, and the fourth used the total duration of inside cue pauses.

3 Results

3.1 Results for all speakers

As mentioned before, four pairs of data sets were created for the calculation of the Pearson correlation coefficient. In the first pair of data sets, which consisted of words per second and perceived fluency, a Pearson r was computed to assess the relationship between perceived fluency and words per second. We found a significant positive relationship (r = 0.574, p < 0.001). The relationship between the two variables is visualised in a scatterplot shown in Fig. 1.

Fig. 1. Correlation data for words per second and perceived fluency

In the second pair of data sets, which consisted of the wordcount in both cues per word and perceived fluency, a Pearson r was computed to assess the relationship between perceived fluency and wordcount. We found a significant positive relationship (r = 0.316, p < 0.001). The data sets are visualised in a scatterplot shown in Fig. 2.

Fig. 2. Correlation data for wordcount and perceived fluency

In the third pair of data sets, which consisted of the sum of the number of inside cue pauses and perceived fluency, a Pearson r was computed to assess the relationship between perceived fluency and total pause count. We did not find a significant relationship, suggesting that the pair does not correlate (r = -0.098, p = 0.579). The data is visualised in a scatterplot shown in Fig. 3.

Fig. 3. Correlation data for the number of pauses and perceived fluency

In the fourth pair of data sets, which consisted of the total duration of pauses inside both cues per word and perceived fluency, a Pearson r was computed to assess the relationship between perceived fluency and total pause duration. We found a significant negative relationship (r = -0.479, p < 0.001). The data is visualised in Fig. 4.

Fig. 4. Correlation data for the total inside cue pause duration and perceived fluency

A multiple linear regression was calculated to predict perceived fluency based on words per second, wordcount, and pause duration. Pause count was omitted, as it did not seem to have an effect on perceived fluency based on the correlation result above. A significant regression model was found (F(3,126) = 39.333, p < 0.001), with an R2 of 0.484. The regression coefficients are shown in Table 1. Predicted perceived fluency increased by 0.068 for each word per second and by 0.013 for each word, and decreased by 0.046 for each second of total pause duration. The coefficients in the table correspond to the phonetic measures that were used. The intercept represents the predicted perceived fluency when all three measures are zero. All three measures were significant predictors of perceived fluency.

Table 1. Results of multiple linear regression calculations (R Square = 0.484)

            Coef      t Stat    P-value
Intercept    0.243     3.076    0.003
wps          0.068     2.195    0.030
wordcount    0.013     5.921    < 0.001
icp_dur     -0.046    -4.322    < 0.001

3.2 Results for each proficiency group

The data was then divided into three proficiency groups and was again analysed using the Pearson correlation coefficient and multiple linear regression. This was done in order to study which phonetic measures influence the relationship between produced and perceived fluency in each of the proficiency groups. Three groups were created, consisting of only C1 level speakers, only B2 level speakers, or only B1 level speakers. All the assessments made by the speakers of a group were taken into account and a new value for perceived fluency was calculated from their evaluations. In the tables below, wps stands for words per second, wc for wordcount, icpc for inside cue pause count, icpd for inside cue pause duration, and pf for perceived fluency.

3.2.1 Level C1

We first report the results for the group of C1 assessors. Four Pearson r values were computed to assess the relationship between the four data pairs. In this group, only the perceived fluency values of the C1 subjects were taken into account. The Pearson r values were also tested for statistical significance. This data is shown in Table 2.

Table 2. Pearson r results for group C1

Pair         r         p-value
R_wps_pf     0.500     p < 0.001
R_wc_pf      0.339     p < 0.001
R_icpc_pf   -0.069     p = 0.437
R_icpd_pf   -0.410     p < 0.001

In the first pair of data sets, which consisted of words per second and perceived fluency, the Pearson r suggests a significant positive relationship (r = 0.500, p < 0.001). In the second pair, which consisted of the wordcount in both cues per word and perceived fluency, the Pearson r suggests a significant positive relationship (r = 0.339, p < 0.001). In the third pair, which consisted of the total number of inside cue pauses and perceived fluency, the Pearson r suggests no significant relationship (r = -0.069, p = 0.437). In the fourth pair, which consisted of the total duration of pauses inside both cues per word and perceived fluency, the Pearson r suggests a significant negative relationship (r = -0.410, p < 0.001).

A multiple linear regression was calculated to predict perceived fluency based on words per second, wordcount, and pause duration. Pause count was omitted, as it did not seem to have an effect on perceived fluency based on the correlation result above. A significant regression model was found (F(3,126) = 29.793, p < 0.001), with an R2 of 0.415. The regression coefficients are shown in Table 3. Predicted perceived fluency increased by 0.049 for each word per second and by 0.014 for each word, and decreased by 0.046 for each second of total pause duration. Wordcount and pause duration were significant predictors of perceived fluency, whereas words per second did not reach significance in this group (p = 0.152).

Table 3. Results of multiple linear regression calculation in group C1 (R Square = 0.415)

            Coef      t Stat    P-value
Intercept    0.238     2.772    0.006
wps          0.049     1.441    0.152
wordcount    0.014     5.856    < 0.001
icp_dur     -0.046    -3.936    < 0.001

3.2.2 Level B2

The second set of analyses was conducted on the B2 group. Four Pearson r values were computed to assess the relationship between the data pairs. In this group, only the perceived fluency values of the B2 subjects were taken into account. The Pearson r values were also tested for statistical significance. This data is shown in Table 4.

Table 4. Pearson r results for group B2

Pair         r         p-value
R_wps_pf     0.487     p < 0.001
R_wc_pf      0.257     p < 0.003
R_icpc_pf    0.019     p = 0.828
R_icpd_pf   -0.408     p < 0.001

In the first pair of data sets, which consisted of words per second and perceived fluency, the Pearson r suggests a significant positive relationship (r = 0.487, p < 0.001). In the second pair, which consisted of the wordcount in both cues per word and perceived fluency, the Pearson r suggests a significant positive relationship (r = 0.257, p < 0.003). In the third pair, which consisted of the total number of inside cue pauses and perceived fluency, the Pearson r suggests no significant relationship (r = 0.019, p = 0.828). In the fourth pair, which consisted of the total duration of pauses inside both cues per word and perceived fluency, the Pearson r suggests a significant negative relationship (r = -0.408, p < 0.001).

A multiple linear regression was calculated to predict perceived fluency based on words per second, wordcount, and pause duration. Pause count was omitted, as it did not seem to have an effect on perceived fluency based on the correlation result above. A significant regression model was found (F(3,126) = 21.742, p < 0.001), with an R2 of 0.341. The regression coefficients are shown in Table 5. Predicted perceived fluency increased by 0.071 for each word per second and by 0.013 for each word, and decreased by 0.046 for each second of total pause duration. Wordcount and pause duration were significant predictors of perceived fluency, whereas words per second did not reach significance in this group (p = 0.092).

Table 5. Results of multiple linear regression calculation in group B2 (R Square = 0.341)

            Coef      t Stat    P-value
Intercept    0.276     2.598    0.011
wps          0.071     1.697    0.092
wordcount    0.013     4.284    < 0.001
icp_dur     -0.046    -3.192    0.002

3.2.3 Level B1

The final group of assessors is the B1 group. The relationship between the four data pairs was assessed with four Pearson r values. These values were also tested for statistical significance. All the data belonging to the B1 group is shown in Table 6.

Table 6. Pearson r results for group B1

Pair         r         p-value
R_wps_pf     0.579     p < 0.001
R_wc_pf      0.246     p < 0.003
R_icpc_pf   -0.104     p = 0.309
R_icpd_pf   -0.505     p < 0.001

In the first pair of data sets, which consisted of words per second and perceived fluency, the Pearson r suggests a significant positive relationship (r = 0.579, p < 0.001). In the second pair, which consisted of the wordcount in both cues per word and perceived fluency, the Pearson r suggests a significant positive relationship (r = 0.246, p < 0.003). In the third pair, which consisted of the total number of inside cue pauses and perceived fluency, the Pearson r suggests no significant relationship (r = -0.104, p = 0.309). In the fourth pair, which consisted of the total duration of pauses inside both cues per word and perceived fluency, the Pearson r suggests a significant negative relationship (r = -0.505, p < 0.001).

A multiple linear regression was calculated to predict perceived fluency based on words per second, wordcount, and pause duration. Pause count was omitted, as it did not seem to have an effect on perceived fluency based on the correlation result above. A significant regression model was found (F(3,126) = 35.438, p < 0.001), with an R2 of 0.458. The regression coefficients are shown in Table 7. Predicted perceived fluency increased by 0.079 for each word per second and by 0.013 for each word, and decreased by 0.050 for each second of total pause duration. All three measures were significant predictors of perceived fluency.

Table 7. Results of multiple linear regression calculation in group B1 (R Square = 0.458)

            Coef      t Stat    P-value
Intercept    0.239     2.697    0.008
wps          0.079     2.257    0.026
wordcount    0.013     5.032    < 0.001
icp_dur     -0.050    -4.152    < 0.001

4 Discussion

In this study we set out to search for a statistically significant correlation between perceived fluency and phonetic measures that would be observable across the data from all the speakers and also in groups consisting of assessors of the same proficiency level. We expected that the correlation would be better with all subjects taken into account as assessors, as opposed to using only assessors of certain proficiency groups. The rationale behind this expectation is that the more varied points of view we have on assessment, the more accurate the results will be.

The study found that some of the phonetic measures correlated with perceived fluency much more strongly in simple pair tests. One such measure is words per second. If we look purely at its relationship to perceived fluency, we see a moderately high positive correlation. However, this picture is incomplete, since a pairwise analysis does not take into account the relation with the other measures. The pause count showed no significant relationship. A probable cause is that the subjects were mainly tasked with guessing a word from the cues. Since they were probably more focused on the message, the number of pauses did not seem to play a role. They started noticing the pauses only when their duration was too long.

Even though the Pearson r showed a weaker correlation for some measures in the initial analyses, the picture changed once multiple linear regression was used. This analysis measured the significance of all the measures in relation to perceived fluency at the same time and not only in individual pairs. Its results showed a different picture of the significance of the measures. The most prominent was the wordcount with its positive relationship, the second was the duration of pauses with a negative relationship, and words per second was third with a positive relationship.

The same ordering of measures was also observed in the group phase of the analyses. The speakers were divided into groups based on their proficiency levels, and in each group only the fluency assessments of its members were taken into account. We saw a change in the strength of correlation of all the pairs in all the groups. For example, pair one, which is the words per second and perceived fluency pair, had a different value in each of the groups. This difference is easily observed between the B2 pair one (r = 0.487) and the B1 pair one (r = 0.579). Such differences were observed across all the pairs and suggest that each proficiency level evaluates fluency based on different criteria.

The study showed that the best correlating data was observed when all speakers were used as assessors. This suggests that the aforementioned differences in pair correlations are evened out. The better correlation is partially also a result of the higher number of assessors.

Acknowledgment

This work was funded by the Slovak Scientific Grant Agency VEGA "Automatic assessment of acute stress from speech", grant No. 2/0161/18 and also by University Grant Agency UGA "Manipulation of acoustic signal of speech for improvement of fluency in a foreign language and targeted reduction of mother tongue interference", grant No. I-19-208-02.

References

[1] N. H. De Jong and T. Wempe, "Praat script to detect syllable nuclei and measure speech rate automatically," Behavior Research Methods 41, pp. 385-390, 2009.
[2] H. Kallio, A. Suni, P. Virkkunen, and J. Šimko, "Prominence-based evaluation of L2 prosody," Interspeech 2018, pp. 1838-1842, 2018.
[3] V. Ramanarayanan, P. Lange, K. Evanini, H. Molloy, and D. Suendermann-Oeft, "Human and automated scoring of fluency, pronunciation and intonation during human–machine spoken dialog interactions," Interspeech 2017, pp. 1711-1715, 2017.
[4] H. R. Bosker, H. Quené, T. Sanders, and N. H. De Jong, "The perception of fluency in native and non-native speech," Language Learning 64 (3), pp. 579-614, 2014.
[5] J. Kormos and M. Dénes, "Exploring measures and perceptions of fluency in the speech of second language learners," System 32 (2), pp. 145-164, 2004.
[6] T. Rasinski, "The Fluent Reader: Oral Reading Strategies for Building Word Recognition, Fluency, and Comprehension," Scholastic Inc., 2003.
[7] A. Hasselgreen, "Testing the Spoken English of Young Norwegians: A Study of Testing Validity and the Role of Smallwords in Contributing to Pupils' Fluency," Cambridge University Press, 2005.
[8] Z. Breznitz, "Fluency in Reading: Synchronization of Processes," Routledge, 2006.
[9] J. B. Gilquin and S. De Cock, "Errors and Disfluencies in Spoken Corpora," John Benjamins Publishing, 2013.
[10] A. Khateb and I. Bar-Kochva, "Reading Fluency: Current Insights from Neurocognitive Research and Intervention Studies," Springer, 2016.
[11] P. Kendale, "WORKBOOK for Spoken English Fluency Development – 4," Independently Published, 2017.
[12] L. Wang, J. Zhang, F. Pan, B. Dong, and Y. Yan, "Automatic Fluency Assessment of Non/native English Reading," Journal of Convergence Information Technology 7, pp. 636-642, 2012.
[13] P. Lennon, "Investigating fluency in EFL: A quantitative approach," Language Learning, vol. 40, pp. 387-417, 1990.
[14] H. Riggenbach, "Toward an understanding of fluency: A microanalysis of non-native speaker conversations," Discourse Processes, vol. 14, pp. 423-441, 1991.
[15] J. Kormos, "Speech production and second language acquisition," Lawrence Erlbaum Associates, 2006.
[16] P. Lennon, "The lexical element in spoken second language fluency," in H. Riggenbach (Ed.), Perspectives on fluency, Ann Arbor: University of Michigan Press, pp. 25-42, 2000.
[17] N. Segalowitz, "Cognitive bases of second language fluency," New York: Routledge, 2010.
[18] N. H. De Jong et al., "Facets of Speaking Proficiency," Studies in Second Language Acquisition, vol. 34 (1), pp. 5-34, 2010.