Crowdsourcing for Research on Automatic Speech Recognition-enabled CALL

Catia Cucchiarini1, Helmer Strik1,2,3
CLST1, CLS2, Donders3, Radboud University, Nijmegen, The Netherlands
{C.Cucchiarini, W.Strik}@let.ru.nl

Abstract
Despite long-standing interest and recent innovative developments in ASR-based pronunciation instruction and CALL, there is still scepticism about the added value of ASR technology. In this paper we first review recent trends in pronunciation research and important requirements for pronunciation instruction. We go on to consider the difficulties involved in developing ASR-based systems for pronunciation instruction and the possible causes for the paucity of effectiveness studies in ASR-based CALL. We suggest that crowdsourcing could offer solutions for analyzing the large amounts of L2 speech that can be collected through ASR-based CALL applications and that are necessary for effectiveness studies. We provide a brief overview of our own research on ASR-based CALL and of the lessons we learned. Finally, we discuss possible future avenues for research and development.

Keywords: Computer Assisted Language Learning, Automatic Speech Recognition, Pronunciation Instruction, Crowdsourcing

1. Introduction
Speaking skills have always been considered particularly challenging in language teaching, because of the time and individual attention they require for practice and feedback. This has been one of the reasons for the sustained interest in using Automatic Speech Recognition (ASR) technology in CALL applications. ASR technology has been around for more than 30 years and its potential for CALL has been emphasized from the beginning, but ASR-based CALL systems have not really found their way into language teaching contexts. This might have to do with a variety of factors. The relatively high costs involved in the development of new applications or in the acquisition of some commercial products might have been a hurdle to large-scale adoption, while for some products that are available for free, privacy issues might have played a role.

However, there is also another possible explanation for the general reluctance to embrace ASR technology in CALL. As a matter of fact, there are relatively few studies that have thoroughly investigated the effectiveness of ASR-based CALL in real-life environments, under realistic conditions with real users. This also applies to pronunciation instruction and training, which is the topic that has received most attention in ASR-based research and development, because of its potential for both language learning and speech therapy applications.

In the remainder of this paper we discuss the difficulties involved in developing ASR-based systems for pronunciation instruction and possible causes for the paucity of effectiveness studies, and then consider possible solutions. In Section 2 we first discuss recent trends in pronunciation research and requirements for pronunciation instruction. We then consider important requirements for ASR-based CALL research in Section 3. Sections 4 and 5 provide a brief overview of our own research on ASR-based CALL and crowdsourcing, respectively. Discussion and conclusions are presented in Sections 6 and 7.

2. Pronunciation Instruction
In pronunciation research there are different views on what the aim of pronunciation instruction should be. According to the "nativeness principle" (Levis, 2005: 370), pronunciation instruction should help L2 learners lose any traces of their L1 accent in order to achieve a nativelike accent. The "intelligibility principle", on the other hand, holds the view that pronunciation instruction should help L2 learners achieve intelligibility in the L2, which should be possible even if traces of an L1 accent remain.

In line with this distinction, different constructs have been introduced in pronunciation research (Munro & Derwing, 1995a). Accent has been taken to refer to subjective judgments of the extent to which L2 speech is close to native speech and is usually expressed by scalar ratings. Intelligibility has been defined as the extent to which L2 speech can be correctly reproduced in terms of orthographic transcription (Munro & Derwing, 1995a). A third construct, comprehensibility, has been introduced to indicate the ease with which listeners understand L2 speech, again expressed through scalar ratings (Munro & Derwing, 1995a).

Research has shown that communication can be successful even in the presence of a non-native accent (Munro & Derwing, 1995b).
This, combined with the knowledge that achieving a nativelike accent is beyond reach for most language learners, has led pronunciation researchers to advocate a focus on intelligibility in pronunciation instruction as opposed to nativeness (Levis, 2005; 2007; Munro & Derwing, 2015).

3. Requirements for ASR-based pronunciation research
In line with these distinctions, pronunciation researchers are interested in research that investigates to what extent ASR-based pronunciation instruction contributes to improving constructs such as accent, intelligibility or comprehensibility of L2 learners. However, convincing evidence is lacking (Thomson & Derwing, 2015). Most of the research on ASR-based pronunciation training has been conducted offline on annotated speech corpora (Cucchiarini & Strik, 2017). In general, such studies evaluate the accuracy of specific algorithms (Stanley, Hacioglu, & Pellom, 2011; Qian, Meng & Soong, 2012; Lee, Zhang, & Glass, 2013) in identifying pronunciation errors or in grading L2 speech.

To investigate the effectiveness of ASR-based CALL, complete systems are needed, in which these algorithms are incorporated to provide speaking practice and feedback on the utterances produced by L2 learners under realistic conditions. In addition, a certain amount of learning content is needed so that learners can practice for a sufficient amount of time. It is this kind of longitudinal research that is needed to increase our understanding of the contribution of ASR-based CALL to pronunciation teaching and language learning in general. Unfortunately, there are not so many complete systems that employ ASR and that could be used in open, online effectiveness research in real-life conditions. This has to do with a series of difficulties (Cucchiarini & Strik, 2017). First of all, there is the limited availability of large corpora that can be used to develop, test and optimize the specific speech technology that is required for learning applications. Another difficulty is related to the nature of the expertise required, which is highly varied and interdisciplinary as it covers engineering, system design, pedagogy and language learning. This can also pose problems in finding the necessary funds for this type of cross-disciplinary research.

4. Our own research on ASR-based CALL
In our own research over the last twenty years we have pursued the goal of developing complete ASR-based CALL systems. This research has been conducted in close cooperation with speech technologists, language learning researchers and teachers. The aim was to develop systems that could be used to conduct more comprehensive research contributing insights to both speech technology and language learning research (Cucchiarini et al., 2009, 2011, 2014; Strik, 2012; Strik et al., 2012; Van Doremalen et al., 2010, 2013, 2016). An important aspect in this research was also how to boost user motivation, either by providing appealing, useful feedback (Bodnar et al., 2016, 2017; Cucchiarini et al., 2009; Penning de Vries et al., 2015, 2016, 2019) or by introducing gaming elements, see e.g. Figure 1 (Ganzeboom et al., 2016).

Figure 1: In "treasure hunters", serious gaming is used to motivate patients to practice for ASR-based speech therapy (Ganzeboom et al., 2016).

The more recent systems have been equipped with logging capabilities (Bodnar et al., 2017; Penning de Vries et al., 2016), so that they can collect huge amounts of speech data produced by L2 learners practicing with the system, while at the same time recording all system-user interactions. These logged data can provide useful knowledge on learners' progress, increasing our insights not only into the ultimate outcome of learning, but also into the processes that are conducive to learning.
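By way of illustration, the unit of logging in such systems can be thought of as one record per practice attempt, linking the recorded audio to the recognizer output and the feedback shown. The sketch below is a minimal, hypothetical schema; none of the field names are taken from the systems cited above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class InteractionLogEntry:
    """One logged system-user interaction in an ASR-based CALL
    session (hypothetical schema, for illustration only)."""
    learner_id: str       # pseudonymized learner identifier
    exercise_id: str      # prompt or exercise being practiced
    audio_path: str       # reference to the recorded utterance
    asr_hypothesis: str   # what the speech recognizer decoded
    feedback_shown: str   # feedback presented to the learner
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
```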
One of the problems we have encountered in this research is, however, how to process and analyze these large sets of speech data that are produced by language learners or patients during practice or therapy and that need to be scored and analyzed to study the effectiveness of ASR-based applications. To be able to provide information on learning and effectiveness, these data first of all need to be transcribed and/or scored, to obtain the subjective judgments necessary to measure the constructs mentioned above (accent, intelligibility, comprehensibility). This is extremely time-consuming and expensive. In fact, the amount of data is such that manual annotations are actually not feasible. A possible alternative solution to obtain annotations and scoring of vast amounts of speech data at relatively low costs would then seem to be to employ crowdsourcing, as will be explained in the next section.

5. Crowdsourcing for ASR-based CALL
In ASR-based CALL pronunciation research, crowdsourcing could play a more prominent role by providing transcriptions or intelligibility scores, which can in turn be used for effectiveness evaluation. In our own research, for example, we have used crowdsourcing to obtain evaluations of intelligibility of L2 learner speech (Burgos et al., 2015; Sanders et al., 2016) and pathological speech (Ganzeboom et al., 2016).

For the study described in Ganzeboom et al. (2016) an online listening experiment was carried out. Participants were invited by email or via Facebook. They filled in a questionnaire to gather some meta-information about native language, gender, age, etc. In total 36 listeners participated, 8 male and 28 female (age range 19-73), who rated 50 utterances on intelligibility in three ways:

• Likert scale: 1 (very low) to 7 (very high)
• Visual Analogue Scale (VAS): 0 (very low) to 100 (very high)
• Orthographic Transcription (Orthog. Transc.)

The latter was used to calculate three extra scores:

• OTW = Orthog. Transc. scored at Word level
• OTP = Orthog. Transc. scored at Phoneme level
• OTG = Orthog. Transc. scored at Grapheme level

VAS and Likert are intelligibility scores at utterance level and were calculated as scores representing a percentage (%) of intelligibility. The VAS scores were already on a 0-100 scale, while the scores on the 1-7 Likert scale were transformed to percentage scores by first subtracting 1 and then multiplying by 16.67 (i.e. 1 = 0%, 2 = 16.67%, 3 = 33.33%, ..., 7 = 100%).
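This rescaling is a simple linear map from the 1-7 range onto 0-100; a minimal sketch:

```python
def likert_to_percentage(score: int) -> float:
    """Rescale a 1-7 Likert intelligibility rating to 0-100:
    subtract 1, then multiply by 100/6 (approximately 16.67)."""
    if not 1 <= score <= 7:
        raise ValueError("Likert score must be between 1 and 7")
    return (score - 1) * 100 / 6

# likert_to_percentage(1) -> 0.0, likert_to_percentage(4) -> 50.0,
# likert_to_percentage(7) -> 100.0
```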
To obtain an intelligibility score at word level (OTW), we compared the raters' orthographic transcriptions to the reference transcriptions, counted the number of identical word matches and calculated a percentage correct score. Intelligibility scores at the grapheme and phoneme level (OTG and OTP, respectively) were automatically obtained from the orthographic transcriptions through the Algorithm for Dynamic Alignment of Phonetic Transcriptions (ADAPT) (Elffers et al., 2013), which computes the optimal alignment between two strings of phonetic symbols using a matrix that contains distances between the individual phonetic symbols. For the intelligibility scores at phoneme level (OTP), the orthographic transcriptions were converted to their phonemic equivalent using the canonical pronunciation variants from the lexicon of the Spoken Dutch Corpus (Oostdijk, 2000).
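The scoring code itself is not reproduced in this paper, but both ideas can be sketched. In the sketch below, otw_score is a simplified word-level percentage correct (position-by-position matching is our simplification, not necessarily how the study counted matches), and aligned_distance is a generic dynamic-programming alignment of the kind ADAPT performs; where ADAPT consults a matrix of distances between phonetic symbols, a flat substitution cost stands in here.

```python
from typing import Sequence

def otw_score(transcription: str, reference: str) -> float:
    """Word-level intelligibility (OTW): percentage of reference
    words reproduced by the rater. Position-by-position comparison
    is a simplifying assumption of this sketch."""
    hyp = transcription.lower().split()
    ref = reference.lower().split()
    if not ref:
        return 0.0
    matches = sum(h == r for h, r in zip(hyp, ref))
    return 100 * matches / len(ref)

def aligned_distance(a: Sequence[str], b: Sequence[str],
                     ins: float = 1.0, dele: float = 1.0,
                     sub: float = 1.0) -> float:
    """Cost of the optimal alignment between two symbol strings,
    computed by dynamic programming (a weighted Levenshtein
    distance). ADAPT uses per-pair phonetic distances for
    substitutions; the uniform cost `sub` stands in for that
    matrix here."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # delete a[i-1]
                          d[i][j - 1] + ins,       # insert b[j-1]
                          d[i - 1][j - 1] + cost)  # match/substitute
    return d[m][n]

# e.g. aligned_distance(list("kAt"), list("kat")) == 1.0
```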
Some results are presented in Table 1. For more details see Ganzeboom et al. (2016).

n = 50    M (SD)        VAS    OTW    OTP     OTG
Likert    63.1 (21.1)   .998   .733   -.763   -.773
VAS       63.2 (19.0)          .732   -.755   -.764
OTW       78.3 (16.1)                 -.805   -.869
OTP        8.0  (6.5)                          .954
OTG        8.9  (7.4)

Table 1: Means (SDs) and correlations of the five intelligibility measures (n = 50 speech fragments).

For Likert, VAS and OTW, higher scores correspond to higher intelligibility (a higher percentage correct); for OTP and OTG, lower scores correspond to lower distance and thus higher intelligibility. All correlations were significant (p < .01).

Important for research data in general, and especially for data obtained by means of crowdsourcing, is their reliability. In our study the reliability of each of the five intelligibility measures was calculated using Intraclass Correlation Coefficients (ICC) based on groups of raters. The ICC values for all 36 raters together were very high, ranging from .95 (OTP, OTG) to .97 (Likert, VAS, OTW). As such a large number of raters may not always be achievable, we also calculated average ICCs based on randomly selected smaller subsets of the data (e.g. 9 subsets of 4 raters, or 6 subsets of 6 raters). On average, for the utterance and word level scorings sufficient reliability is obtained with four raters (resulting in mean ICC values ranging from .79 to .84), while for subword scorings at least six raters are required (resulting in mean ICC values ranging from .79 to .80).
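The paper does not state which ICC variant was used; a common choice for a fixed set of utterances rated by a sample of raters is the two-way random-effects, average-measure coefficient, ICC(2,k) in Shrout and Fliess's terms. The sketch below computes that variant under this assumption:

```python
import numpy as np

def icc_2k(ratings: np.ndarray) -> float:
    """ICC(2,k): two-way random-effects, average-measure intraclass
    correlation (Shrout & Fleiss). `ratings` has shape
    (n_targets, k_raters): one row per rated utterance.
    Assumed variant; the study may have used another one."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()  # utterances
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()  # raters
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)

# The subset analysis described above can be mimicked by averaging
# icc_2k over random rater subsets, e.g.:
# idx = np.random.choice(ratings.shape[1], size=4, replace=False)
# icc_subset = icc_2k(ratings[:, idx])
```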
In the L2 speech crowdsourcing experiment Palabras (see Figure 2), a web application, accessible via Facebook, was developed for obtaining transcriptions of Dutch words spoken by Spanish L2 learners. Participants would listen and write down what they heard. Different types of feedback were provided, like the percentage correct, the words still to transcribe and the majority transcription (Sanders et al., 2016).

Figure 2: Crowdsourcing experiment Palabras. At the end, participants can share their final score on Facebook.

Also in this case the quality of the data was checked by applying filters to remove transcribers who did not conform to our quality criteria (those with native languages other than Dutch, those who did not reach our threshold of intra- and inter-transcriber agreement, and those who entered more than once when the server was slow to respond). In total, useful data were obtained from 159 participants, which is definitely more than would have been the case with traditional experiments.
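The filtering step described above lends itself to a simple pipeline; the sketch below illustrates it with hypothetical field names and a placeholder agreement threshold (the actual criteria and values are those reported in Sanders et al., 2016, not these):

```python
from dataclasses import dataclass

@dataclass
class Transcriber:
    participant_id: str
    native_language: str
    intra_agreement: float  # agreement on a transcriber's repeated items
    inter_agreement: float  # agreement with the other transcribers
    n_entries: int          # >1 indicates duplicate participation

def passes_quality_filters(t: Transcriber,
                           min_agreement: float = 0.7) -> bool:
    """Apply the three exclusion criteria described above; the 0.7
    threshold is a placeholder, not the value used in the study."""
    return (t.native_language == "Dutch"
            and t.intra_agreement >= min_agreement
            and t.inter_agreement >= min_agreement
            and t.n_entries == 1)

# kept = [t for t in transcribers if passes_quality_filters(t)]
```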
6. Discussion
So far crowdsourcing has been mainly used to produce language resources like learner speech corpora (Eskenazi et al., 2013), to obtain speech recordings with annotations (Loukina et al., 2015a, b), or to collect more complex and realistic speech data such as dialogues through conversational technologies (Sydorenko et al., 2018). The experiences described in Section 5 would seem to be good reasons for extending the use of crowdsourcing to the larger sets of data that are obtained through the loggings in ASR-based CALL systems. These would constitute an enormously rich source of information for improving both the technology and the learning systems. In addition, these annotated data and speech files could be used to further train and adapt the algorithms employed in the system and thus to enhance the quality of the ASR technology.

This approach could be extended to ASR-based CALL that addresses other aspects of L2 speaking, to obtain annotations of learner speaking performance, evaluations of L2 proficiency in grammar and vocabulary, or evaluations of turn-taking abilities, pragmatic competence, politeness strategies and formulaic language in spoken dialogue applications. An additional solution could be so-called implicit crowdsourcing, which could be applied by collecting additional speech data and subjective evaluations when users engage with ASR-based CALL systems. In other words, in this case the users of CALL systems would form the crowd.

There are some important caveats to be taken into account, though. First of all, the GDPR puts limitations on using spoken data in crowdsourcing, as speech data are by definition sensitive data: speech intrinsically contains information on identity and other personal features. Speech corpora therefore often impose restrictions on making speech fragments audible to the public. In any case, prior explicit consent has to be obtained for employing user data for research and development purposes. Finally, the reliability of the subjective data obtained through crowdsourcing has to be checked before these data are used for further research.

7. Conclusions
ASR-based CALL applications hold great potential for innovative research on language learning and future developments for language teaching. Effectiveness studies could help clarify their added value, but so far these studies have been few and far between, among other things because they require subjective judgments of large amounts of L2 speech. Crowdsourcing can be usefully applied for this purpose. For the two crowdsourcing initiatives described in Section 5, the results were satisfactory, as larger sets of data could be annotated and scored than would have been the case with traditional experiments. In turn these data provided useful insights into important aspects of intelligibility scoring measures with different degrees of granularity. To conclude, there seem to be good reasons for extending this approach to ASR-based CALL that addresses other aspects of L2 speaking, to obtain much-wanted subjective annotations and evaluations of learner speaking performance.
Acoustical Society of America, 134, 1336-1347. Applied Linguistics, 36(3), 326–344. EnetCollect WG3 & WG5 Meeting, 24-25 October 2018, Leiden, Netherlands 35