  Crowdsourcing for Research on Automatic Speech Recognition-enabled CALL
                                         Catia Cucchiarini¹, Helmer Strik¹,²,³
                                                  CLST¹, CLS², Donders³
                                        Radboud University, Nijmegen, The Netherlands
                                             {C.Cucchiarini, W.Strik}@let.ru.nl

Despite long-standing interest and recent innovative developments in ASR-based pronunciation instruction and CALL, there is still
scepticism about the added value of ASR technology. In this paper we first review recent trends in pronunciation research and important
requirements for pronunciation instruction. We go on to consider the difficulties involved in developing ASR-based systems for
pronunciation instruction and the possible causes for the paucity of effectiveness studies in ASR-based CALL. We suggest that
crowdsourcing could offer solutions for analyzing the large amounts of L2 speech that can be collected through ASR-based CALL
applications and that are necessary for effectiveness studies. We provide a brief overview of our own research on ASR-based CALL and
of the lessons we learned. Finally, we discuss possible future avenues for research and development.

Keywords: Computer Assisted Language Learning, Automatic Speech Recognition, Pronunciation Instruction, Crowdsourcing

1. Introduction
Speaking skills have always been considered particularly challenging in language teaching, because of
the time and individual attention they require for practice and feedback. This has been one of the
reasons for the sustained interest in using Automatic Speech Recognition (ASR) technology in CALL
applications. ASR technology has been around for more than 30 years and its potential for CALL has been
emphasized from the beginning, but ASR-based CALL systems have not really found their way into language
teaching contexts. This might have to do with a variety of factors. The relatively high costs involved
in the development of new applications or in the acquisition of some commercial products might have
been a hurdle to large-scale adoption, while for some products that are available for free, privacy
issues might have played a role. However, there is also another possible explanation for the general
reluctance to embrace ASR technology in CALL. As a matter of fact, there are relatively few studies
that have thoroughly investigated the effectiveness of ASR-based CALL in real-life environments, under
realistic conditions with real users. This also applies to pronunciation instruction and training,
which is the topic that has received most attention in ASR-based research and development, because of
its potential for both language learning and speech therapy applications.

In the remainder of this paper we discuss the difficulties involved in developing ASR-based systems for
pronunciation instruction, possible causes for the paucity of effectiveness studies and then consider
possible solutions. In Section 2 we first discuss recent trends in pronunciation research and
requirements for pronunciation instruction. We then consider important requirements for ASR-based CALL
research in Section 3. Sections 4 and 5 provide a brief overview of our own research on ASR-based CALL
and crowdsourcing, respectively. Discussion and conclusions are presented in Sections 6 and 7.

2. Pronunciation Instruction
In pronunciation research there are different views on what the aim of pronunciation instruction should
be. According to the “nativeness principle” (Levis, 2005: 370), pronunciation instruction should help
L2 learners lose any traces of their L1 accent in order to achieve a nativelike accent. The
“intelligibility principle”, on the other hand, holds the view that pronunciation instruction should
help L2 learners achieve intelligibility in the L2, which should be possible even if traces of an L1
accent remain. In line with this distinction, different constructs have been introduced in
pronunciation research (Munro & Derwing, 1995a). Accent has been taken to refer to subjective judgments
of the extent to which L2 speech is close to native speech and is usually expressed by scalar ratings.
Intelligibility has been defined as the extent to which L2 speech can be correctly reproduced in terms
of orthographic transcription (Munro & Derwing, 1995a). A third construct, comprehensibility, has been
introduced to indicate the ease with which listeners understand L2 speech, again expressed through
scalar ratings (Munro & Derwing, 1995a). Research has shown that communication can be successful even
in the presence of a non-native accent (Munro & Derwing, 1995b). This, combined with the knowledge that
achieving a nativelike accent is beyond reach for most language learners, has led pronunciation
researchers to advocate a focus on intelligibility in pronunciation instruction as opposed to
nativeness (Levis, 2005; 2007; Munro & Derwing, 2015).

3. Requirements for ASR-based pronunciation research
In line with these distinctions, pronunciation researchers are interested in research that investigates
to what extent ASR-based pronunciation instruction contributes to improving constructs such as accent,
intelligibility or comprehensibility of L2 learners. However, convincing evidence is lacking (Thomson &
Derwing, 2015). Most of the research on ASR-based pronunciation training has been conducted offline on
annotated speech corpora (Cucchiarini & Strik, 2017). In general, such studies evaluate the accuracy of
specific algorithms (Stanley, Hacioglu, & Pellom, 2011; Qian, Meng, & Soong, 2012; Lee, Zhang, & Glass,
2013) in identifying pronunciation errors or in grading L2 speech. To investigate the effectiveness of
ASR-based CALL, complete systems are needed, in which these algorithms are incorporated to provide
speaking practice and feedback on the utterances produced by L2 learners under realistic conditions. In
addition, a certain amount of learning content is needed so that learners can practice for a sufficient
amount of time. It is the kind of
longitudinal research that is needed to increase our understanding of the contribution of ASR-based
CALL to pronunciation teaching and language learning in general. Unfortunately, there are not many
complete systems that employ ASR and that could be used in open, online effectiveness research under
real-life conditions. This has to do with a series of difficulties (Cucchiarini & Strik, 2017). The
first is the limited availability of large corpora that can be used to develop, test and optimize the
specific speech technology that is required for learning applications. Another difficulty is related to
the nature of the expertise required, which is highly varied and interdisciplinary as it covers
engineering, system design, pedagogy and language learning. This can also pose problems in finding the
necessary funds for this type of cross-disciplinary research.

4. Our own research on ASR-based CALL
In our own research over the last twenty years we have pursued the goal of developing complete
ASR-based CALL systems. This research has been conducted in close cooperation with speech
technologists, language learning researchers and teachers. The aim was to develop systems that could be
used to conduct more comprehensive research contributing insights to both speech technology and
language learning research (Cucchiarini et al., 2009, 2011, 2014; Strik, 2012; Strik et al., 2012; Van
Doremalen et al., 2010, 2013, 2016). An important aspect in this research was also how to boost user
motivation either by providing appealing, useful feedback (Bodnar et al., 2016, 2017; Cucchiarini et
al., 2009; Penning de Vries et al., 2015, 2016, 2019) or by introducing gaming elements, see e.g.
Figure 1 (Ganzeboom et al., 2016).

The more recent systems have been equipped with logging capabilities (Bodnar et al., 2017; Penning de
Vries et al., 2016), so that they can collect huge amounts of speech data produced by L2 learners
practicing with the system, while at the same time recording all system-user interactions. These logged
data can provide useful knowledge on learners’ progress, increasing our insights not only into the
ultimate outcome of learning, but also into the processes that are conducive to learning.

One of the problems we have encountered in this research is, however, how to process and analyze these
large sets of speech data that are produced by language learners or patients during practice or therapy
and that need to be scored and analyzed to study the effectiveness of ASR-based applications. To be
able to provide information on learning and effectiveness, these data need first of all to be
transcribed and/or scored, to obtain the subjective judgments necessary to measure the constructs
mentioned above (accent, intelligibility, comprehensibility). This is extremely time-consuming and
expensive. In fact, the amount of data is such that manual annotations are actually not feasible. A
possible alternative solution to obtain annotations and scoring of vast amounts of speech data at
relatively low costs would then seem to be to employ crowdsourcing, as will be explained in the next
section.

5. Crowdsourcing for ASR-based CALL
In ASR-based CALL pronunciation research crowdsourcing could play a more prominent role by providing
transcriptions or intelligibility scores, which can in turn be used for effectiveness evaluation. In
our own research, for example, we have used crowdsourcing to obtain evaluations of intelligibility of
L2 learner speech (Burgos et al., 2015; Sanders et al., 2016) and pathological speech (Ganzeboom et
al., 2016).

For the study described in Ganzeboom et al. (2016) an online listening experiment was carried out.
Participants were invited by email or via Facebook. They filled in a questionnaire to gather some
meta-information about native language, gender, age, etc. In total 36 listeners participated, 8 male
and 28 female (age range 19-73), who rated 50 utterances on intelligibility in three ways:
  - Likert: from 1 (very low) to 7 (very high)
  - Visual Analogue Scale (VAS): from 0 (very low) to 100 (very high)
  - Orthographic Transcription (Orthog. Transc.)
The latter was used to calculate three extra scores:
  - OTW = Orthog. Transc. scored at Word level
  - OTP = Orthog. Transc. scored at Phoneme level
  - OTG = Orthog. Transc. scored at Grapheme level

VAS and Likert are intelligibility scores on utterance level and were calculated as scores representing
a percentage (%) of intelligibility. The VAS scores were already on a 0-100 scale, while the scores on
the 1-7 Likert scale were transformed to percentage scores by first subtracting 1 and then multiplying
by 16.67 (i.e. 1=0%, 2=16.67%, 3=33.33%, ..., 7=100%).
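
The Likert transformation described above can be sketched in a few lines. This is a minimal illustration, not the scoring script used in the study: the function name likert_to_percent is hypothetical, and the exact factor 100/6 is used in place of the rounded 16.67 so that a rating of 7 maps to exactly 100%.

```python
def likert_to_percent(score: int, low: int = 1, high: int = 7) -> float:
    """Rescale a Likert rating to a 0-100 percentage.

    Subtract the scale minimum, then multiply by 100 / (high - low).
    For a 1-7 scale the factor is 100 / 6, i.e. roughly 16.67.
    """
    if not low <= score <= high:
        raise ValueError(f"rating {score} is outside the {low}-{high} scale")
    return (score - low) * 100.0 / (high - low)


# Illustrative ratings only (not data from the study).
ratings = [1, 2, 4, 7]
percentages = [round(likert_to_percent(r), 2) for r in ratings]
print(percentages)
```

After this rescaling, Likert and VAS scores share the same 0-100 range and can be compared directly.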

Figure 1: In “treasure hunters”, serious gaming is used to motivate patients to practice for ASR-based
speech therapy (Ganzeboom et al., 2016).

To obtain an intelligibility score at word level (OTW), we compared the raters’ orthographic
transcriptions to the reference transcriptions, counted the number of identical word matches and
calculated a percentage correct score.
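
A word-level score of this kind could be computed as sketched below. Several details are assumptions, because the text does not specify them: matching ignores word order, case and punctuation; the percentage is taken over the number of words in the reference; and the function names and the Dutch example sentence are invented for illustration.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def otw_score(transcription: str, reference: str) -> float:
    """Percentage of reference words that also occur in the transcription.

    Assumption: words are matched as multisets (order is ignored), so a word
    that occurs twice in the reference must be transcribed twice to count twice.
    """
    ref_words = normalize(reference)
    if not ref_words:
        raise ValueError("reference transcription is empty")
    matches = sum((Counter(normalize(transcription)) & Counter(ref_words)).values())
    return 100.0 * matches / len(ref_words)


# Made-up example: the rater missed the second "de".
reference = "de kat zit op de mat"
rater = "De kat zit op mat"
print(round(otw_score(rater, reference), 1))
```

A position-sensitive variant (for example, aligning the two word sequences first) would penalize word-order errors as well; which variant is appropriate depends on the scoring protocol.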

Intelligibility scores at the grapheme and phoneme level (OTG and OTP, resp.) were automatically
obtained from the orthographic transcriptions through the Algorithm for Dynamic Alignment of Phonetic
Transcriptions (ADAPT) (Elffers et al., 2013), which computes the optimal alignment between two strings
of phonetic symbols using a matrix that contains distances between the individual phonetic symbols. For
the intelligibility scores on phoneme level (OTP), the orthographic transcriptions were converted to
their phonemic equivalent using the canonical pronunciation variants from the lexicon of the Spoken
Dutch Corpus (Oostdijk, 2000). Some results are presented in Table 1. For more details see Ganzeboom et
al. (2016).

 n = 50    M (SD)         VAS      OTW      OTP      OTG
 Likert    63.1 (21.1)    .998     .733     -.763    -.773
 VAS       63.2                    .732     -.755    -.764
 OTW       78.3                             -.805    -.869
 OTP       8.0 (6.5)                                 .954
 OTG       8.9 (7.4)

Table 1: Means (SDs) and correlations of the five intelligibility measures (n = 50 speech fragments).

For Likert, VAS and OTW, higher scores correspond to higher intelligibility (higher percentage
correct); for OTP and OTG lower scores correspond to lower distance and thus higher intelligibility.
All correlations were significant (p < .01).

Important for research data in general, and especially for data obtained by means of crowdsourcing, is
their reliability. In our study the reliability of each of the five intelligibility measures was
calculated using Intraclass Correlation Coefficients (ICC) based on groups of raters. The ICC values
for all 36 raters together were very high, ranging from .95 (OTP, OTG) to .97 (Likert, VAS, OTW). As
such a large number of raters may not always be achievable, we also calculated average ICCs based on
randomly selected smaller subsets of the data (e.g. 9 subsets of 4 raters, or 6 of 6 raters). On
average, for the utterance and word level scorings sufficient reliability is obtained with four raters
(resulting in mean ICC values ranging from .79 to .84), while for subword scorings at least six raters
are required (resulting in mean ICC values ranging from .79 to .80).

In the L2 speech crowdsourcing experiment Palabras (see Figure 2), a web application accessible via
Facebook was developed for obtaining transcriptions of Dutch words spoken by Spanish L2 learners.
Participants would listen and write down what they heard. Different types of feedback were provided,
like percentage correct, words still to transcribe and the majority transcription (Sanders et al.,
2016).

Also in this case the quality of the data was checked by applying filters to remove transcribers who
did not conform to our quality criteria (with other native languages than Dutch, who did not reach our
threshold of intra- and inter-transcriber agreement, who entered more than once when the server was
slow in response). In total useful data were obtained from 159 participants, which is definitely more
than would have been the case with traditional experiments.

6. Discussion
So far crowdsourcing has been mainly used to produce language resources like learner speech corpora
(Eskenazi et al., 2013), to obtain speech recordings with annotations (Loukina et al., 2015a, b), or to
collect more complex and realistic speech data such as dialogues through conversational technologies
(Sydorenko et al., 2018).

The experiences described in Section 5 would seem to be good reasons for extending the use of
crowdsourcing to the larger sets of data that are obtained through the loggings in ASR-based CALL
systems. These would constitute an enormously rich source of information for improving both the
technology and the learning systems. In addition, these annotated data and speech files could be used
to further train and adapt the algorithms employed in the system and thus to enhance the quality of the
ASR technology.

This approach could be extended to ASR-based CALL that addresses other aspects of L2 speaking to obtain
annotations of learner speaking performance, evaluations of L2 proficiency in grammar and vocabulary or
of turn-taking abilities, pragmatic competence, politeness strategies and formulaic language in spoken
dialogue applications. An additional solution could be so-called implicit crowdsourcing, which could be
applied by collecting additional speech data and subjective evaluations when users engage with
ASR-based CALL systems. In other words, in this case the users of CALL systems would form the crowd.
There are some important caveats to be taken into account, though. First of all, GDPR places
limitations on using spoken data in crowdsourcing as speech data are by definition sensitive data.
Speech intrinsically contains information on identity and other personal features. Speech corpora often
impose restrictions on making speech fragments audible to the public. In any case prior explicit
consent has to be obtained for employing user data for research and development purposes. Finally, the
reliability of the subjective data obtained through crowdsourcing has to be checked before these data
are used for further research.
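
One way to carry out such a reliability check is to follow the ICC analysis of Section 5: compute an ICC over all raters, then average ICCs over smaller disjoint groups of raters. The sketch below rests on assumptions that are not in the source. The paper does not state which ICC variant was used, so ICC(2,k) (two-way random effects, absolute agreement, average of k raters) is chosen here purely for illustration. The ratings are simulated, not data from the study, and the function names icc2k and mean_subset_icc are hypothetical.

```python
import random
from statistics import mean


def icc2k(ratings: list[list[float]]) -> float:
    """ICC(2,k) for a table with one row per utterance and one column per rater."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    # Two-way ANOVA sums of squares: utterances (rows), raters (columns), total.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)


def mean_subset_icc(ratings: list[list[float]], group_size: int,
                    rng: random.Random) -> float:
    """Shuffle the raters, split them into disjoint groups, average the group ICCs."""
    k = len(ratings[0])
    raters = list(range(k))
    rng.shuffle(raters)
    groups = [raters[i:i + group_size]
              for i in range(0, k - group_size + 1, group_size)]
    return mean(icc2k([[row[j] for j in group] for row in ratings])
                for group in groups)


# Simulated data (assumption): 50 utterances, 36 raters, each rating equal to a
# per-utterance "true" score plus independent rater noise.
rng = random.Random(0)
true_scores = [rng.uniform(20, 95) for _ in range(50)]
ratings = [[t + rng.gauss(0, 15) for _ in range(36)] for t in true_scores]

icc_all = icc2k(ratings)
icc_4 = mean_subset_icc(ratings, 4, rng)  # 9 groups of 4 raters
icc_6 = mean_subset_icc(ratings, 6, rng)  # 6 groups of 6 raters
print(f"all raters: {icc_all:.2f}, groups of 4: {icc_4:.2f}, groups of 6: {icc_6:.2f}")
```

Comparing the group averages against a chosen reliability threshold gives a rough indication of how many crowd raters per item are needed before the scores are used in further analyses.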

Figure 2: Crowdsourcing experiment Palabras. At the end, participants can share their final score on
Facebook.

7. Conclusions
ASR-based CALL applications hold great potential for innovative research on language learning and
future developments for language teaching. Effectiveness studies could help clarify their added value,
but so far these studies have been few and far between, among other things because they require
subjective judgments of large amounts of L2 speech. Crowdsourcing can be usefully applied for this
purpose. For the two crowdsourcing initiatives described in Section 5, the results were satisfactory as
larger sets of data could be annotated and scored than would have been the case with traditional
experiments. In turn these data provided useful insights into important aspects of intelligibility
scoring measures with different degrees of granularity. To conclude, there seem to be good reasons for
extending this approach to ASR-based CALL that addresses other aspects of L2 speaking to obtain
much-wanted subjective annotations and evaluations of learner speaking performance.

  Detailed Scores. In: Proceedings of Interspeech 2016, pp. 2503-2507; San Francisco, CA, USA.
Hu, W., Qian, Y., Soong, F.K., Wang, Y. (2015). Improved mispronunciation detection with deep neural
  network trained acoustic models and transfer learning based logistic regression classifiers. Speech
  Communication, 67, 154-166.
Lee, A., Zhang, Y., & Glass, J. (2013). Mispronunciation detection via Dynamic Time Warping on Deep
  Belief Network-based posteriorgrams. Proceedings ICASSP 2013, Vancouver, BC, 8227–8231.
Levis, J.M. (2005). Changing contexts and shifting paradigms in pronunciation teaching. TESOL Quarterly,
