Design Issues in Language Learning Based on Crowdsourcing: The Critical Role of Gameful Corrective Feedback

Frederik Cornillie
KU Leuven – ITEC, also at imec
KU Leuven campus Kulak Kortrijk, Etienne Sabbelaan 51, 8500 Kortrijk, Belgium
frederik.cornillie@kuleuven.be

Abstract
Crowdsourcing has revolutionized the software market, affecting the quality, adoption and business models of consumer software applications in many domains of human behaviour. In language learning, however, its impact is still to be seen. Through the lens of the commercial application Duolingo as well as the research prototype DialogDungeon, this paper discusses corrective feedback, a design feature of (technology-enhanced) language learning environments that can be a key driver for both learning success and platform adoption, and which will equally need to be considered in the design of language learning based on crowdsourcing. We address this topic from the literature at the intersection of second language (L2) acquisition, computer-assisted language learning (CALL), human motivation, and gamification. We conclude with a call for collaboration between educators, L2 acquisition researchers and developers of crowdsourcing-based applications.

Keywords: digital game-based language instruction, corrective feedback, crowdsourcing

EnetCollect WG3 & WG5 Meeting, 24-25 October 2018, Leiden, Netherlands

1. Crowdsourcing and corrective feedback in Duolingo

As a result of the Web 2.0 revolution, crowdsourcing has had a tremendous impact on the quality and adoption of many consumer software applications. Much more slowly, crowdsourcing is finding its way into research on language learning (e.g. Keuleers, Stevens, Mandera, & Brysbaert, 2015) and – arguably less effectively – into online language learning applications.

The currently most popular commercial example is the gamified language learning application Duolingo, with 25 million monthly active users (Lardinois, 2018). Originally designed as a project to translate the web into every major language (von Ahn, 2013), Duolingo is not undisputed on a pedagogical level because of its behaviourist approach to second language (L2) learning (Reinhardt, 2017; Teske, 2017; for related discussion see Cornillie & Desmet, 2016). However, its use of crowdsourcing may be useful in the L2 learning process.

On the one hand, implicit crowdsourcing of learner responses in Duolingo exercises can serve to improve the language models and learner modelling modules that among other things provide automated corrective feedback, a feature of (online) language learning environments that can be very effective when considered carefully in the instructional design process (see e.g. the meta-analysis of Li, 2010). In 2018, Duolingo organized a shared task on second language acquisition modelling, in conjunction with the 13th workshop on the innovative use of natural language processing for building educational applications (BEA) (Settles, Brust, Gustafson, Hagiwara, & Madnani, 2018). For this shared task, the company released a dataset comprising log files from millions of exercises completed by thousands of students during their first 30 days of learning on Duolingo. The goal for participants of the BEA workshop was to predict what mistakes each learner would make in the future, with a view to improving personalized instruction in the application. This shared task shows that Duolingo are actively working on leveraging state-of-the-art machine learning and psychometric techniques to improve their learner modelling and feedback generation. From a cognitive perspective on L2 learning, this is a valuable evolution, when we consider that the effectiveness of corrective feedback depends to a great extent on individual differences (Sheen, 2011).

On the other hand, the language learning platform also involves its users in explicit crowdsourcing. For instance, learners can request that the system accept their alternative responses, they can indicate that the language in the exercises sounds unnatural or contains mistakes, or they can discuss solutions with their peers on an online forum (see Figure 1). These activities can recruit language awareness both individually and in interaction with other L2 users, which is equally relevant in the L2 learning process, particularly from a (socio-)constructivist point of view (for an illustration of this approach, see Ai, 2017).

Figure 1: explicit crowdsourcing in Duolingo

In addition to optimizing their platform through crowdsourcing, Duolingo have disclosed their interest in putting crowdsourcing to use in order to investigate L2 learning processes. Luis von Ahn, creator of Duolingo, stated that their data-driven approach and online experiments at scale can figure out "which students pick up the new concept and when", and that they can do this a lot faster than "the offline education system" (Gannes, 2014). With "the offline education system", von Ahn seems to hint at the research field of L2 acquisition. Many L2 researchers and other educational scientists will agree that this bold claim is rather simplistic – in a highly controlled environment inspired by behaviourist models of L2 learning, manipulating parameters and measuring learning outcomes is a lot easier than in more authentic language learning tasks and conditions, but the question is whether such experiments speak to ecologically valid conditions of L2 learning and use. Additionally, the claim seems completely ignorant of an important empirical research strand in the history of CALL, which will be discussed next.

2. Crowdsourcing and corrective feedback in the CALL research prototype DialogDungeon

Figure 2: learner completing a turn in DialogDungeon

Long before the heydays of Duolingo, CALL researchers were already exploring ideas inherent in crowdsourcing.
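As an aside for developers: the approximate string matching on which DialogDungeon's feedback relied, as described in this section, can be kept very simple. The sketch below is illustrative only and is not the project's actual implementation: the canned responses are invented, and the part-of-speech tagging and lemmatization components the project combined with edit distance are omitted. It matches a learner response against author-provided 'canned' responses using a length-normalized Levenshtein distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def best_match(learner_response, canned_responses):
    """Return (normalized distance, closest canned response).

    The raw edit distance is divided by the length of the longer string,
    so 0.0 means identical and values near 1.0 mean maximally different.
    """
    return min((levenshtein(learner_response.lower(), c.lower())
                / max(len(learner_response), len(c), 1), c)
               for c in canned_responses)

# Invented example: canned responses an exercise author might write for
# one detective turn in a murder-mystery dialogue.
canned = ["Where were you last night?", "Where were you yesterday evening?"]
dist, match = best_match("Where were you lsat night?", canned)
# A small normalized distance suggests the learner's response is close to
# an expected one, so feedback can target the mismatching tokens.
```

In the prototype, such a distance computation was combined with part-of-speech tagging, lemmatization and a set of simple rules to decide which tokens to underline and which metalinguistic hints to show, and crowdsourced alternative responses would simply extend the canned set.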
In his keynote at the 12th International CALL Research Conference, which addressed the theme "How are we doing? CALL and Monitoring the Learner", CALL pioneer Robert Fischer reviewed studies since the early 1990s that made use of "computer-based tracking", and argued vehemently for the analysis of tracking data with a view to "putting CALL on solid empirical footing" (Fischer, 2007). Although the scale at which these data were collected was inferior to the massive scale of data collection in contemporary applications such as Duolingo, the goals – understanding learning processes and improving CALL applications – were not fundamentally different.

More recently, Cornillie et al. (2013) developed and evaluated a gamified dialogue-based CALL research prototype that uses crowdsourcing in language learning tasks intended to engage learners in meaningful language processing rather than in forms-focused practice (of which Duolingo is primarily an example). The goal of the project, coined DialogDungeon, was to design a web-based proof-of-concept application for language learning inspired by gaming, with a primary emphasis on storytelling, dialogue and learner creativity. The prototype adopted principles from the framework of Purushotma, Thorne, & Wheatley (2008) for designing video games for foreign language learning in an evidence-based way, drawing on theory and practice in L2 learning and teaching, in particular task-based language teaching (TBLT).

In the proof-of-concept, the task for the user was to solve essentially non-language-focused problems – for instance, solving a murder mystery – by using language meaningfully – for instance, asking questions as a detective. These questions and other learner responses were embedded in semi-open written activities in which the learner was required to provide a response that matched a given context. This context consisted of both the preceding and subsequent turn in the dialogue, uttered by a non-player character (see grey speech bubbles in Figure 2), as well as other specific knowledge and language related to a given dialogue or story (e.g. a bloody knife encountered in a previous scene). In addition to its task-based nature, the environment was gamified: completing dialogue turns successfully resulted in ideas, represented as light bulbs, allowing the learner to level up from constable to superintendent detective. Successful completion of dialogues provided the learner-detective with evidence (photographs with written clues) to solve the case.

The language technology that generated feedback for the learner at a given turn in the dialogue was remarkably simple, but sufficient for the task at hand when combined with crowdsourcing. It consisted of an approximate string matching technique (based on Levenshtein edit distance, part-of-speech tagging and lemmatization) that computed the distance between the learner's response and a set of 'canned' (expected) responses, which were developed by the author of the materials for each learner turn in the dialogue.

As for the ideas related to crowdsourcing, the vision of the DialogDungeon team was that the application had to be interesting both for language learners and native speakers. In this way, the application could collect examples of authentic language use and leverage both native speaker and learner data to enrich the dialogue models with alternative responses (both 'correct' responses and 'incorrect' ones) that were not anticipated by the dialogue author (i.e. implicit crowdsourcing from language users). In a second stage, the original author of the dialogue or a teacher would annotate the collected responses for parameters like context-fit, appropriateness, and linguistic accuracy (i.e. explicit crowdsourcing from authors or teachers). A possible extension (not implemented in the prototype) was that machine learning algorithms would suggest possible scores for new responses based on their similarity to previous responses.

As the application was intended to be suitable for use in instructed L2 environments, it also provided corrective feedback (based on the string matching algorithm and a set of simple rules) that consisted of highlighted (underlined) tokens and metalinguistic hints that could help learners to revise their response (see Figure 3). Finally, learners could request the responses given by their peers, ranked by frequency. This was intended as a support tool for when users got stuck in the dialogue, but the team also tinkered with the idea of using this as an entry point for having more advanced learners (or native speakers) rate their peers' responses (explicit crowdsourcing).

An evaluation with a questionnaire showed that the majority of learners found the corrective feedback mostly useful, with a median score of 4.75 on a seven-point Likert scale, and that learners with higher prior knowledge of grammar used the feedback more often (Cornillie et al., 2013).

Figure 3: corrective feedback in DialogDungeon

3. Gameful corrective feedback: potential for crowdsourcing-based CALL

One of the challenges for designers of crowdsourcing-based applications is to capture the user's attention for as long as possible, so that more (informative) user data can be collected to improve the service. Many have therefore turned to gamification, which we define as "the use of game design elements in non-game contexts" (Deterding, Dixon, Khaled, & Nacke, 2011). However, from an L2 learning and teaching perspective, it is crucial that such gamified applications are equally based on proven models of L2 learning as well as sound and widely accepted principles for L2 teaching. In other words, designers will also want user engagement with their applications to be effective, and transfer to real-life situations of communicative L2 use. Grounding the design of a crowdsourcing-based language learning application on largely discredited models of L2 learning (e.g. behaviourism) is therefore not a good starting point.

Instead, it is imperative that designers of game-based language learning applications start from the rich research literature in CALL that explores the intersections of gaming and task-based learning. Case studies in digital game-based language learning 'in the wild' (i.e. in non-instructed, informal online environments) show that such environments are particularly fecund for the acquisition of communicative L2 skills. In an attempt to explain this phenomenon, a number of applied linguists (e.g. Cornillie, Thorne, & Desmet, 2012; Purushotma et al., 2008) have observed that (digital) games align exceptionally well with principles of task-based language learning. First, games are all about achieving (non-linguistic) goals, such as saving the princess – pardon the masculine example. Second, in order to attain these goals, players use language (lexicogrammatical form-function-meaning mappings) meaningfully and communicatively. Language is therefore not learned intentionally, but as the by-product of engaging in tasks that are relevant to the needs of learners, which has been shown to be highly effective for L2 learning. Third, gaming is not play in a sandbox; it is structured play: games are structured around scenarios and mechanics. This echoes Ellis' (2003) criterial feature of a task as being a workplan. And fourth, games are intensively interactive: they react instantly to players' actions, and because players make tons of choices, this results in an endless stream of feedback.

However, if designers want to translate insights from 'in the wild' case studies to formal, instructed L2 learning contexts, we need to be wary of what Larsen-Freeman (2003) called the reflex fallacy: the assumption that "it is our job to re-create in our classrooms the natural conditions of acquisition present in the external environment. Instead, what we want to do as language teachers, it seems to me, is to improve upon natural acquisition, not emulate it … we want to accelerate the actual rate of acquisition beyond what the students could achieve on their own … accelerating natural learning is, after all, the purpose of formal education" (p. 20).

One of the ways in which natural learning can be accelerated is by providing the learner in such task-based, meaning-focused environments with form-focused corrective feedback. Such feedback can recruit learner noticing and language awareness, focusing the learner's attention on linguistic form, which is essential for L2 development in instructed contexts. Building on empirical (including experimental) studies in the CALL literature on gaming as well as a motivational model of video game engagement grounded in Self-Determination Theory (Przybylski, Rigby, & Ryan, 2010), Cornillie (2014, 2017) elaborated a model of gameful corrective feedback that can support 'learner engagement in game-based CALL'. He defined this as learner behaviour that is driven by intrinsic motivation, that is focused primarily on language meaning and communicative use, and that involves attention to linguistic form through corrective feedback (2017). Notably, he found that gameful corrective feedback can accelerate natural L2 learning, while simultaneously stimulating intrinsic motivation, which will be associated with continued use of the environment. Designers of crowdsourcing-based CALL environments can build on this model to both enable data collection at scale and deliver effective learning experiences.

4. Conclusion: call for collaboration

Crowdsourcing offers exciting opportunities for L2 educators, L2 learning researchers, and developers of CALL applications. Educators will want to use crowdsourcing for at least three reasons. First, crowdsourcing allows them to personalize the learning environment for each individual learner. Second, in semi-open L2 learning tasks, it can power the generation of automated corrective feedback, necessary for accelerating natural L2 learning. Third, educators may believe in the pedagogical value of crowdsourcing because authentic language learning tasks such as storytelling are so much more interesting when the audience is actively involved, as is evident in the growing interest in fan fiction for language learning (e.g. Sauro, 2017).

Next, L2 learning researchers also have reasons to embrace crowdsourcing. It provides them with a much more fine-grained lens, combined with logistically much less demanding data collection processes, to unravel learning processes. It also provides them with a methodological toolkit to study the interactions between language and its users (both 'native speakers' and 'language learners') over time, in a complex and dynamic system (De Bot, Lowie, & Verspoor, 2007).

Finally, crowdsourcing enables developers of CALL applications to launch prototypes much sooner and evaluate basic interactions at scale in order to optimize functionalities such as automated corrective feedback at a later stage. Thus, much is to be gained from an intensive collaboration between educators, researchers and developers on the topic of crowdsourcing-based CALL.

5. Acknowledgements

The design, development and evaluation of DialogDungeon was realized through the ICON project LLINGO (Language Learning in INteractive Gaming envirOnment; 2009-2011). LLINGO was funded by iMinds (now: imec) and IWT (now: Flanders Innovation & Entrepreneurship), and was carried out in collaboration with game developer Larian Studios, the Flemish radio and television broadcasting organisation VRT, Business Language and Communication Centre, and Televic Education.

6. Bibliographical References

Ai, H. (2017). Providing graduated corrective feedback in an intelligent computer-assisted language learning environment. ReCALL, 29(3), 313–334. https://doi.org/10.1017/S095834401700012X
Cornillie, F. (2014). Adventures in red ink. Effectiveness of corrective feedback in digital game-based language learning (Unpublished doctoral dissertation). KU Leuven (University of Leuven).
Cornillie, F. (2017). Educationally Designed Game Environments and Feedback. In S. Thorne & S. May (Eds.), Language, Education and Technology (pp. 361–374). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-02237-6_28
Cornillie, F., & Desmet, P. (2016). Mini-games for language learning. In F. Farr & L. Murray (Eds.), The Routledge Handbook of Language Learning and Technology (pp. 431–445). Abingdon: Routledge.
Cornillie, F., Lagatie, R., Vandewaetere, M., Clarebout, G., & Desmet, P. (2013). Tools that detectives use: in search of learner-related determinants for usage of optional feedback in a written murder mystery. In P. Hubbard, M. Schulze, & B. Smith (Eds.), Learner-Computer Interaction in Language Education: A Festschrift in Honor of Robert Fischer (pp. 22–45). San Marcos, TX: Computer Assisted Language Instruction Consortium (CALICO).
Cornillie, F., Thorne, S. L., & Desmet, P. (2012). Digital games for language learning: from hype to insight? ReCALL, 24(3), 243–256.
De Bot, K., Lowie, W., & Verspoor, M. (2007). A Dynamic Systems Theory approach to second language acquisition. Bilingualism: Language and Cognition, 10(1), 7–21.
Deterding, S., Dixon, D., Khaled, R., & Nacke, L. (2011). From Game Design Elements to Gamefulness: Defining "Gamification." In MindTrek 2011 Proceedings. Tampere: ACM Press.
Ellis, R. (2003). Task-based language learning and teaching. Oxford: Oxford University Press.
Fischer, R. (2007). How do we know what students are actually doing? Monitoring students' behavior in CALL. Computer Assisted Language Learning, 20(5), 409–442.
Gannes, L. (2014). Why a Computer Is Often the Best Teacher, According to Duolingo's Luis Von Ahn. Retrieved February 3, 2019, from https://www.recode.net/2014/11/3/11632536/why-a-computer-is-often-the-best-teacher-according-to-duolingos-luis
Keuleers, E., Stevens, M., Mandera, P., & Brysbaert, M. (2015). Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment. Quarterly Journal of Experimental Psychology, 68(8), 1665–1692.
Lardinois, F. (2018). Duolingo hires its first chief marketing officer as active user numbers stagnate but revenue grows. Retrieved February 2, 2019, from https://techcrunch.com/2018/08/01/duolingo-hires-its-first-chief-marketing-officer-as-active-user-numbers-stagnate/
Larsen-Freeman, D. (2003). Teaching language. From grammar to grammaring. Boston: Thomson/Heinle.
Li, S. (2010). The Effectiveness of Corrective Feedback in SLA: A Meta-Analysis. Language Learning, 60(2), 309–365. https://doi.org/10.1111/j.1467-9922.2010.00561.x
Przybylski, A. K., Rigby, C. S., & Ryan, R. M. (2010). A motivational model of video game engagement. Review of General Psychology, 14(2), 154–166. https://doi.org/10.1037/a0019440
Purushotma, R., Thorne, S. L., & Wheatley, J. (2008). 10 key principles for designing video games for foreign language learning. Retrieved February 2, 2019, from http://knol.google.com/k/ravi-purushotma/10-key-principles-for-designing-video/27mkxqba7b13d/2#done
Reinhardt, J. (2017). Digital Gaming in L2 Teaching and Learning. In C. Chapelle & S. Sauro (Eds.), The Handbook of Technology in Second Language Teaching and Learning (pp. 202–216). Wiley-Blackwell.
Sauro, S. (2017). Online Fan Practices and CALL. CALICO Journal, 34(2), 131–146.
Settles, B., Brust, C., Gustafson, E., Hagiwara, M., & Madnani, N. (2018). Second Language Acquisition Modeling. In Proceedings of the NAACL-HLT Workshop on Innovative Use of NLP for Building Educational Applications (BEA). ACL.
Sheen, Y. (2011). Corrective Feedback, Individual Differences and Second Language Learning. London: Springer.
Teske, K. (2017). Learning Technology Review. Duolingo. CALICO Journal, 34(3), 393–401.
von Ahn, L. (2013). Duolingo: Learn a Language for Free while Helping to Translate the Web. In Proceedings of the 2013 international conference on Intelligent user interfaces (IUI '13) (pp. 1–2). New York: ACM.