Design Issues in Language Learning Based on Crowdsourcing: The Critical Role of Gameful Corrective Feedback

Frederik Cornillie
KU Leuven – ITEC, also at imec
KU Leuven campus Kulak Kortrijk, Etienne Sabbelaan 51, 8500 Kortrijk, Belgium
frederik.cornillie@kuleuven.be

Abstract
Crowdsourcing has revolutionized the software market, affecting the quality, adoption and business models of consumer software applications in many domains of human behaviour. In language learning, however, its impact is still to be seen. Through the lens of the commercial application Duolingo as well as the research prototype DialogDungeon, this paper discusses corrective feedback, a design feature of (technology-enhanced) language learning environments that can be a key driver for both learning success and platform adoption, and which will equally need to be considered in the design of language learning based on crowdsourcing. We address this topic from the literature at the intersection of second language (L2) acquisition, computer-assisted language learning (CALL), human motivation, and gamification. We conclude with a call for collaboration between educators, L2 acquisition researchers and developers of crowdsourcing-based applications.

Keywords: digital game-based language instruction, corrective feedback, crowdsourcing

EnetCollect WG3 & WG5 Meeting, 24-25 October 2018, Leiden, Netherlands

1. Crowdsourcing and corrective feedback in Duolingo

As a result of the Web 2.0 revolution, crowdsourcing has had a tremendous impact on the quality and adoption of many consumer software applications. Much more slowly, crowdsourcing is finding its way into research on language learning (e.g. Keuleers, Stevens, Mandera, & Brysbaert, 2015) and – arguably less effectively – into online language learning applications.

The currently most popular commercial example is the gamified language learning application Duolingo, with 25 million monthly active users (Lardinois, 2018). Originally designed as a project to translate the web into every major language (von Ahn, 2013), Duolingo is not undisputed on a pedagogical level because of its behaviourist approach to second language (L2) learning (Reinhardt, 2017; Teske, 2017; for related discussion see Cornillie & Desmet, 2016). However, its use of crowdsourcing may be useful in the L2 learning process.

On the one hand, implicit crowdsourcing of learner responses in Duolingo exercises can serve to improve the language models and learner modelling modules that among other things provide automated corrective feedback, a feature of (online) language learning environments that can be very effective when considered carefully in the instructional design process (see e.g. the meta-analysis of Li, 2010). In 2018, Duolingo organized a shared task on second language acquisition modelling, in conjunction with the 13th workshop on the innovative use of natural language processing for building educational applications (BEA) (Settles, Brust, Gustafson, Hagiwara, & Madnani, 2018). For this shared task, the company released a dataset comprising log files from millions of exercises completed by thousands of students during their first 30 days of learning on Duolingo. The goal for participants of the BEA workshop was to predict what mistakes each learner would make in the future, with a view to improving personalized instruction in the application. This shared task shows that Duolingo are actively working on leveraging state-of-the-art machine learning and psychometric techniques to improve their learner modelling and feedback generation. From a cognitive perspective on L2 learning, this is a valuable evolution, when we consider that the effectiveness of corrective feedback depends to a great extent on individual differences (Sheen, 2011).

On the other hand, the language learning platform also involves its users in explicit crowdsourcing. For instance, learners can request that the system accept their alternative responses, they can indicate that the language in the exercises sounds unnatural or contains mistakes, or they can discuss solutions with their peers on an online forum (see Figure 1). These activities can recruit language awareness both individually and in interaction with other L2 users, which is equally relevant in the L2 learning process, particularly from a (socio-)constructivist point of view (for an illustration of this approach, see Ai, 2017).

Figure 1: explicit crowdsourcing in Duolingo

In addition to optimizing their platform through crowdsourcing, Duolingo have disclosed their interest in putting crowdsourcing to use in order to investigate L2 learning processes. Luis von Ahn, creator of Duolingo, stated that their data-driven approach and online experiments at scale can figure out "which students pick up the new concept and when", and that they can do this a lot faster than "the offline education system" (Gannes, 2014). With "the offline education system", von Ahn seems to hint at the research field of L2 acquisition. Many L2 researchers and other educational scientists will agree that this bold claim is rather simplistic – in a highly controlled environment inspired by behaviourist models of L2 learning, manipulating parameters and measuring learning outcomes is a lot easier than in more authentic language learning tasks and conditions, but the question is whether such experiments speak to ecologically valid conditions of L2 learning and use. Additionally, the claim seems completely ignorant of an important empirical research strand in the history of CALL, which will be discussed next.

2. Crowdsourcing and corrective feedback in the CALL research prototype DialogDungeon

Figure 2: learner completing a turn in DialogDungeon

Long before the heydays of Duolingo, CALL researchers were already exploring ideas inherent in crowdsourcing.
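As an aside for developers: the approximate string matching on which DialogDungeon's feedback relied, as described in this section, can be kept very simple. The sketch below is illustrative only and is not the project's actual implementation: the canned responses are invented, and the part-of-speech tagging and lemmatization components the project combined with edit distance are omitted. It matches a learner response against author-provided 'canned' responses using a length-normalized Levenshtein distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def best_match(learner_response, canned_responses):
    """Return (normalized distance, closest canned response).

    The raw edit distance is divided by the length of the longer string,
    so 0.0 means identical and values near 1.0 mean maximally different.
    """
    return min((levenshtein(learner_response.lower(), c.lower())
                / max(len(learner_response), len(c), 1), c)
               for c in canned_responses)

# Invented example: canned responses an exercise author might write for
# one detective turn in a murder-mystery dialogue.
canned = ["Where were you last night?", "Where were you yesterday evening?"]
dist, match = best_match("Where were you lsat night?", canned)
# A small normalized distance suggests the learner's response is close to
# an expected one, so feedback can target the mismatching tokens.
```

In the prototype, such a distance computation was combined with part-of-speech tagging, lemmatization and a set of simple rules to decide which tokens to underline and which metalinguistic hints to show, and crowdsourced alternative responses would simply extend the canned set.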
In his keynote at the 12th International CALL Research Conference, which addressed the theme "How are we doing? CALL and Monitoring the Learner", CALL pioneer Robert Fischer reviewed studies since the early 1990s that made use of "computer-based tracking", and argued vehemently for the analysis of tracking data with a view to "putting CALL on solid empirical footing" (Fischer, 2007). Although the scale at which these data were collected was inferior to the massive scale of data collection in contemporary applications such as Duolingo, the goals – understanding learning processes and improving CALL applications – were not fundamentally different.

More recently, Cornillie et al. (2013) developed and evaluated a gamified dialogue-based CALL research prototype that uses crowdsourcing in language learning tasks intended to engage learners in meaningful language processing rather than in forms-focused practice (of which Duolingo is primarily an example). The goal of the project, coined DialogDungeon, was to design a web-based proof-of-concept application for language learning inspired by gaming, with a primary emphasis on storytelling, dialogue and learner creativity. The prototype adopted principles from the framework of Purushotma, Thorne, & Wheatley (2008) for designing video games for foreign language learning in an evidence-based way, drawing on theory and practice in L2 learning and teaching, in particular task-based language teaching (TBLT).

In the proof-of-concept, the task for the user was to solve essentially non-language-focused problems – for instance, solving a murder mystery – by using language meaningfully – for instance, asking questions as a detective. These questions and other learner responses were embedded in semi-open written activities in which the learner was required to provide a response that matched a given context. This context consisted of both the preceding and subsequent turn in the dialogue, uttered by a non-player character (see grey speech bubbles in Figure 2), as well as other specific knowledge and language related to a given dialogue or story (e.g. a bloody knife encountered in a previous scene). In addition to its task-based nature, the environment was gamified: completing dialogue turns successfully resulted in ideas, represented as light bulbs, allowing the learner to level up from constable to superintendent detective. Successful completion of dialogues provided the learner-detective with evidence (photographs with written clues) to solve the case.

The language technology that generated feedback for the learner at a given turn in the dialogue was remarkably simple, but sufficient for the task at hand when combined with crowdsourcing. It consisted of an approximate string matching technique (based on Levenshtein edit distance, part-of-speech tagging and lemmatization) that computed the distance between the learner's response and a set of 'canned' (expected) responses, which were developed by the author of the materials for each learner turn in the dialogue.

As for the ideas related to crowdsourcing, the vision of the DialogDungeon team was that the application had to be interesting both for language learners and native speakers. In this way, the application could collect examples of authentic language use and leverage both native speaker and learner data to enrich the dialogue models with alternative responses (both 'correct' responses and 'incorrect' ones) that were not anticipated by the dialogue author (i.e. implicit crowdsourcing from language users). In a second stage, the original author of the dialogue or a teacher would annotate the collected responses for parameters like context-fit, appropriateness, and linguistic accuracy (i.e. explicit crowdsourcing from authors or teachers). A possible extension (not implemented in the prototype) was that machine learning algorithms would suggest possible scores for new responses based on their similarity to previous responses.

As the application was intended to be suitable for use in instructed L2 environments, it also provided corrective feedback (based on the string matching algorithm and a set of simple rules) that consisted of highlighted (underlined) tokens and metalinguistic hints that could help learners to revise their response (see Figure 3). Finally, learners could request the responses given by their peers, ranked by frequency. This was intended as a support tool for when users got stuck in the dialogue, but the team also tinkered with the idea of using this as an entry point for having more advanced learners (or native speakers) rate their peers' responses (explicit crowdsourcing).

An evaluation with a questionnaire showed that the majority of learners found the corrective feedback mostly useful, with a median score of 4.75 on a seven-point Likert scale, and that learners with higher prior knowledge of grammar used the feedback more often (Cornillie et al., 2013).

Figure 3: corrective feedback in DialogDungeon

3. Gameful corrective feedback: potential for crowdsourcing-based CALL

One of the challenges for designers of crowdsourcing-based applications is to capture the user's attention for as long as possible, so that more (informative) user data can be collected to improve the service. Many have therefore turned to gamification, which we define as "the use of game design elements in non-game contexts" (Deterding, Dixon, Khaled, & Nacke, 2011). However, from an L2 learning and teaching perspective, it is crucial that such gamified applications are equally based on proven models of L2 learning as well as sound and widely accepted principles for L2 teaching. In other words, designers will also want user engagement with their applications to be effective, and transfer to real-life situations of communicative L2 use. Grounding the design of a crowdsourcing-based language learning application on largely discredited models of L2 learning (e.g. behaviourism) is therefore not a good starting point.

Instead, it is imperative that designers of game-based language learning applications start from the rich research literature in CALL that explores the intersections of gaming and task-based learning. Case studies in digital game-based language learning 'in the wild' (i.e. in non-instructed, informal online environments) show that such environments are particularly fecund for the acquisition of communicative L2 skills. In an attempt to explain this phenomenon, a number of applied linguists (e.g. Cornillie, Thorne, & Desmet, 2012; Purushotma et al., 2008) have observed that (digital) games align exceptionally well with principles of task-based language learning. First, games are all about achieving (non-linguistic) goals, such as saving the princess – pardon the masculine example. Second, in order to attain these goals, players use language (lexicogrammatical form-function-meaning mappings) meaningfully and communicatively. Language is therefore not learned intentionally, but as the by-product of engaging in tasks that are relevant to the needs of learners, which has been shown to be highly effective for L2 learning. Third, gaming is not play in a sandbox; it is structured play: games are structured around scenarios and mechanics. This echoes Ellis' (2003) criterial feature of a task as being a workplan. And fourth, games are intensively interactive: they react instantly to players' actions, and because players make tons of choices, this results in an endless stream of feedback.

However, if designers want to translate insights from 'in the wild' case studies to formal, instructed L2 learning contexts, we need to be wary of what Larsen-Freeman (2003) called the reflex fallacy: the assumption that "it is our job to re-create in our classrooms the natural conditions of acquisition present in the external environment. Instead, what we want to do as language teachers, it seems to me, is to improve upon natural acquisition, not emulate it … we want to accelerate the actual rate of acquisition beyond what the students could achieve on their own … accelerating natural learning is, after all, the purpose of formal education" (p. 20).

One of the ways in which natural learning can be accelerated is by providing the learner in such task-based, meaning-focused environments with form-focused corrective feedback. Such feedback can recruit learner noticing and language awareness, focusing the learner's attention on linguistic form, which is essential for L2 development in instructed contexts. Building on empirical (including experimental) studies in the CALL literature on gaming as well as a motivational model of video game engagement grounded in Self-Determination Theory (Przybylski, Rigby, & Ryan, 2010), Cornillie (2014, 2017) elaborated a model of gameful corrective feedback that can support 'learner engagement in game-based CALL'. He defined this as learner behaviour that is driven by intrinsic motivation, that is focused primarily on language meaning and communicative use, and that involves attention to linguistic form through corrective feedback (2017). Notably, he found that gameful corrective feedback can accelerate natural L2 learning, while simultaneously stimulating intrinsic motivation, which will be associated with continued use of the environment. Designers of crowdsourcing-based CALL environments can build on this model to both enable data collection at scale and deliver effective learning experiences.

4. Conclusion: call for collaboration

Crowdsourcing offers exciting opportunities for L2 educators, L2 learning researchers, and developers of CALL applications. Educators will want to use crowdsourcing for at least three reasons. First, crowdsourcing allows them to personalize the learning environment for each individual learner. Second, in semi-open L2 learning tasks, it can power the generation of automated corrective feedback, necessary for accelerating natural L2 learning. Third, educators may believe in the pedagogical value of crowdsourcing because authentic language learning tasks such as storytelling are so much more interesting when the audience is actively involved, as is evident in the growing interest in fan fiction for language learning (e.g. Sauro, 2017).

Next, L2 learning researchers also have reasons to embrace crowdsourcing. It provides them with a much more fine-grained lens, combined with logistically much less demanding data collection processes, to unravel learning processes. It also provides them with a methodological toolkit to study the interactions between language and its users (both 'native speakers' and 'language learners') over time, in a complex and dynamic system (De Bot, Lowie, & Verspoor, 2007).

Finally, crowdsourcing enables developers of CALL applications to launch prototypes much sooner and evaluate basic interactions at scale in order to optimize functionalities such as automated corrective feedback at a later stage. Thus, much is to be gained from an intensive collaboration between educators, researchers and developers on the topic of crowdsourcing-based CALL.

5. Acknowledgements

The design, development and evaluation of DialogDungeon was realized through the ICON project LLINGO (Language Learning in INteractive Gaming envirOnment; 2009-2011). LLINGO was funded by iMinds (now: imec) and IWT (now: Flanders Innovation & Entrepreneurship), and was carried out in collaboration with game developer Larian Studios, the Flemish radio and television broadcasting organisation VRT, Business Language and Communication Centre, and Televic Education.

6. Bibliographical References

Ai, H. (2017). Providing graduated corrective feedback in an intelligent computer-assisted language learning environment. ReCALL, 29(3), 313–334. https://doi.org/10.1017/S095834401700012X
Cornillie, F. (2014). Adventures in red ink. Effectiveness of corrective feedback in digital game-based language learning (Unpublished doctoral dissertation). KU Leuven (University of Leuven).
Cornillie, F. (2017). Educationally Designed Game Environments and Feedback. In S. Thorne & S. May (Eds.), Language, Education and Technology (pp. 361–374). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-02237-6_28
Cornillie, F., & Desmet, P. (2016). Mini-games for language learning. In F. Farr & L. Murray (Eds.), The Routledge Handbook of Language Learning and Technology (pp. 431–445). Abingdon: Routledge.
Cornillie, F., Lagatie, R., Vandewaetere, M., Clarebout, G., & Desmet, P. (2013). Tools that detectives use: in search of learner-related determinants for usage of optional feedback in a written murder mystery. In P. Hubbard, M. Schulze, & B. Smith (Eds.), Learner-Computer Interaction in Language Education: A Festschrift in Honor of Robert Fischer (pp. 22–45). San Marcos, TX: Computer Assisted Language Instruction Consortium (CALICO).
Cornillie, F., Thorne, S. L., & Desmet, P. (2012). Digital games for language learning: from hype to insight? ReCALL, 24(3), 243–256.
De Bot, K., Lowie, W., & Verspoor, M. (2007). A Dynamic Systems Theory approach to second language acquisition. Bilingualism: Language and Cognition, 10(1), 7–21.
Deterding, S., Dixon, D., Khaled, R., & Nacke, L. (2011). From Game Design Elements to Gamefulness: Defining "Gamification." In MindTrek 2011 Proceedings. Tampere: ACM Press.
Ellis, R. (2003). Task-based language learning and teaching. Oxford: Oxford University Press.
Fischer, R. (2007). How do we know what students are actually doing? Monitoring students' behavior in CALL. Computer Assisted Language Learning, 20(5), 409–442.
Gannes, L. (2014). Why a Computer Is Often the Best Teacher, According to Duolingo's Luis Von Ahn. Retrieved February 3, 2019, from https://www.recode.net/2014/11/3/11632536/why-a-computer-is-often-the-best-teacher-according-to-duolingos-luis
Keuleers, E., Stevens, M., Mandera, P., & Brysbaert, M. (2015). Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment. Quarterly Journal of Experimental Psychology, 68(8), 1665–1692.
Lardinois, F. (2018). Duolingo hires its first chief marketing officer as active user numbers stagnate but revenue grows. Retrieved February 2, 2019, from https://techcrunch.com/2018/08/01/duolingo-hires-its-first-chief-marketing-officer-as-active-user-numbers-stagnate/
Larsen-Freeman, D. (2003). Teaching language. From grammar to grammaring. Boston: Thomson/Heinle.
Li, S. (2010). The Effectiveness of Corrective Feedback in SLA: A Meta-Analysis. Language Learning, 60(2), 309–365. https://doi.org/10.1111/j.1467-9922.2010.00561.x
Przybylski, A. K., Rigby, C. S., & Ryan, R. M. (2010). A motivational model of video game engagement. Review of General Psychology, 14(2), 154–166. https://doi.org/10.1037/a0019440
Purushotma, R., Thorne, S. L., & Wheatley, J. (2008). 10 key principles for designing video games for foreign language learning. Retrieved February 2, 2019, from http://knol.google.com/k/ravi-purushotma/10-key-principles-for-designing-video/27mkxqba7b13d/2#done
Reinhardt, J. (2017). Digital Gaming in L2 Teaching and Learning. In C. Chapelle & S. Sauro (Eds.), The Handbook of Technology in Second Language Teaching and Learning (pp. 202–216). Wiley-Blackwell.
Sauro, S. (2017). Online Fan Practices and CALL. CALICO Journal, 34(2), 131–146.
Settles, B., Brust, C., Gustafson, E., Hagiwara, M., & Madnani, N. (2018). Second Language Acquisition Modeling. In Proceedings of the NAACL-HLT Workshop on Innovative Use of NLP for Building Educational Applications (BEA). ACL.
Sheen, Y. (2011). Corrective Feedback, Individual Differences and Second Language Learning. London: Springer.
Teske, K. (2017). Learning Technology Review. Duolingo. CALICO Journal, 34(3), 393–401.
von Ahn, L. (2013). Duolingo: Learn a Language for Free while Helping to Translate the Web. In Proceedings of the 2013 international conference on Intelligent user interfaces (IUI '13) (pp. 1–2). New York: ACM.