=Paper=
{{Paper
|id=Vol-2848/user2agent_paper_5
|storemode=property
|title=Assessing Language Learners’ Free Productive Vocabulary with Hidden-task-oriented Dialogue Systems
|pdfUrl=https://ceur-ws.org/Vol-2848/user2agent-paper-4.pdf
|volume=Vol-2848
|authors=Dolça Tellols,Takenobu Tokunaga,Hilofumi Yamamoto
|dblpUrl=https://dblp.org/rec/conf/iui/TellolsTY20
}}
==Assessing Language Learners’ Free Productive Vocabulary with Hidden-task-oriented Dialogue Systems==
Assessing Language Learners' Free Productive Vocabulary with Hidden-task-oriented Dialogue Systems

Dolça Tellols, Takenobu Tokunaga, Hilofumi Yamamoto
Tokyo Institute of Technology, Tokyo, Japan
tellols.d.aa@m.titech.ac.jp, take@c.titech.ac.jp, yamagen@ila.titech.ac.jp

ABSTRACT

This paper proposes a new task to assess language learners' free productive vocabulary, which is related to being able to articulate certain words without getting explicit hints about them. To perform the task, we propose the use of a new kind of dialogue system which induces learners to use specific words during a natural conversation, to assess whether those words are part of their free productive vocabulary. Though the systems have a task, it is hidden from the users, who may therefore consider the systems task-less. Because these systems do not fall into the existing categories for dialogue systems (task-oriented and non-task-oriented), we name them hidden-task-oriented dialogue systems. To study the feasibility of our approach, we conducted three experiments. The Question Answering experiment evaluated how easily learners could recall a target word from its dictionary gloss. Through the Wizard of Oz experiment, we confirmed that the proposed task is hard, but humans can achieve it to some extent. Finally, the Context Continuation experiment showed that a simple corpus-retrieval approach might not work to implement the proposed dialogue systems. In this work, we analyse the experiment results in detail and discuss the implementation of dialogue systems capable of performing the proposed task.

CCS CONCEPTS

• Computing methodologies → Intelligent agents; • Applied computing → Education.

KEYWORDS

Computer Aided Language Learning, Dialogue Systems, Productive Vocabulary

ACM Reference Format:
Dolça Tellols, Takenobu Tokunaga, and Hilofumi Yamamoto. 2020. Assessing Language Learners' Free Productive Vocabulary with Hidden-task-oriented Dialogue Systems. In IUI '20 Workshops, March 17, 2020, Cagliari, Italy. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

Second language (L2) learning has attracted much attention in recent years since revitalised Artificial Intelligence (AI) research opened the door to more sophisticated Intelligent Computer Assisted Language Learning (ICALL) [18]. Among other topics, vocabulary assessment by computers has been an active research area, with studies focusing on the automatic generation of vocabulary evaluation questions [3, 8] or the measurement of vocabulary size through computerised adaptive testing (CAT) [27]. However, these studies concern the assessment of receptive vocabulary, which is used to comprehend texts or utterances. In contrast, there is a lack of studies on the computerised assessment of productive vocabulary, which is used to speak and write [29]. From the viewpoint of linguistic proficiency, receptive vocabulary is related to language understanding and productive vocabulary to language production. It is said that there is a gap between understanding the meaning of a particular word (passive or receptive vocabulary) and being able to articulate it (active or productive vocabulary) [12].

Although there exist many approaches to evaluate receptive vocabulary, studies that focus on the assessment of productive vocabulary are scarce. Meara and Fitzpatrick [17] and Laufer and Nation [13], who propose the Lex30 task and the LFP (Lexical Frequency Profile) measure respectively, are two exceptions. Lex30 is a word association task where learners have to provide words given a stimulus word. LFP measures vocabulary size based on the proportion of words at different vocabulary-frequency levels that learners use in their writing.

It is considered that productive ability may comprise different degrees of knowledge. We refer to the ability to use a word at one's free will as free productive ability, while controlled productive ability refers to the ability to use a word when driven to do so [14]. Fill-in-the-blank tasks evaluate controlled productive ability and, though the Lex30 task aims to assess free productive ability, its stimulus words make it controlled to some extent. We can use the Lexical Frequency Profile to measure free productive vocabulary size, but it cannot determine whether learners are capable of freely using specific words.

Ideally, we would assess free productive ability in conversational contexts, but this complicates the design of tasks for this purpose even more. Speaking tests used in language certification exams are one option to overcome this deficiency, but they require human resources for the evaluation and hardly ever specify words to test whether learners can use them. Suendermann-Oeft et al. [25] tried to solve the human resource problem by replacing the evaluators with a multi-modal dialogue system, but they do not provide a solution to the latter problem, the evaluation of specific words.

Against this backdrop, the present work proposes a new task for dialogue systems to evaluate free productive vocabulary by inducing learners to naturally use the words to assess during a conversation, without providing explicit hints about them. Our hypothesis for the assessment is that a certain set of words forms part of people's free productive vocabulary if they can naturally use those words in a conversation without having been asked explicitly to do so.

Dialogue systems are usually divided into two categories: task-oriented and non-task-oriented. Systems capable of performing the proposed task can be considered non-task-oriented from the user point of view and task-oriented from the system point of view (though the task is hidden from the user). Given the asymmetrical nature of the proposed systems, it is hard to fit them into one of the available categories. Consequently, we propose a new one named hidden-task-oriented dialogue systems.
We further explain this new category in section 4.

In our previous work, we briefly presented the proposed task and investigated some of the difficulties that its implementation may have to deal with [26]. In this work, we review the experiments and expand them. Additionally, we analyse the requirements for the design of dialogue systems capable of performing the proposed task and discuss the techniques that we may use for their implementation, which we leave as future work.

2 RELATED WORK

Recent studies on vocabulary assessment concern various aspects, e.g. asking words with or without a context, and different forms of questions, e.g. multiple-choice or fill-in-the-blank questions [23, 24]. Others also point out the importance of domain when assessing lexical knowledge [21]. We focus on the distinction between receptive and productive vocabulary and, more specifically, propose a new method for assessing language learners' free productive vocabulary through dialogue systems.

Laufer and Nation [14] proposed evaluating controlled productive vocabulary by using sentence completion tasks where they gave some initial letters of the target word. However, this technique is controversial because it may assess receptive vocabulary instead of productive vocabulary, as the initial letters provide a hint for guessing the target words [19]. Others used translation tasks that ask learners to translate L1 (mother tongue) expressions into L2 (the language being learned) [29]. The problem with this approach is that tests need to be adapted to the L1 of the learners. Moreover, target words need to be chosen carefully to ensure that learners use the expected target word and not a synonym. In our proposal, we do not plan on giving any explicit hints for the target words, and no adaptation to the L1 is needed, since dialogues will be conducted directly in the L2.

Regarding computer-assisted vocabulary assessment, Brown et al. [3] and Heilman and Eskenazi [8] studied the automatic generation of vocabulary assessment questions, and Tseng [27] focused on the measurement of English learners' vocabulary size. Allen and McNamara [1] utilised Natural Language Processing (NLP) tools to analyse the lexical sophistication of learners' essays to estimate their vocabulary size. They also pointed out the importance of providing personalised instruction to each learner. We take this aspect into account by controlling dialogue topics according to the learner's interests and the words being assessed.

Fryer and Carpenter [6] discuss the possibility of utilising dialogue systems in language education. Nowadays, many commercial language learning applications provide conversations with chatbots, e.g. Duolingo Bots (http://bots.duolingo.com), Andy (https://andychatbot.com), Mondly (https://www.mondly.com) and Eggbun Education (https://web.eggbun.net). However, most of them base their interactions on predefined answers or have a rigidly guided task-oriented dialogue. Research-level systems are more versatile than commercial ones. As an example, Genie Tutor [10] is a dialogue-based language learning system designed for native Korean speakers learning English. It accepts free text input in a given scenario and can respond by retrieving utterances from a dialogue corpus based on their context similarity. Höhn [9] introduces an Artificial Intelligence Markup Language (AIML)-based chatbot for conversational practice, which recognises repair initiations and generates repair carry-outs. Wilske [30] also examines how NLP, and particularly dialogue systems, can contribute to language learning; in her dialogue system, learners can receive feedback on their utterances.

Research on automated language proficiency evaluation through dialogue is scarce. Some studies include the assessment of the verbal skill of English learners through task-oriented dialogues [15] or through simulated conversations [5]. There is also the already mentioned proposal of a multimodal dialogue system for the evaluation of English learners' speech capabilities [25]. Our contribution is a new free productive vocabulary assessment methodology in the form of a new task for dialogue systems. Because our dialogue systems do not fall into any of the existing categories (task-oriented and non-task-oriented), we propose a new one named hidden-task-oriented dialogue systems.

3 PROPOSED TASK TO ASSESS FREE PRODUCTIVE VOCABULARY

3.1 Hypothesis

This work is based on the following hypothesis: "If a person can naturally use a certain word during a conversation, we can assume that it belongs to their free productive vocabulary".

3.2 Task goal

Taking this hypothesis into consideration, we propose a new task for dialogue systems (DS) that will be used to evaluate free productive vocabulary. The goal of the task is to induce learners to naturally use certain target words (TWs) during a conversation by generating an appropriate dialogue context. Directly asking for the words or providing explicit hints about them is prohibited. Figure 1 illustrates appropriate and inappropriate examples of the DS behaviour.

Appropriate:
  S: I think I want to travel somewhere. What would you recommend to me?
  L: You could go to London.
  S: Nice idea! How could I get there?
  L: I think nowadays you can go by plane or train.
Inappropriate:
  S: How do you call a railway vehicle that is self-propelled on a track carrying people and luggage thanks to electricity?
  L: A train.
(S: system, L: learner)

Figure 1: Appropriate and inappropriate dialogue examples of the proposed task (TW: "train")

To motivate this task goal, we took inspiration from a theory of second language acquisition called the Natural Approach [11]. This theory states that conversation is the basis of language learning. As our proposal is a task for dialogue systems, it follows this main principle.

There is also a technique some teachers use, named dialogue journals, which relates to our proposal. Peyton [22] describes dialogue journals as written conversations between a teacher and a student, where the teacher avoids acting as an evaluator. Baudrand [2] researched the impact of using this technique in a foreign language class where students had to communicate through the diaries in the target language. While journals are closer to exchanging letters without a clear evaluation purpose, we propose the use of real-time written conversations aiming at the assessment of specific terms.
4 HIDDEN-TASK-ORIENTED DIALOGUE SYSTEMS

Though there is a huge variety of deployed dialogue systems, they are usually classified into one of two categories: task-oriented and non-task-oriented.

Task-oriented dialogue systems are usually topic-constrained, and their goal is to help the user achieve a certain task. Into this category fall reservation, shopping or personal assistant systems like Apple's Siri (https://www.apple.com/siri/) or Google Assistant (https://assistant.google.com/).

On the other hand, non-task-oriented (or conversational) systems are commonly chit-chat dialogue systems whose only purpose is to keep the conversation with the user going. Conversations are usually not restricted to a certain topic; they are considered open-domain or free. Consequently, if such systems want to provide informative responses, large amounts of data are necessary for their implementation. Otherwise, conversations can easily be kept going by giving generic answers that may make the user assume the system understands. Some examples of this kind of system include Microsoft's Japanese chatbot Rinna (https://www.rinna.jp/profile) or ALICE, a chatbot implemented using AIML (Artificial Intelligence Markup Language) [28].

To achieve the task proposed in section 3, we need dialogue systems such that:

• From the user point of view, since we are aiming for free-topic chit-chat conversation, they look like non-task-oriented dialogue systems.
• From the system point of view, as the system has the goal of making the user use a certain target word during the dialogue, they are task-oriented dialogue systems. Their peculiarity is that the task is hidden from the user.

Recently, storytelling dialogue systems have been emerging [20]. They usually interact with the user to reach the end of a story plot, but the dialogue can diverge during the process through questions or ideas from the user. Though they can be considered a hybridisation of task-oriented and non-task-oriented systems and may resemble our proposed dialogue systems, there is a clear difference between them. During the flow of the dialogue, storytelling dialogue systems switch between task-oriented and non-task-oriented interactions. Our proposed systems, in contrast, always offer the same kind of interaction, which merely looks different depending on the dialogue participant role: the user vs. the system. Additionally, considered as a whole, our systems have a clear task, with the peculiarity that this task is hidden from the user. Consequently, we do not consider the term hybrid appropriate and named our proposed systems hidden-task-oriented dialogue systems.

Note that Yoshida [31] also used the term 'hidden task' to describe the dialogue journals technique referenced in section 3. Because the teacher responds naturally while keeping in mind the student's language ability and interests, what the teacher does can be considered a 'hidden task' from the user's point of view.
5 EXPERIMENTS AND RESULTS

5.1 Experimental design

To study the feasibility of the task and to analyse ideas for the implementation of hidden-task-oriented dialogue systems capable of achieving the proposed task, we conducted three different kinds of experiments.

The Question Answering (QA) experiment asks for a word by providing learners with its definition, taken from a dictionary and turned into a question, as shown in the inappropriate example in Figure 1. This experiment does not assess free productive vocabulary, but it serves as a reference and shows how easily learners can recall a specific target word from its definition. Additionally, it can help us detect whether certain words are harder to assess.

In the Wizard of Oz (WOZ) experiment, one member of a pair plays the system role and tries to make their counterpart, who plays the learner role, use the target word in their utterances. System role participants must not reveal their intention nor use the target word in their own utterances. Learner role participants believe they are doing goal-less chatting. The dialogue, for which we did not set a time limit, can be terminated by either participant at any time and is performed through a text chat interface. The aim of this experiment is to show the difficulty of the proposed task for humans and to gather data that may serve to implement the proposed dialogue systems.

The Context Continuation (CC) experiment asks learners to estimate the next utterance given a dialogue context. We made the contexts by extracting sequences of utterances from a human-human dialogue corpus so that the next utterance of each sequence (not shown in the experiment) includes the TW (see the example in Figure 2). This experiment shows whether such a corpus-retrieval approach might work for the implementation of the dialogue systems.

In all the experiments, a task succeeds if the learner uses the TW.

B: Aren't 3 books a little bit expensive?
A: I don't think so.
B: But it is quite a lot, right?
A: (utterance in the original corpus) Well, but if the number of words increases, it makes sense that the price also increases.
A: (success) I think their price is quite appropriate.
A: (failure) I don't think so, but if you do, don't buy them.
(upper: context, middle: corpus continuation, bottom: answer examples)

Figure 2: CC experiment example (TW: "price")

5.2 Material

Language. Our target language is Japanese, but the methodology can apply to any language.

Target words. We selected six nouns as the TWs according to the following criteria. Since we wanted to implement the CC experiment, we selected words that frequently appear in the Nagoya University Conversation Corpus [7], which consists of 129 transcribed dialogues by 161 persons with an approximate total duration of 100 hours. We chose words appearing in utterances with more than two and less than eleven preceding utterances, not counting preceding utterances with fewer than four words unless they contained a noun. We filtered out words categorised into the N1 (the hardest) and N5 (the easiest) levels of the Japanese Language Proficiency Test (JLPT), and further filtered out those having a one-word gloss as their definition in the employed dictionary [16]. We picked these six words from the remaining ones: "kao (face)", "syôgakkô (primary school)", "rokuon (audio recording)", "konpyûtâ (computer)", "tîzu (cheese)" and "fun'iki (atmosphere)".
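These selection criteria lend themselves to a straightforward filter. The sketch below illustrates them under assumed data structures: a corpus of dialogues whose utterances are lists of (surface, part-of-speech) pairs as produced by a morphological analyser, plus hypothetical JLPT-level and gloss lookups; none of these names come from our implementation.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed corpus representation: each dialogue is a list of utterances,
# each utterance a list of (surface, part_of_speech) pairs. The jlpt and
# gloss dictionaries are hypothetical stand-ins for external resources.
Utterance = List[Tuple[str, str]]
Dialogue = List[Utterance]


def counts_as_preceding(utt: Utterance) -> bool:
    """Utterances with fewer than four words only count among the
    preceding utterances when they contain a noun."""
    return len(utt) >= 4 or any(pos == "NOUN" for _, pos in utt)


def candidate_target_words(corpus: List[Dialogue],
                           jlpt: Dict[str, str],
                           gloss: Dict[str, List[str]],
                           top_n: int = 6) -> List[str]:
    freq: Counter = Counter()
    for dialogue in corpus:
        preceding = 0
        for utt in dialogue:
            # Keep nouns from utterances with 3 to 10 preceding utterances.
            if 2 < preceding < 11:
                freq.update(w for w, pos in utt if pos == "NOUN")
            preceding += counts_as_preceding(utt)
    return [w for w, _ in freq.most_common()
            if jlpt.get(w) not in ("N1", "N5")  # drop hardest/easiest levels
            and len(gloss.get(w, [])) > 1       # drop one-word glosses
            ][:top_n]
```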
Participants. We recruited ten native Japanese speakers and divided them into two groups: S and L. Group S performed the QA experiment first and then played the system role in the WOZ experiment, while group L played the learner role in the WOZ experiment and then performed the CC experiment. Each pair performed six dialogues (one per target word). After every WOZ dialogue, group L evaluated the dialogue naturalness.

Group S answered six questions in the QA experiment (one per target word). We explicitly informed participants that they should rely only on their own knowledge and not check any external information source when providing the answers.

Group L continued eighteen contexts (three per target word) in the CC experiment.

Assuming that native speakers have a large enough vocabulary, we can assess the feasibility of our approach itself.

Platform. We designed a system that consists of a Unity (https://unity.com/) application communicating with a Django (https://www.djangoproject.com/) Python server to perform the experiments and gather the data. Participants accessed the system with a given username and password, and the application automatically led them to the appropriate experiment instructions screen. Figure 3 illustrates how dialogue took place in the WOZ experiment.

[Figure 3: Screenshots of the application used to perform the WOZ experiment, system side and user (learner) side (translated from Japanese to English)]

5.3 Results

QA experiment. Table 1 shows the results of the QA experiment.

Table 1: Results of the QA experiment

Target word       | Success rate
"face"            | 5/5
"primary school"  | 5/5
"audio recording" | 4/5
"computer"        | 1/5
"cheese"          | 1/5
"atmosphere"      | 3/5
Total             | 19/30

The success rate (19/30 = 63.3%) is rather low considering that the participants are native speakers, i.e. they should know the target words. In addition, we can observe that the success rate differs across individual words. The gloss we used to ask for a target word was originally written to explain the headword, not vice versa; this directionality may explain the low success rate. For instance, the gloss of "cheese" can be similar to that of other dairy products like yogurt and butter, which are examples of wrong answers given by the participants. For the same reason, we can deduce that the gloss is not specific enough to identify the headword.

WOZ experiment. Table 2 shows the target-word-wise (upper section), pair-wise (middle section) and success/failure-wise (bottom section) statistics of the WOZ experiment.

Table 2: Results of the WOZ experiment

                   | Success rate | Dialogue length (min) | Number of utterances | Naturalness (1-5)
"face"             | 1/5          | 16.4                  | 35.0                 | 3.6
"primary school"   | 3/5          | 14.4                  | 41.2                 | 4.0
"audio recording"  | 1/5          | 15.9                  | 38.4                 | 4.8
"computer"         | 0/4*         | 13.9                  | 32.3                 | 4.0
"cheese"           | 2/4*         | 14.9                  | 22.3                 | 3.4
"atmosphere"       | 3/5          | 13.0                  | 26.4                 | 4.4
Pair 1             | 1/5*         | 18.8                  | 47.0                 | 3.4
Pair 2             | 1/6          | 11.4                  | 18.0                 | 4.7
Pair 3             | 1/5*         | 21.1                  | 64.0                 | 3.8
Pair 4             | 4/6          | 7.3                   | 18.8                 | 4.7
Pair 5             | 3/6          | 13.7                  | 19.8                 | 3.3
Success            |              | 10.3                  | 24.7                 | 4.0
Failure            |              | 16.1                  | 33.4                 | 4.0
Total              | 10/28        | 14.1                  | 30.8                 | 4.0

Dialogue length, number of utterances and naturalness indicate average values across dialogues. Participants accidentally skipped two dialogues (*).

The overall success rate (10/28 = 35.7%) is lower than that of the QA experiment. This suggests that it is harder to make learners think of a specific word within a dialogue. The success rate across words is diverse, but it is not directly related to word difficulty level; it is rather related to the abundance of synonyms. For instance, learner role participants used words like "PC" instead of "computer". Since we strictly required using the exact same word, such synonyms did not lead to success. When assessing learners' productive vocabulary, we need to decide which ability we evaluate, i.e. the ability to express a concept or the ability to use an exact word.
user has just graduated from a school makes it easier to bring a re- The middle section indicates the difference in performance among lated topic into the conversation. Consequently, we should consider the pairs. Pair 4 and 5 performed better than Pair 1, 2 and 3. In par- introducing user modelling into the proposed dialogue systems. ticular, Pair 4 performed the best in terms of both dialogue length Gathering dialogue data. The results of the “Context Continu- and dialogue naturalness. We should aim at realising a dialogue ation (CC) experiment” suggest that the amount of available data system that performs at least as well as Pair 4. is so limited that it is difficult to implement the proposed systems The bottom section indicates that there is no big difference in using a simple retrieval-based approach. We expected that the WOZ naturalness between successful and failed dialogues but failed dia- experiment would also serve to gather dialogue data which would logues tend to be longer. Note that we did not set a time limit for a be more appropriate to implement dialogue systems capable of dialogue in the present experiments and this sometimes leads to performing the proposed task. During the arrangement of the WOZ quite long conversations. The average failed dialogue length would experiment, however, we had difficulties in finding participants and be a good reference for the time limit in future experiments. matching them for the dialogue. There were also some problems CC experiment. Lastly, there was no success case among 90 in during the data gathering process due to internet connection prob- the CC experiment. In terms of linguistic quality of utterances, the lems and platform instability. We plan on developing a simpler and retrieval-based approach has an advantage, but it is hard to retrieve more accessible system to avoid the manual search of participants. an appropriate context from a corpus of this size. To cope with these problems in data gathering, we plan to imple- ment and launch a gamified platform in which players (dialogue 6 DISCUSSION participants) will be automatically matched and try to compete to make their counterparts use the target words. In this gamified Reflections about the proposed task. The results of the WOZ exper- setting, each player takes both the learner and the system role. iment lead to reflections regarding the number of target words and the knowledge about the user. Concerning the number of target Implementing dialogue systems with limited amounts of data. In words, the current experiment systems (Wizards) focus on a single our case, as users will be language learners, system utterances target word at a time. As we can see from the results, it is quite should be grammatically correct. Retrieval-based approaches are hard for systems to succeed in this scenario. One of the reasons advantageous in this respect. As we did in the CC experiment, we is that having just a single target word constrains the freedom of can retrieve contexts from the dialogue corpora that are similar the dialogue, i.e. restricts the choice of topics and the flow of the to the current context and precede an utterance that includes the dialogue. Thus, it becomes difficult to induce the user to use the target word. Then, we can use the previous utterance to the utter- target word. For instance, when the system failed to induce the ance that includes the target word as a system utterance. 
However, insufficient dialogue data might prevent us from retrieving suitable contexts in the first place. To cope with this problem, we need to use query expansion techniques that consider synonyms and words similar to the target word. The contexts retrieved by query expansion, however, might provide system utterances that are irrelevant to the current context at the lexical level. One possibility to resolve this inappropriateness would be adopting the skeleton-to-response method [4], which replaces words unrelated to the context with open slots (skeleton generation) and applies a generative model to fill the slots with appropriate words.

If we also implement the pool of target words mentioned above, we could retrieve a set of contexts for each target word in the pool in parallel. We would then construct the system utterance from the contexts across the different target words. This method should increase the task success rate because we can choose the most appropriately contextualised target word in the pool.
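Combining both ideas, the selection loop might look like the sketch below, where retrieve_scored is assumed to behave like the retrieval sketch above but to also return its best cosine score, and the synonyms lexicon is a hypothetical stand-in for a thesaurus lookup.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed interface: (current context, query word, corpus) -> (reply, score).
Retriever = Callable[[List[str], str, list], Tuple[Optional[str], float]]


def choose_utterance(current_context: List[str], pool: List[str],
                     corpus: list, synonyms: Dict[str, List[str]],
                     retrieve_scored: Retriever
                     ) -> Tuple[Optional[str], Optional[str]]:
    """Pick the (target word, system utterance) pair whose retrieved
    context best matches the current dialogue context."""
    best = (None, None, -1.0)  # (target word, reply, similarity score)
    for tw in pool:
        # Query expansion: a synonym may anchor the retrieval, but the
        # learner is still credited only for the exact TW (cf. section 5).
        for query_word in [tw] + synonyms.get(tw, []):
            reply, score = retrieve_scored(current_context, query_word, corpus)
            if reply is not None and score > best[2]:
                best = (tw, reply, score)
    return best[0], best[1]
```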
7 CONCLUSIONS AND FUTURE WORK

This paper proposed a novel task to assess language learners' free productive vocabulary. The task goal is making learners use a certain word in their utterances during a natural dialogue. It aims to verify whether the word is in the vocabulary learners use (productive) rather than only in the vocabulary they understand (receptive). To perform this task, we proposed a new category of dialogue systems, namely hidden-task-oriented dialogue systems. To study the feasibility of our proposal, we conducted three experiments, including one employing the WOZ approach. The experiments showed that the proposed task is more difficult than a simple QA task asking for the target word, but that it can be achieved by humans to some extent. The results made us reflect on the proposed task and gave us hints for redesigning it. Because we noticed that insufficient dialogue data causes problems in the implementation of the systems, particularly when adopting the retrieval-based approach, we proposed two possible solutions. One option is gathering additional dialogue data through a gamified data gathering platform. The other is enhancing retrieval-based approaches with techniques like query expansion and template filling.

Our future work includes the implementation and evaluation of the proposed dialogue systems. We would also like to develop and deploy a gamified approach to gather more dialogue data. Finally, we also need to investigate how to appropriately create a pool of target words for the systems and implement the mechanism that will adjust it dynamically during the conversations.

ACKNOWLEDGMENTS

This work was supported by JSPS KAKENHI Grant Number JP19H04167.

REFERENCES

[1] Laura K Allen and Danielle S McNamara. 2015. You Are Your Words: Modeling Students' Vocabulary Knowledge with Natural Language Processing Tools. International Educational Data Mining Society (2015).
[2] Lynn Patricia Baudrand-Aertker. 1992. Dialogue Journal Writing in a Foreign Language Classroom: Assessing Communicative Competence and Proficiency. (1992).
[3] Jonathan C Brown, Gwen A Frishkoff, and Maxine Eskenazi. 2005. Automatic question generation for vocabulary assessment. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 819–826.
[4] Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu, Xiaojiang Liu, Wai Lam, and Shuming Shi. 2019. Skeleton-to-response: Dialogue generation guided by retrieval memory. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 1219–1228.
[5] Keelan Evanini, Sandeep Singh, Anastassia Loukina, Xinhao Wang, and Chong Min Lee. 2015. Content-based automated assessment of non-native spoken language proficiency in a simulated conversation. In NIPS Workshop on Machine Learning for Spoken Language Understanding and Interaction.
[6] Luke Fryer and Rollo Carpenter. 2006. Emerging Technologies. Language Learning & Technology 10, 3 (2006), 8–14.
[7] Itsuko Fujimura, Shoju Chiba, and Mieko Ohso. 2012. Lexical and grammatical features of spoken and written Japanese in contrast: Exploring a lexical profiling approach to comparing spoken and written corpora. In Proceedings of the VIIth GSCP International Conference. Speech and Corpora. 393–398.
[8] Michael Heilman and Maxine Eskenazi. 2007. Application of automatic thesaurus extraction for computer generation of vocabulary questions. In Workshop on Speech and Language Technology in Education.
[9] Sviatlana Höhn. 2017. A data-driven model of explanations for a chatbot that helps to practice conversation in a foreign language. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. 395–405.
[10] Jin-Xia Huang, Kyung-Soon Lee, Oh-Woog Kwon, and Young-Kil Kim. 2017. A chatbot for a dialogue-based second language learning system. CALL in a Climate of Change: Adapting to Turbulent Global Conditions (2017), 151.
[11] Stephen D Krashen and Tracy D Terrell. 1983. The Natural Approach: Language Acquisition in the Classroom. Alemany Press.
[12] Batia Laufer and Zahava Goldstein. 2004. Testing vocabulary knowledge: Size, strength, and computer adaptiveness. Language Learning 54, 3 (2004), 399–436.
[13] Batia Laufer and Paul Nation. 1995. Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics 16, 3 (1995), 307–322.
[14] Batia Laufer and Paul Nation. 1999. A vocabulary-size test of controlled productive ability. Language Testing 16, 1 (1999), 33–51.
[15] Diane Litman, Steve Young, Mark Gales, Kate Knill, Karen Ottewell, Rogier van Dalen, and David Vandyke. 2016. Towards using conversations with spoken dialogue systems in the automated assessment of non-native speakers of English. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 270–275.
[16] Akira Matsumura. 2010, 2013. Super Daijirin Japanese Dictionary. Sanseido Co., Ltd.
[17] Paul Meara and Tess Fitzpatrick. 2000. Lex30: An improved method of assessing productive vocabulary in an L2. System 28, 1 (2000), 19–30.
[18] Detmar Meurers and Markus Dickinson. 2017. Evidence and interpretation in language learning research: Opportunities for collaboration with computational linguistics. Language Learning 67, S1 (2017), 66–95.
[19] John Morton. 1979. Word recognition. Psycholinguistics: Series 2. Structures and Processes (1979), 107–156.
[20] Leire Ozaeta and Manuel Graña. 2018. A View of the State of the Art of Dialogue Systems. In International Conference on Hybrid Artificial Intelligence Systems. Springer, 706–715.
[21] P David Pearson, Elfrieda H Hiebert, and Michael L Kamil. 2007. Vocabulary assessment: What we know and what we need to learn. Reading Research Quarterly 42, 2 (2007), 282–296.
[22] Joy Kreeft Peyton. 1997. Dialogue journals: Interactive writing to develop language and literacy. Teacher Librarian 24, 5 (1997), 46.
[23] John Read. 2007. Second language vocabulary assessment: Current practices and new directions. International Journal of English Studies 7, 2 (2007), 105–126.
[24] Katherine A Dougherty Stahl and Marco A Bravo. 2010. Contemporary classroom vocabulary assessment for content areas. The Reading Teacher 63, 7 (2010), 566–578.
[25] David Suendermann-Oeft, Vikram Ramanarayanan, Zhou Yu, Yao Qian, Keelan Evanini, Patrick Lange, Xinhao Wang, and Klaus Zechner. 2017. A Multimodal Dialog System for Language Assessment: Current State and Future Directions. ETS Research Report Series 2017, 1 (2017), 1–7.
[26] Dolça Tellols, Hitoshi Nishikawa, and Takenobu Tokunaga. 2019. Dialogue Systems for the Assessment of Language Learners' Productive Vocabulary. In Proceedings of the 7th International Conference on Human-Agent Interaction. ACM, 223–225.
[27] Wen-Ta Tseng. 2016. Measuring English vocabulary size via computerized adaptive testing. Computers & Education 97 (2016), 69–85.
[28] Richard S Wallace. 2009. The Anatomy of A.L.I.C.E. In Parsing the Turing Test. Springer, 181–210.
[29] Stuart Webb. 2008. Receptive and productive vocabulary sizes of L2 learners. Studies in Second Language Acquisition 30, 1 (2008), 79–95.
[30] Sabrina Wilske. 2015. Form and Meaning in Dialog-based Computer-assisted Language Learning. Ph.D. Dissertation. Universität des Saarlandes.
[31] Kayo Yoshida et al. 2012. Genre-based Tasks and Process Approach in Foreign Language Writing. Language and Culture: The Journal of the Institute for Language and Culture 16 (2012), 89–96.