=Paper=
{{Paper
|id=Vol-2848/user2agent_paper_5
|storemode=property
|title=Assessing Language Learners’ Free Productive Vocabulary with Hidden-task-oriented Dialogue Systems
|pdfUrl=https://ceur-ws.org/Vol-2848/user2agent-paper-4.pdf
|volume=Vol-2848
|authors=Dolça Tellols,Takenobu Tokunaga,Hilofumi Yamamoto
|dblpUrl=https://dblp.org/rec/conf/iui/TellolsTY20
}}
==Assessing Language Learners’ Free Productive Vocabulary with Hidden-task-oriented Dialogue Systems==
Assessing Language Learners' Free Productive Vocabulary with Hidden-task-oriented Dialogue Systems

Dolça Tellols, Takenobu Tokunaga, Hilofumi Yamamoto
Tokyo Institute of Technology, Tokyo, Japan
tellols.d.aa@m.titech.ac.jp, take@c.titech.ac.jp, yamagen@ila.titech.ac.jp

ABSTRACT

This paper proposes a new task to assess language learners' free productive vocabulary, which is related to being able to articulate certain words without getting explicit hints about them. To perform the task, we propose the use of a new kind of dialogue system which induces learners to use specific words during a natural conversation, to assess whether those words are part of their free productive vocabulary. Though the systems have a task, it is hidden from the users, who may therefore consider the systems task-less. Because these systems do not fall into the existing categories for dialogue systems (task-oriented and non-task-oriented), we name them hidden-task-oriented dialogue systems. To study the feasibility of our approach, we conducted three experiments. The Question Answering experiment evaluated how easily learners could recall a target word from its dictionary gloss. Through the Wizard of Oz experiment, we confirmed that the proposed task is hard, but humans can achieve it to some extent. Finally, the Context Continuation experiment showed that a simple corpus-retrieval approach might not work to implement the proposed dialogue systems. In this work, we analyse the experiment results in detail and discuss the implementation of dialogue systems capable of performing the proposed task.

CCS CONCEPTS

• Computing methodologies → Intelligent agents; • Applied computing → Education.

KEYWORDS

Computer Aided Language Learning, Dialogue Systems, Productive Vocabulary

ACM Reference Format:
Dolça Tellols, Takenobu Tokunaga, and Hilofumi Yamamoto. 2020. Assessing Language Learners' Free Productive Vocabulary with Hidden-task-oriented Dialogue Systems. In IUI '20 Workshops, March 17, 2020, Cagliari, Italy. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

Second language (L2) learning has attracted much attention in recent years since revitalised Artificial Intelligence (AI) research opened the door to more sophisticated Intelligent Computer Assisted Language Learning (ICALL) [18]. Among other topics, vocabulary assessment by computers has been an active research area, with studies focusing on the automatic generation of vocabulary evaluation questions [3, 8] or the measurement of vocabulary size through computerised adaptive testing (CAT) [27]. However, these studies concern the assessment of receptive vocabulary, which is used to comprehend texts or utterances. In contrast, there is a lack of studies on the computerised assessment of productive vocabulary, which is used to speak and write [29]. From the viewpoint of linguistic proficiency, receptive vocabulary is related to language understanding and productive vocabulary to language production. It is said that there is a gap between understanding the meaning of a particular word (passive or receptive vocabulary) and being able to articulate it (active or productive vocabulary) [12].

Although there exist many approaches to evaluate receptive vocabulary, studies that focus on the assessment of productive vocabulary are scarce. Meara and Fitzpatrick [17] and Laufer and Nation [13], who propose the Lex30 task and the LFP (Lexical Frequency Profile) measure respectively, are two exceptions. Lex30 is a word association task where learners have to provide words given a stimulus word. LFP measures vocabulary size based on the proportion of words at different vocabulary-frequency levels that learners use in their writing.

It is considered that productive ability may comprise different degrees of knowledge. We refer to the ability to use a word at one's free will as free productive ability, while controlled productive ability refers to the ability to use a word when driven to do so [14]. Fill-in-the-blank tasks evaluate controlled productive ability and, though the Lex30 task aims to assess free productive ability, its stimulus words make it controlled to some extent. We can use the Lexical Frequency Profile to measure free productive vocabulary size, but it cannot determine whether learners are capable of freely using specific words.

Ideally, we would assess free productive ability in conversational contexts, but this complicates the design of tasks for this purpose even more. Speaking tests used in language certification exams are one option to overcome this deficiency, but they require human resources for the evaluation and hardly ever specify words to test whether learners can use them. Suendermann-Oeft et al. [25] tried to solve the human resource problem by replacing the evaluators with a multi-modal dialogue system, but they do not provide a solution to the latter problem, the evaluation of specific words.

Against this backdrop, the present work proposes a new task for dialogue systems to evaluate free productive vocabulary by inducing learners to naturally use the words to assess during a conversation, without providing explicit hints about them. Our hypothesis for the assessment is that a certain set of words forms part of people's free productive vocabulary if they can naturally use those words in a conversation without having been asked explicitly to do so.

Dialogue systems are usually divided into two categories: task-oriented and non-task-oriented. Systems capable of performing the proposed task can be considered non-task-oriented from the user point of view and task-oriented from the system point of view (though the task is hidden from the user). Given the asymmetrical nature of the proposed systems, it is hard to fit them into one of the available categories. Consequently, we propose a new one named hidden-task-oriented dialogue systems.
We further explain this new category in section 4.

In our previous work, we briefly presented the proposed task and investigated some of the difficulties that its implementation may have to deal with [26]. In this work, we review the experiments and expand them. Additionally, we analyse the requirements for the design of dialogue systems capable of performing the proposed task and discuss the techniques that we may use for their implementation, which we leave as future work.

2 RELATED WORK

Recent studies on vocabulary assessment concern various aspects, e.g. asking words with or without a context, and different forms of questions, e.g. multiple-choice or fill-in-the-blank questions [23, 24]. Others also point out the importance of domain when assessing lexical knowledge [21]. We focus on the distinction between receptive and productive vocabulary and, more specifically, propose a new method for assessing language learners' free productive vocabulary through dialogue systems.

Laufer and Nation [14] proposed evaluating controlled productive vocabulary by using sentence completion tasks where they gave some initial letters of the target word. However, this technique is controversial because it may assess receptive vocabulary instead of productive vocabulary, as the initial letters provide a hint for guessing the target words [19]. Others used translation tasks that ask learners to translate L1 (mother tongue) expressions into L2 (the language being learned) [29]. The problem with this approach is that tests need to be adapted to the L1 of the learners. Moreover, target words need to be chosen carefully to ensure that learners use the expected target word and not a synonym. In our proposal, we do not plan on giving any explicit hints for the target words, and no adaptation to the L1 is needed, since dialogues will be conducted directly in the L2.

Regarding computer-assisted vocabulary assessment, Brown et al. [3] and Heilman and Eskenazi [8] studied the automatic generation of vocabulary assessment questions, and Tseng [27] focused on the measurement of English learners' vocabulary size. Allen and McNamara [1] utilised Natural Language Processing (NLP) tools to analyse the lexical sophistication of learners' essays to estimate their vocabulary size. They also pointed out the importance of providing personalised instruction to each learner. We take this aspect into account by controlling dialogue topics according to the learner's interests and the words being assessed.

Fryer and Carpenter [6] discuss the possibility of utilising dialogue systems in language education. Nowadays, many commercial language learning applications provide conversations with chatbots, e.g. Duolingo Bots (http://bots.duolingo.com), Andy (https://andychatbot.com), Mondly (https://www.mondly.com) and Eggbun Education (https://web.eggbun.net). However, most of them base their interactions on predefined answers or have a rigidly guided task-oriented dialogue. Research-level systems are more versatile than commercial ones. As an example, Genie Tutor [10] is a dialogue-based language learning system designed for native Korean speakers learning English. It accepts free text input in a given scenario and can respond by retrieving utterances from a dialogue corpus based on their context similarity. Höhn [9] introduces an Artificial Intelligence Markup Language (AIML)-based chatbot for conversational practice, which recognises repair initiations and generates repair carry-outs. Wilske [30] also examines how NLP, and particularly dialogue systems, can contribute to language learning; in her dialogue system, learners can receive feedback on their utterances.

Research on automated language proficiency evaluation through dialogue is scarce. Some studies include the assessment of the verbal skill of English learners through task-oriented dialogues [15] or through simulated conversations [5]. There is also the already mentioned proposal of a multimodal dialogue system for the evaluation of English learners' speech capabilities [25]. Our contribution is a new free productive vocabulary assessment methodology in the form of a new task for dialogue systems. Because our dialogue systems do not fall into any of the existing categories (task-oriented and non-task-oriented), we propose a new one named hidden-task-oriented dialogue systems.

3 PROPOSED TASK TO ASSESS FREE PRODUCTIVE VOCABULARY

3.1 Hypothesis

This work is based on the following hypothesis: "If a person can naturally use a certain word during a conversation, we can assume that it belongs to their free productive vocabulary".

3.2 Task goal

Taking this hypothesis into consideration, we propose a new task for dialogue systems (DS) that will be used to evaluate free productive vocabulary. The goal of the task is to induce learners to naturally use certain target words (TWs) during a conversation by generating an appropriate dialogue context. Directly asking for the words or providing explicit hints about them is prohibited. Figure 1 illustrates appropriate and inappropriate examples of the DS behaviour.

Appropriate:
  S: I think I want to travel somewhere. What would you recommend to me?
  L: You could go to London.
  S: Nice idea! How could I get there?
  L: I think nowadays you can go by plane or train.
Inappropriate:
  S: How do you call a railway vehicle that is self-propelled on a track carrying people and luggage thanks to electricity?
  L: A train.
(S: system, L: learner)

Figure 1: Appropriate and inappropriate dialogue examples of the proposed task (TW: "train")

To motivate this task goal, we took inspiration from a theory of second language acquisition called the Natural Approach [11]. This theory states that conversation is the basis of language learning. As our proposal is a task for dialogue systems, it follows this main principle.

There is also a technique some teachers use, named dialogue journals, which relates to our proposal. Peyton [22] describes dialogue journals as written conversations between a teacher and a student, where the teacher avoids acting as an evaluator. Baudrand [2] researched the impact of using this technique in a foreign language class where students had to communicate through the diaries in the target language. While journals are closer to exchanging letters without a clear evaluation purpose, we propose the use of real-time written conversations aiming at the assessment of specific terms.
4 HIDDEN-TASK-ORIENTED DIALOGUE SYSTEMS

Though there is a huge variety of deployed dialogue systems, they are usually classified into one of two categories: task-oriented and non-task-oriented.

Task-oriented dialogue systems are usually topic-constrained, and their goal is to help the user achieve a certain task. Into this category fall reservation, shopping or personal assistant systems like Apple's Siri (https://www.apple.com/siri/) or Google Assistant (https://assistant.google.com/).

On the other hand, non-task-oriented (or conversational) systems are commonly chit-chat dialogue systems whose only purpose is to keep the conversation with the user going. Conversations are usually not restricted to a certain topic; they are considered open-domain or free. Consequently, if such systems want to provide informative responses, large amounts of data are necessary for their implementation. Otherwise, conversations can easily be kept going by giving generic answers that may make the user assume the system understands. Some examples of this kind of system include Microsoft's Japanese chatbot Rinna (https://www.rinna.jp/profile) or ALICE, a chatbot implemented using AIML (Artificial Intelligence Markup Language) [28].

To achieve the task proposed in section 3, we need dialogue systems such that:

• From the user point of view, since we are aiming for free-topic chit-chat conversation, they look like non-task-oriented dialogue systems.
• From the system point of view, as the system has the goal of making the user use a certain target word during the dialogue, they are task-oriented dialogue systems. Their peculiarity is that the task is hidden from the user.

Recently, storytelling dialogue systems have been emerging [20]. They usually interact with the user to reach the end of a story plot, but the dialogue can diverge during the process through questions or ideas from the user. Though they can be considered a hybridisation of task-oriented and non-task-oriented systems and may resemble our proposed dialogue systems, there is a clear difference between them. During the flow of the dialogue, storytelling dialogue systems switch between task-oriented and non-task-oriented interactions. Our proposed systems, in contrast, always offer the same kind of interaction, which merely looks different depending on the dialogue participant role: the user vs. the system. Additionally, considered as a whole, our systems have a clear task, with the peculiarity that this task is hidden from the user. Consequently, we do not consider the term hybrid appropriate and named our proposed systems hidden-task-oriented dialogue systems.

Note that Yoshida [31] also used the term 'hidden task' to describe the dialogue journals technique referenced in section 3. Because the teacher responds naturally while keeping in mind the student's language ability and interests, what the teacher does can be considered a 'hidden task' from the user's point of view.
5 EXPERIMENTS AND RESULTS

5.1 Experimental design

To study the feasibility of the task and to analyse ideas for the implementation of hidden-task-oriented dialogue systems capable of achieving the proposed task, we conducted three different kinds of experiments.

The Question Answering (QA) experiment asks for a word by providing learners with its definition, taken from a dictionary and turned into a question, as shown in the inappropriate example in Figure 1. This experiment does not assess free productive vocabulary, but it serves as a reference and shows how easily learners can recall a specific target word from its definition. Additionally, it can help us detect whether certain words are harder to assess.

In the Wizard of Oz (WOZ) experiment, one member of a pair plays the system role and tries to make their counterpart, who plays the learner role, use the target word in their utterances. System role participants must not reveal their intention nor use the target word in their own utterances. Learner role participants believe they are doing goal-less chatting. The dialogue, for which we did not set a time limit, can be terminated by either participant at any time and is performed through a text chat interface. The aim of this experiment is to show the difficulty of the proposed task for humans and to gather data that may serve to implement the proposed dialogue systems.

The Context Continuation (CC) experiment asks learners to estimate the next utterance given a dialogue context. We made the contexts by extracting sequences of utterances from a human-human dialogue corpus so that the next utterance of each sequence (not shown in the experiment) includes the TW (see the example in Figure 2). This experiment shows whether such a corpus-retrieval approach might work for the implementation of the dialogue systems.

In all the experiments, a task succeeds if the learner uses the TW.

B: Aren't 3 books a little bit expensive?
A: I don't think so.
B: But it is quite a lot, right?
A: (utterance in the original corpus) Well, but if the number of words increases, it makes sense that the price also increases.
A: (success) I think their price is quite appropriate.
A: (failure) I don't think so, but if you do, don't buy them.
(upper: context, middle: corpus continuation, bottom: answer examples)

Figure 2: CC experiment example (TW: "price")

5.2 Material

Language. Our target language is Japanese, but the methodology can apply to any language.

Target words. We selected six nouns as the TWs according to the following criteria. Since we wanted to implement the CC experiment, we selected words that frequently appear in the Nagoya University Conversation Corpus [7], which consists of 129 transcribed dialogues by 161 persons with an approximate total duration of 100 hours. We chose words appearing in utterances with more than two and less than eleven preceding utterances, not counting preceding utterances with fewer than four words unless they contained a noun. We filtered out words categorised into the N1 (the hardest) and N5 (the easiest) levels of the Japanese Language Proficiency Test (JLPT), and further filtered out those having a one-word gloss as their definition in the employed dictionary [16]. We picked these six words from the remaining ones: "kao (face)", "syôgakkô (primary school)", "rokuon (audio recording)", "konpyûtâ (computer)", "tîzu (cheese)" and "fun'iki (atmosphere)".
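These selection criteria lend themselves to a straightforward filter. The sketch below illustrates them under assumed data structures: a corpus of dialogues whose utterances are lists of (surface, part-of-speech) pairs as produced by a morphological analyser, plus hypothetical JLPT-level and gloss lookups; none of these names come from our implementation.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed corpus representation: each dialogue is a list of utterances,
# each utterance a list of (surface, part_of_speech) pairs. The jlpt and
# gloss dictionaries are hypothetical stand-ins for external resources.
Utterance = List[Tuple[str, str]]
Dialogue = List[Utterance]


def counts_as_preceding(utt: Utterance) -> bool:
    """Utterances with fewer than four words only count among the
    preceding utterances when they contain a noun."""
    return len(utt) >= 4 or any(pos == "NOUN" for _, pos in utt)


def candidate_target_words(corpus: List[Dialogue],
                           jlpt: Dict[str, str],
                           gloss: Dict[str, List[str]],
                           top_n: int = 6) -> List[str]:
    freq: Counter = Counter()
    for dialogue in corpus:
        preceding = 0
        for utt in dialogue:
            # Keep nouns from utterances with 3 to 10 preceding utterances.
            if 2 < preceding < 11:
                freq.update(w for w, pos in utt if pos == "NOUN")
            preceding += counts_as_preceding(utt)
    return [w for w, _ in freq.most_common()
            if jlpt.get(w) not in ("N1", "N5")  # drop hardest/easiest levels
            and len(gloss.get(w, [])) > 1       # drop one-word glosses
            ][:top_n]
```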
Participants. We recruited ten native Japanese speakers and divided them into two groups: S and L. Group S performed the QA experiment first and then played the system role in the WOZ experiment, while group L played the learner role in the WOZ experiment and then performed the CC experiment. Each pair performed six dialogues (one per target word). After every WOZ dialogue, group L evaluated the dialogue naturalness.

Group S answered six questions in the QA experiment (one per target word). We explicitly informed participants that they should rely only on their own knowledge and not check any external information source when providing the answers.

Group L continued eighteen contexts (three per target word) in the CC experiment.

Assuming that native speakers have a large enough vocabulary, we can assess the feasibility of our approach itself.

Platform. We designed a system that consists of a Unity (https://unity.com/) application communicating with a Django (https://www.djangoproject.com/) Python server to perform the experiments and gather the data. Participants accessed the system with a given username and password, and the application automatically led them to the appropriate experiment instructions screen. Figure 3 illustrates how dialogue took place in the WOZ experiment.

[Figure 3: Screenshots of the application used to perform the WOZ experiment, system side and user (learner) side (translated from Japanese to English)]

5.3 Results

QA experiment. Table 1 shows the results of the QA experiment.

Table 1: Results of the QA experiment

Target word       | Success rate
"face"            | 5/5
"primary school"  | 5/5
"audio recording" | 4/5
"computer"        | 1/5
"cheese"          | 1/5
"atmosphere"      | 3/5
Total             | 19/30

The success rate (19/30 = 63.3%) is rather low considering that the participants are native speakers, i.e. they should know the target words. In addition, we can observe that the success rate differs across individual words. The gloss we used to ask for a target word was originally written to explain the headword, not vice versa; this directionality may explain the low success rate. For instance, the gloss of "cheese" can be similar to that of other dairy products like yogurt and butter, which are examples of wrong answers given by the participants. For the same reason, we can deduce that the gloss is not specific enough to identify the headword.

WOZ experiment. Table 2 shows the target-word-wise (upper section), pair-wise (middle section) and success/failure-wise (bottom section) statistics of the WOZ experiment.

Table 2: Results of the WOZ experiment

                   | Success rate | Dialogue length (min) | Number of utterances | Naturalness (1-5)
"face"             | 1/5          | 16.4                  | 35.0                 | 3.6
"primary school"   | 3/5          | 14.4                  | 41.2                 | 4.0
"audio recording"  | 1/5          | 15.9                  | 38.4                 | 4.8
"computer"         | 0/4*         | 13.9                  | 32.3                 | 4.0
"cheese"           | 2/4*         | 14.9                  | 22.3                 | 3.4
"atmosphere"       | 3/5          | 13.0                  | 26.4                 | 4.4
Pair 1             | 1/5*         | 18.8                  | 47.0                 | 3.4
Pair 2             | 1/6          | 11.4                  | 18.0                 | 4.7
Pair 3             | 1/5*         | 21.1                  | 64.0                 | 3.8
Pair 4             | 4/6          | 7.3                   | 18.8                 | 4.7
Pair 5             | 3/6          | 13.7                  | 19.8                 | 3.3
Success            |              | 10.3                  | 24.7                 | 4.0
Failure            |              | 16.1                  | 33.4                 | 4.0
Total              | 10/28        | 14.1                  | 30.8                 | 4.0

Dialogue length, number of utterances and naturalness indicate average values across dialogues. Participants accidentally skipped two dialogues (*).

The overall success rate (10/28 = 35.7%) is lower than that of the QA experiment. This suggests that it is harder to make learners think of a specific word within a dialogue. The success rate across words is diverse, but it is not directly related to word difficulty level; it is rather related to the abundance of synonyms. For instance, learner role participants used words like "PC" instead of "computer". Since we strictly required using the exact same word, such synonyms did not lead to success. When assessing learners' productive vocabulary, we need to decide which ability we evaluate, i.e. the ability to express a concept or the ability to use an exact word.
user has just graduated from a school makes it easier to bring a re- The middle section indicates the difference in performance among lated topic into the conversation. Consequently, we should consider the pairs. Pair 4 and 5 performed better than Pair 1, 2 and 3. In par- introducing user modelling into the proposed dialogue systems. ticular, Pair 4 performed the best in terms of both dialogue length Gathering dialogue data. The results of the “Context Continu- and dialogue naturalness. We should aim at realising a dialogue ation (CC) experiment” suggest that the amount of available data system that performs at least as well as Pair 4. is so limited that it is difficult to implement the proposed systems The bottom section indicates that there is no big difference in using a simple retrieval-based approach. We expected that the WOZ naturalness between successful and failed dialogues but failed dia- experiment would also serve to gather dialogue data which would logues tend to be longer. Note that we did not set a time limit for a be more appropriate to implement dialogue systems capable of dialogue in the present experiments and this sometimes leads to performing the proposed task. During the arrangement of the WOZ quite long conversations. The average failed dialogue length would experiment, however, we had difficulties in finding participants and be a good reference for the time limit in future experiments. matching them for the dialogue. There were also some problems CC experiment. Lastly, there was no success case among 90 in during the data gathering process due to internet connection prob- the CC experiment. In terms of linguistic quality of utterances, the lems and platform instability. We plan on developing a simpler and retrieval-based approach has an advantage, but it is hard to retrieve more accessible system to avoid the manual search of participants. an appropriate context from a corpus of this size. To cope with these problems in data gathering, we plan to imple- ment and launch a gamified platform in which players (dialogue 6 DISCUSSION participants) will be automatically matched and try to compete to make their counterparts use the target words. In this gamified Reflections about the proposed task. The results of the WOZ exper- setting, each player takes both the learner and the system role. iment lead to reflections regarding the number of target words and the knowledge about the user. Concerning the number of target Implementing dialogue systems with limited amounts of data. In words, the current experiment systems (Wizards) focus on a single our case, as users will be language learners, system utterances target word at a time. As we can see from the results, it is quite should be grammatically correct. Retrieval-based approaches are hard for systems to succeed in this scenario. One of the reasons advantageous in this respect. As we did in the CC experiment, we is that having just a single target word constrains the freedom of can retrieve contexts from the dialogue corpora that are similar the dialogue, i.e. restricts the choice of topics and the flow of the to the current context and precede an utterance that includes the dialogue. Thus, it becomes difficult to induce the user to use the target word. Then, we can use the previous utterance to the utter- target word. For instance, when the system failed to induce the ance that includes the target word as a system utterance. 
However, insufficient dialogue data might prevent us from retrieving suitable contexts in the first place. To cope with this problem, we need to use query expansion techniques that consider synonyms and words similar to the target word. The contexts retrieved by query expansion, however, might provide system utterances that are irrelevant to the current context at the lexical level. One possibility to resolve this inappropriateness would be adopting the skeleton-to-response method [4], which replaces words unrelated to the context with open slots (skeleton generation) and applies a generative model to fill the slots with appropriate words.

If we also implement the pool of target words mentioned above, we could retrieve a set of contexts for each target word in the pool in parallel. We would then construct the system utterance from the contexts across the different target words. This method should increase the task success rate because we can choose the most appropriately contextualised target word in the pool.
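Combining both ideas, the selection loop might look like the sketch below, where retrieve_scored is assumed to behave like the retrieval sketch above but to also return its best cosine score, and the synonyms lexicon is a hypothetical stand-in for a thesaurus lookup.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed interface: (current context, query word, corpus) -> (reply, score).
Retriever = Callable[[List[str], str, list], Tuple[Optional[str], float]]


def choose_utterance(current_context: List[str], pool: List[str],
                     corpus: list, synonyms: Dict[str, List[str]],
                     retrieve_scored: Retriever
                     ) -> Tuple[Optional[str], Optional[str]]:
    """Pick the (target word, system utterance) pair whose retrieved
    context best matches the current dialogue context."""
    best = (None, None, -1.0)  # (target word, reply, similarity score)
    for tw in pool:
        # Query expansion: a synonym may anchor the retrieval, but the
        # learner is still credited only for the exact TW (cf. section 5).
        for query_word in [tw] + synonyms.get(tw, []):
            reply, score = retrieve_scored(current_context, query_word, corpus)
            if reply is not None and score > best[2]:
                best = (tw, reply, score)
    return best[0], best[1]
```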
7 CONCLUSIONS AND FUTURE WORK

This paper proposed a novel task to assess language learners' free productive vocabulary. The task goal is making learners use a certain word in their utterances during a natural dialogue. It aims to verify whether the word is in the vocabulary learners use (productive) rather than only in the vocabulary they understand (receptive). To perform this task, we proposed a new category of dialogue systems, namely hidden-task-oriented dialogue systems. To study the feasibility of our proposal, we conducted three experiments, including one employing the WOZ approach. The experiments showed that the proposed task is more difficult than a simple QA task asking for the target word, but that it can be achieved by humans to some extent. The results made us reflect on the proposed task and gave us hints for redesigning it. Because we noticed that insufficient dialogue data causes problems in the implementation of the systems, particularly when adopting the retrieval-based approach, we proposed two possible solutions. One option is gathering additional dialogue data through a gamified data gathering platform. The other is enhancing retrieval-based approaches with techniques like query expansion and template filling.

Our future work includes the implementation and evaluation of the proposed dialogue systems. We would also like to develop and deploy a gamified approach to gather more dialogue data. Finally, we also need to investigate how to appropriately create a pool of target words for the systems and implement the mechanism that will adjust it dynamically during the conversations.

ACKNOWLEDGMENTS

This work was supported by JSPS KAKENHI Grant Number JP19H04167.

REFERENCES

[1] Laura K Allen and Danielle S McNamara. 2015. You Are Your Words: Modeling Students' Vocabulary Knowledge with Natural Language Processing Tools. International Educational Data Mining Society (2015).
[2] Lynn Patricia Baudrand-Aertker. 1992. Dialogue Journal Writing in a Foreign Language Classroom: Assessing Communicative Competence and Proficiency. (1992).
[3] Jonathan C Brown, Gwen A Frishkoff, and Maxine Eskenazi. 2005. Automatic question generation for vocabulary assessment. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 819–826.
[4] Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu, Xiaojiang Liu, Wai Lam, and Shuming Shi. 2019. Skeleton-to-response: Dialogue generation guided by retrieval memory. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 1219–1228.
[5] Keelan Evanini, Sandeep Singh, Anastassia Loukina, Xinhao Wang, and Chong Min Lee. 2015. Content-based automated assessment of non-native spoken language proficiency in a simulated conversation. In NIPS Workshop on Machine Learning for Spoken Language Understanding and Interaction.
[6] Luke Fryer and Rollo Carpenter. 2006. Emerging Technologies. Language Learning & Technology 10, 3 (2006), 8–14.
[7] Itsuko Fujimura, Shoju Chiba, and Mieko Ohso. 2012. Lexical and grammatical features of spoken and written Japanese in contrast: Exploring a lexical profiling approach to comparing spoken and written corpora. In Proceedings of the VIIth GSCP International Conference. Speech and Corpora. 393–398.
[8] Michael Heilman and Maxine Eskenazi. 2007. Application of automatic thesaurus extraction for computer generation of vocabulary questions. In Workshop on Speech and Language Technology in Education.
[9] Sviatlana Höhn. 2017. A data-driven model of explanations for a chatbot that helps to practice conversation in a foreign language. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. 395–405.
[10] Jin-Xia Huang, Kyung-Soon Lee, Oh-Woog Kwon, and Young-Kil Kim. 2017. A chatbot for a dialogue-based second language learning system. CALL in a Climate of Change: Adapting to Turbulent Global Conditions (2017), 151.
[11] Stephen D Krashen and Tracy D Terrell. 1983. The Natural Approach: Language Acquisition in the Classroom. Alemany Press.
[12] Batia Laufer and Zahava Goldstein. 2004. Testing vocabulary knowledge: Size, strength, and computer adaptiveness. Language Learning 54, 3 (2004), 399–436.
[13] Batia Laufer and Paul Nation. 1995. Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics 16, 3 (1995), 307–322.
[14] Batia Laufer and Paul Nation. 1999. A vocabulary-size test of controlled productive ability. Language Testing 16, 1 (1999), 33–51.
[15] Diane Litman, Steve Young, Mark Gales, Kate Knill, Karen Ottewell, Rogier van Dalen, and David Vandyke. 2016. Towards using conversations with spoken dialogue systems in the automated assessment of non-native speakers of English. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 270–275.
[16] Akira Matsumura. 2010, 2013. Super Daijirin Japanese Dictionary. Sanseido Co., Ltd.
[17] Paul Meara and Tess Fitzpatrick. 2000. Lex30: An improved method of assessing productive vocabulary in an L2. System 28, 1 (2000), 19–30.
[18] Detmar Meurers and Markus Dickinson. 2017. Evidence and interpretation in language learning research: Opportunities for collaboration with computational linguistics. Language Learning 67, S1 (2017), 66–95.
[19] John Morton. 1979. Word recognition. Psycholinguistics: Series 2. Structures and Processes (1979), 107–156.
[20] Leire Ozaeta and Manuel Graña. 2018. A View of the State of the Art of Dialogue Systems. In International Conference on Hybrid Artificial Intelligence Systems. Springer, 706–715.
[21] P David Pearson, Elfrieda H Hiebert, and Michael L Kamil. 2007. Vocabulary assessment: What we know and what we need to learn. Reading Research Quarterly 42, 2 (2007), 282–296.
[22] Joy Kreeft Peyton. 1997. Dialogue journals: Interactive writing to develop language and literacy. Teacher Librarian 24, 5 (1997), 46.
[23] John Read. 2007. Second language vocabulary assessment: Current practices and new directions. International Journal of English Studies 7, 2 (2007), 105–126.
[24] Katherine A Dougherty Stahl and Marco A Bravo. 2010. Contemporary classroom vocabulary assessment for content areas. The Reading Teacher 63, 7 (2010), 566–578.
[25] David Suendermann-Oeft, Vikram Ramanarayanan, Zhou Yu, Yao Qian, Keelan Evanini, Patrick Lange, Xinhao Wang, and Klaus Zechner. 2017. A Multimodal Dialog System for Language Assessment: Current State and Future Directions. ETS Research Report Series 2017, 1 (2017), 1–7.
[26] Dolça Tellols, Hitoshi Nishikawa, and Takenobu Tokunaga. 2019. Dialogue Systems for the Assessment of Language Learners' Productive Vocabulary. In Proceedings of the 7th International Conference on Human-Agent Interaction. ACM, 223–225.
[27] Wen-Ta Tseng. 2016. Measuring English vocabulary size via computerized adaptive testing. Computers & Education 97 (2016), 69–85.
[28] Richard S Wallace. 2009. The Anatomy of A.L.I.C.E. In Parsing the Turing Test. Springer, 181–210.
[29] Stuart Webb. 2008. Receptive and productive vocabulary sizes of L2 learners. Studies in Second Language Acquisition 30, 1 (2008), 79–95.
[30] Sabrina Wilske. 2015. Form and Meaning in Dialog-based Computer-assisted Language Learning. Ph.D. Dissertation. Universität des Saarlandes.
[31] Kayo Yoshida et al. 2012. Genre-based Tasks and Process Approach in Foreign Language Writing. Language and Culture: The Journal of the Institute for Language and Culture 16 (2012), 89–96.