<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Be More Eloquent, Professor ELIZA - Comparison of Utterance Generation Methods for Artificial Second Language Tutor</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Taku Nakamura</string-name>
          <email>tnakamura@ecei.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafal Rzepka</string-name>
          <email>rzepka@ist.hokudai.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kenji Araki</string-name>
          <email>araki@ist.hokudai.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kentaro Inui</string-name>
          <email>inui@ecei.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate School of Information Science, Tohoku University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Graduate School of Information and Technology, Hokkaido University</institution>
        </aff>
      </contrib-group>
      <fpage>34</fpage>
      <lpage>41</lpage>
      <abstract>
        <p>This paper presents utterance generation methods for artificial foreign language tutors and discusses some problems of more autonomous educational tools. To tackle the problem of keeping learners interested, we propose a hybrid, half automatic (for semantics), half rule-based (for syntax) approach that utilizes topic expansion by retrieving the conversational subjects related to users' utterances. We compared the utterances generated by our methods with those of other dialogue systems. The evaluation results show that the topic expansion enriches the vocabulary of the utterances. On the other hand, ELIZA-like confirmations and follow-ups were preferred by Japanese subjects for practicing conversational English. Although our project is in its initial stage, we have decided to share our findings and thoughts on autonomy resulting from various trials, and thereby spark a discussion on the pros and cons of the next generation of teaching applications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Applications supporting second language acquisition have
evolved from simple flashcards for memorizing words to
more sophisticated tools using gamification, voice analysis,
etc. Computer applications and online socializing help to
improve stickiness [Chen, 2014], the lack of which is often one of the
biggest obstacles on the way to mastering a given topic. This
problem is visible in software solutions that demand no
involvement from tutors or peers (the majority of
self-study mobile applications), but software-led
teaching is preferable in scenarios where learners wish to improve
their skills without feeling ashamed. Our task, helping Japanese
learners practice their communication skills in English, is an
example of such a scenario. Japanese students are not eager to use
the language in everyday life for social and cultural reasons
[Doyon, 2000], although they are often interested in foreign
languages and possess wide knowledge of grammar and
vocabulary. An artificial tutor is one possible solution, and we
decided to start a project aiming to create a chat system that
could be not only a conversational partner, but also a second
language acquisition supporter that learns user preferences
(from language level to hobbies). However, there are difficult
problems to be solved. First, the system must be linguistically
correct. Second, the autonomy level of the software is
important when using external corpora as its world knowledge.
Presumably, conversation controlled by artificial templates must
be balanced with the learning of a user’s preferences and topic
retrieval from big textual data, which is more interesting
but can be dangerous when left completely uncontrolled
        <xref ref-type="bibr" rid="ref8">(as
in the case of the Tay bot from Microsoft [Lee, 2016])</xref>
        .
      </p>
      <p>The present paper introduces our prototype methods,
which focus on providing responses shaped by learners’
utterances, based on rules and comparatively reliable
knowledge resources. It does not necessarily extend the
state-of-the-art techniques in the language generation domain per se,
but we believe it is more efficient for this specific educational
purpose. We compare various utterance generation methods,
present experimental results and discuss other findings,
including user preferences for error corrections.</p>
      <p>This paper concludes with ideas for measures that could be
implemented to maintain a balance between interesting and
potentially dangerous Web-based tutors.</p>
    </sec>
    <sec id="sec-2">
      <title>Traditional vs. Web-based Dialogue Systems</title>
      <p>Well-known chatbots are ELIZA [Weizenbaum, 1966] and
ALICEBOT<sup>1</sup>. ELIZA can respond to any input, but never
provides new topics related to the user’s utterances.
ALICEBOT responds based on manually created databases that
already exist. Although creating or extending databases will
expand conversational topics, it is costly and nearly
impossible to build a database that covers many fields and a broad
range of users’ interests.</p>
      <p>Modalin [Higuchi et al., 2008] is a Japanese text-based
dialogue system that uses word associations retrieved from the
Web and randomly adds modality to generated utterances. To
sustain motivated conversation with users, the Modalin
system generates input-related utterances using word
associations. Presuming that a similar approach could enhance the
conversation opportunities for English learners, we adopt the
idea of word associations in our proposed system.</p>
    </sec>
    <sec id="sec-3">
      <title>System for Language Learning</title>
      <p>Jia [Jia, 2009] developed the CSIEC (Computer Simulation in
Educational Communication) system, which offers multiple functions
for English learning, including a chatbot as a conversational
partner.</p>
      <sec id="sec-3-1">
        <title>1 http://www.alicebot.org</title>
        <p>The CSIEC system has a free conversation function based
on textual knowledge and reasoning, which aims to overcome
the problem of ELIZA-like systems: they require numerous
predefined patterns fitted to the various utterances of users.
The author suggested that the databases for the system responses,
which need to be created beforehand, can be enriched by users’
inputs. The CSIEC system still had insurmountable
content shortcomings, and the project has been discontinued.</p>
        <sec id="sec-3-1-1">
          <title>System Overview</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>CoAPM</title>
      <p>Figure 1 outlines our first proposed method, the
Co-occurring Action Phrases-based Method (CoAPM).
CoAPM adopts the word associations utilized in
Modalin [Higuchi et al., 2008] on the hypothesis that
input-related utterances could maintain users’ interest in the
conversation.</p>
      <p>The present research applies this idea to English by
replacing the Web with the British National Corpus (BNC)<sup>2</sup>. The
BNC was chosen primarily because Web search engines
restrict the number of searches, and because the BNC (being
taken from trustworthy sources such as newspapers and books) is
expected to contain more correct English than Web-based
corpora. Therefore, the English in the BNC was deemed
suitable for educational purposes. Learners of English as a
second language, who will mainly use common English, need
not necessarily be familiar with native standard English,
especially with natural expressions that rarely appear in
textbooks. Nevertheless, resources with more input from
non-native contributors might contain dialects specific to particular
regions, which could baffle some learners, whereas the BNC
seems to maintain a more unified style with less potential for
confusion. Thus, we assume that a standard English corpus
such as the BNC is still useful for realizing a system that serves as a
widely acceptable English teacher.</p>
      <p><bold>Extracting Keywords and Word Associations.</bold>
In the first step, the method analyzes users’ utterances using
the Stanford Log-linear Part-Of-Speech Tagger (POS Tagger)
[Toutanova and Manning, 2000; Toutanova et al., 2003] to
spot query keywords for extracting word association lists.
As the query keywords, we selected nouns and verbs
(excluding some stop-words) because they constitute the core
semantic elements of English sentence structures and, to some
extent, describe the context of the utterances. This
focus also helps to reduce the cost of exact co-occurrence
matching when searching for words of interest. Nouns identified as
proper nouns by the POS Tagger are further analyzed using
the Stanford Named Entity Recognizer (NER) [Finkel et al.,
2005] and are assigned labels such as “PERSON”,
“LOCATION”, and “ORGANIZATION”. In the next step, the method
searches the BNC using these keywords (nouns or named
entities, verbs) as queries and extracts sentences containing
these words from the corpus. Nouns and verbs in the
extracted sentences are listed and sorted in frequency order as
word associations. This process is exemplified in Table 1.</p>
      <p><sup>2</sup>The British National Corpus, version 3 (BNC XML Edition),
2007. Distributed by Oxford University Computing Services on
behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk/</p>
      <p>Figure 1: Overview of the proposed Co-occurring Action
Phrases-based Method (CoAPM) and utterance generation
examples. [Figure labels: (buy, ticket); Association Words Extraction;
Candidate Words Generation.]</p>
      <p><bold>Generation of Word Candidates for Utterances.</bold>
Using the sorted lists of extracted nouns and verbs related
to the input keywords, the method generates a single
verb-noun pair from the most frequent words in the
lists. This verb-noun pair is a candidate for utterance
generation. To verify that the verb-noun
combination exists, the method then checks for co-occurrences of the
given pair in the BNC. That is, the method first selects the top
noun and top verb from the word associations, and then searches for
their co-occurrence within each sentence of the BNC using
exact matching. Even if the pair is found only once in the BNC, the
verb-noun combination is regarded as possible in English. If
the noun and verb are not found in the same sentence of the
corpus, the method tests another verb-noun pair (the second
most frequent verb and the top noun in the list). The method
repeats this process for up to the three most frequent verbs and
nouns, advancing to the next verb in stepwise fashion until a
proper combination is found. We prioritize nouns on
the assumption that nouns describe the context of an utterance
more specifically than verbs, which more often cause a topic
shift. However, this assumption must be confirmed
empirically in the future.</p>
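      <p>The pair selection described above can be sketched in a few lines of Python.
This is a minimal illustration under stated assumptions, not the implementation used in the study:
the function name, the toy corpus, and the representation of each sentence as a set of tokens are
hypothetical, and the order in which lower-ranked nouns are tried is our reading of the description.</p>
      <preformat>
```python
from typing import Optional, Sequence, Set, Tuple


def select_verb_noun_pair(
    verbs: Sequence[str],
    nouns: Sequence[str],
    sentences: Sequence[Set[str]],
    top_k: int = 3,
) -> Optional[Tuple[str, str]]:
    """Return the first (verb, noun) pair that co-occurs in some sentence.

    verbs and nouns are association lists sorted by descending frequency.
    Nouns are prioritized: for each noun, verbs are tried in rank order.
    """
    for noun in nouns[:top_k]:
        for verb in verbs[:top_k]:
            # A single co-occurring sentence is enough to accept the pair.
            if any(verb in s and noun in s for s in sentences):
                return verb, noun
    return None


# Toy stand-in for the corpus: each sentence is a set of exact word forms.
corpus = [
    {"i", "buy", "a", "ticket"},
    {"they", "sell", "books"},
]
pair = select_verb_noun_pair(["sell", "buy"], ["ticket", "books"], corpus)
print(pair)  # ('buy', 'ticket')
```
      </preformat>
      <p>Because matching is exact, inflected forms such as “bought” would not count as co-occurrences of “buy” in this sketch.</p>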
      <p><bold>Utterance Generation.</bold>
A CoAPM response is generated by applying the proposed
verb-noun pair, or one word of the pair, to a template. We prepared
the templates for utterances half-manually, based on the most
frequent sentences in English movie subtitles retrieved from
the OPUS corpus<sup>3</sup> [Tiedemann, 2012; Lison and Tiedemann,
2016]. The sentences were automatically abstracted using
POS tagging and NER, then ranked by frequency. Movie
subtitles were selected for their adequately large corpus size and
their potential suitability for conversational templates.
Examples of templates are shown in Figure 2. Using POS tag
analysis, the method selects the templates that fit the proposed
words or words in users’ input. It then randomly selects a
template and applies the previously chosen candidate words
or input words. To confirm the correctness of the expression
in an applied template, the method searches the corpus by exact matching
for the core phrase of the given template (such as “visit* Tokyo” for “Would you
like to visit Tokyo?”, where * is a wildcard for matching
various forms of a verb; in this case, visited, visits or visiting).
If more than five matches occur
in the BNC, the method outputs that template filled with
the retrieved words or input words. The threshold on the number of matches
was set experimentally, accounting for the processing time and
validity of the output. If no template satisfies the condition,
CoAPM tries another combination of candidate words.</p>
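      <p>The core-phrase check can be illustrated with a regular expression.
The following is a sketch under stated assumptions rather than the system’s actual code:
the function names are hypothetical, the wildcard is assumed to stand for zero or more letters,
and matches are counted per sentence, which the description does not specify.</p>
      <preformat>
```python
import re
from typing import Iterable


def count_core_phrase(core_phrase: str, sentences: Iterable[str]) -> int:
    """Count sentences containing the core phrase; '*' matches a verb suffix."""
    # Escape each piece literally, then let '*' stand for zero or more letters.
    pieces = [re.escape(piece) for piece in core_phrase.split("*")]
    pattern = re.compile(r"\b" + r"[a-z]*".join(pieces) + r"\b", re.IGNORECASE)
    return sum(1 for sentence in sentences if pattern.search(sentence))


def is_template_valid(core_phrase: str, sentences: Iterable[str], min_matches: int = 5) -> bool:
    """Accept the filled template only if more than min_matches sentences match."""
    return count_core_phrase(core_phrase, sentences) > min_matches


# Toy sentences standing in for the corpus.
sentences = [
    "She visited Tokyo last year.",
    "He visits Tokyo every spring.",
    "We are visiting Tokyo next week.",
    "I would like to see Kyoto.",
]
print(count_core_phrase("visit* Tokyo", sentences))  # 3
```
      </preformat>
      <p>With the default threshold of five, the toy data above would reject the template; a real corpus would supply far more sentences.</p>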
    </sec>
    <sec id="sec-5">
      <title>CiAPM and RAPM</title>
      <p>The BNC used in CoAPM contains formal and reliable
English, which could be suitable for learners of English.
However, the corpus covers few expressions related to recent events or
trends. In our next models, we relied on a more up-to-date
ontology, ConceptNet<sup>4</sup>, enabling responses to ongoing
topics. Based on the outcome and analysis of the first
method evaluation, which we describe in Section 3.3, we
developed two variations of our second method, named “Cited
Action Phrases-based Method (CiAPM)” and “Related
Action Phrases-based Method (RAPM)”. CiAPM uses the cited
phrases from user utterances without replacing the relevant
text. RAPM retrieves the input-related concepts using the
semantic network ConceptNet, which contains natural
language phrases. The method is outlined in Figure 3.</p>
      <p><bold>ConceptNet.</bold>
ConceptNet is a large-scale semantic network providing
general human knowledge [Speer and Havasi, 2012] expressed in
natural language. It includes words, common phrases and the
relations between them.</p>
      <p>In the course of our study for better system utterances, we
considered employing the sequence-to-sequence model introduced
in [Cho et al., 2014]. Inspired by [Vinyals and Le, 2015],
we tried to apply this model to build a conversational
system. However, it was difficult to find a training corpus of
sufficient size that was compatible with our objective:
conversational practice for language learning. Mainly because of this
difficulty, we abandoned this attempt after a few trials.</p>
      <sec id="sec-5-1">
        <title>3 http://opus.lingfil.uu.se/OpenSubtitles.php</title>
      </sec>
      <sec id="sec-5-2">
        <title>4 http://conceptnet.io/</title>
        <p>[Figure 3 content: the input “Getting good grades is hard for me.”
passes through Keyphrase Extraction (getting_grade) and Related Phrases
Extraction from ConceptNet (attending class [UsedFor], taking finals [Causes], ...);
Utterance Generation using Related Phrases then applies a template to produce
“What else can I use for getting good grades except attending class?”]</p>
        <p>At that time, the latest iteration of ConceptNet was
announced, which can be regarded as reliable, up-to-date and
one of the biggest freely available common sense
knowledge resources. Commonsensical utterances are known to be
a factor in enriching the naturalness of system responses;
consequently, they enhance users’ will to continue
conversations [Rzepka et al., 2005]. Therefore, we adopted
ConceptNet, which includes knowledge from ConceptNet 5<sup>5</sup> and
many different sources, in our methods.</p>
        <p><bold>Extracting the Key Phrase and Related Phrases.</bold>
CoAPM identifies single words, so it cannot handle idiomatic
phrasal expressions. CiAPM and RAPM, which detect
phrases consisting of a gerund and a noun, can handle
multiword expressions in a limited syntactic form, but they do not
cover inflections in the phrase or other syntactic forms. For
example, CiAPM and RAPM will detect “making a mistake”,
but ignore variations such as “made a mistake” or the phrasal
verb “break down”.</p>
        <p>In the first step, the method parses the input utterances
using the Stanford POS tagger to detect action phrases
consisting of the -ing (gerund) form of a verb and a noun. Articles
and adjectives between the verb and the noun are also
captured. This form of action phrase is selected as the key phrase
because it can play various grammatical roles in English
sentences without inflection and, to a certain degree, represents
the semantic essence of utterances. At this stage, we detect
only the action phrases that use the gerund, without
lemmatization, which helps to maintain grammatical
validity. However, a fully developed system should respond to
any utterance, which requires a more flexible method. If there are
two or more action phrases in the input, the method selects
the first phrase, based on the assumption that the first phrase
has priority over other action phrases in the utterance context
in English. The extracted phrase is transformed into a
query phrase for the ConceptNet API. Next, RAPM searches
ConceptNet using this key phrase as a query. Finally, the
method extracts the related action phrases from the results
in natural language form. The phrase-extraction process is</p>
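        <p>The detection of the first action phrase can be sketched as a scan over POS-tagged tokens.
This is an illustrative sketch only: the tagged input below is written by hand with Penn Treebank
tags instead of being produced by the Stanford POS tagger, the function name is hypothetical, and
restricting the final noun to the tags NN and NNS is an assumption.</p>
        <preformat>
```python
from typing import List, Optional, Sequence, Tuple

TaggedToken = Tuple[str, str]  # (word, Penn Treebank POS tag)


def first_action_phrase(tagged: Sequence[TaggedToken]) -> Optional[List[str]]:
    """Return the first 'V-ing (article/adjective)* noun' span, or None."""
    for start, (_, tag) in enumerate(tagged):
        if tag != "VBG":
            continue
        end = start + 1
        # Skip articles (DT) and adjectives (JJ, JJR, JJS) after the gerund.
        while end != len(tagged) and (
            tagged[end][1] == "DT" or tagged[end][1].startswith("JJ")
        ):
            end += 1
        # The span must close with a common noun (NN or NNS).
        if end != len(tagged) and tagged[end][1] in ("NN", "NNS"):
            return [word for word, _ in tagged[start:end + 1]]
    return None


# Hand-tagged example sentence: "Getting good grades is hard for me."
tagged_input = [
    ("Getting", "VBG"), ("good", "JJ"), ("grades", "NNS"), ("is", "VBZ"),
    ("hard", "JJ"), ("for", "IN"), ("me", "PRP"), (".", "."),
]
print(first_action_phrase(tagged_input))  # ['Getting', 'good', 'grades']
```
        </preformat>
        <p>As in the description above, an inflected variant such as “made a mistake” yields no phrase, because its verb is not tagged as a gerund.</p>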
      </sec>
      <sec id="sec-5-3">
        <title>5http://conceptnet5.media.mit.edu</title>
        <p>To generate responses from the proposed methods (RAPM,
CiAPM), a related phrase or a cited phrase from an input is
applied to a template. The related phrase and template are
selected randomly. The templates were manually prepared
based on the analysis of the first method (Section 3.3).</p>
        <p>The templates were divided into two types: templates for any
relation and templates for specific relations. Referring to the
statistics of common relations [Ferschke et al., 2013] in
ConceptNet 5, we selected 11 relations in ConceptNet, namely,
IsA, PartOf, RelatedTo, HasProperty, UsedFor, DerivedFrom,
Causes, CapableOf, MotivatedByGoal, HasSubevent, and
HasPrerequisite. CiAPM applies phrases to the former type of
templates, without using relations. In the template examples of
Figure 4, ‘V-ing N’ denotes an action phrase
comprising a verb in gerund form and a noun.</p>
        <p>Templates for any relation:
“Talking about [V-ing N (related phrase)]... What is your opinion on that topic?”
“Speaking of that, what do you think about [V-ing N (related phrase)]?”</p>
        <p>Templates for specific relations:
relation RelatedTo: “Often [V’-ing N’ (action phrase from input)] and [V-ing N (related phrase)] are a good combination. What do you think?”
relation HasProperty: “What about [V-ing N (related phrase)] while [V’-ing N’ (action phrase from input)]?”</p>
        <p>To improve the tutoring ability of our method, we aim to
detect spelling and grammatical mistakes in users’
utterances. We integrate LanguageTool, an open source writing
style (including spelling) and grammar checker, calling it as
a service via its HTTP API. Our method indicates errors in
English usage by presenting a candidate correction together with the
error description message returned by LanguageTool. The
correction candidate is taken from the top of the suggestions
list generated by LanguageTool and presented in “Recast” form, which was
preferred in the preliminary survey described in Section 3.1;
it is displayed before the method utterance.</p>
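        <p>How a recast could be built from a LanguageTool check result is sketched below.
The sketch assumes the JSON layout returned by the LanguageTool HTTP API (a list of matches, each
with a message, an offset, a length and a ranked list of replacements); the function name is
hypothetical, and the sample response is written by hand for illustration, not actual LanguageTool output.</p>
        <preformat>
```python
from typing import Any, Dict, List, Tuple


def recast(text: str, response: Dict[str, Any]) -> Tuple[str, List[str]]:
    """Apply the top suggestion of every match; return (recast, messages)."""
    corrected = text
    messages: List[str] = []
    # Apply edits from right to left so earlier offsets stay valid.
    matches = sorted(response.get("matches", []), key=lambda m: m["offset"], reverse=True)
    for match in matches:
        if not match.get("replacements"):
            continue  # nothing to suggest for this match
        start = match["offset"]
        end = start + match["length"]
        top = match["replacements"][0]["value"]
        corrected = corrected[:start] + top + corrected[end:]
        messages.insert(0, match["message"])
    return corrected, messages


# Hand-written sample in the assumed response layout.
text = "I goed to school."
sample_response = {
    "matches": [
        {
            "message": "Possible spelling mistake found.",
            "offset": 2,
            "length": 4,
            "replacements": [{"value": "went"}, {"value": "good"}],
        }
    ]
}
print(recast(text, sample_response))
```
        </preformat>
        <p>The returned sentence corresponds to the “Recast” shown to the learner, and the messages correspond to the accompanying error descriptions.</p>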
        <sec id="sec-5-3-1">
          <title>Experiments and Results</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Survey on Error Correction Methods</title>
      <p>Since we plan to equip our system with a function that
detects the mistakes in users’ utterances and conveys these
mistakes to the users in the dialogue, we conducted a
questionnaire survey on how people prefer to be corrected. Five
evaluators (four male students in their early 20s and one male in
his early 30s), selected from among the potential users of
an automated tutor, chose their preference as learners from
three error correction methods: “Explicit-correction”,
“Recast”, and “Prompt” (or “Elicitation”) (see Table 3). These
options were based on previous studies of error correction in
second language classrooms [Loewen, 2007; Tedick, 1986].
“Explicit-correction” refers to the direct indication and
correction of mistakes. “Recast” is the implicit reformulation of
errors into the correct form. “Prompt” induces self-correction
instead of providing the corrected form. Among many types of
error correction, these three methods were selected for their
efficiency and applicability to automatic dialogue generation
methods.</p>
      <p>This survey and the evaluation experiment of CoAPM in
Section 3.2 were conducted online in a bundle. The survey
presents participants with an erroneous utterance and its
corrections by each method.</p>
      <p>Most evaluators answered that either “Explicit-correction”
(40%) or “Recast” (40%) is preferable for learners, while the
remaining 20% supported “Prompt” (Table 3). According to
this result, “Explicit-correction” and “Recast” were
considered to be more suitable than “Prompt” for error correction
in utterances, although a broader survey is needed to reach a
more definite conclusion.</p>
      <p>The lower score for prompting might be related to the fact
that people are unwilling to keep others waiting and feel
embarrassed when they are not sure what the correct form is.
However, replacing a human teacher with a patient machine might
significantly alter these results. This possibility requires
evaluation in a future study.</p>
    </sec>
    <sec id="sec-7">
      <title>CoAPM Evaluation</title>
      <p>To see how learners react to generated utterances, we
compared CoAPM with ELIZA [Weizenbaum, 1966]. A
possible benchmark, CSIEC [Jia, 2009], mentioned in Section
1.2, utilizes the conversational history. Because we evaluated
only one-turn utterance exchanges this time, we instead used
ELIZA as a baseline, since it is independent of the preceding
conversation and its utterance rules are freely available.
We employed the Python implementation of ELIZA by Jez
Higgins<sup>6</sup>.</p>
      <p>As the user inputs, we used the utterances of English
learners in the NICT JLE (Japanese Learner English) Corpus<sup>7</sup>.
This corpus comprises transcriptions of English oral
proficiency interview tests for native Japanese speakers. The
utterances include errors in English, some of which are tagged.
Among the error-tagged data, we chose test takers’ utterances
including at least a verb and a noun that appear more than five
times in the BNC, and applied them as the input data. To
ensure that the utterances convey a rich meaning, the 10 most
frequent verbs in the BNC were excluded from this condition
because they include verbs with low semantic content, such as
auxiliary and delexical verbs, although a more principled
approach could be taken. Under
these restrictions, 19.6% of the examinees’ utterances were
retained as potential inputs. We used error-tagged utterances<sup>8</sup>
for the convenience of evaluation when introducing the error
suggestion function into our system.</p>
      <sec id="sec-7-1">
        <title>6 http://www.jezuk.co.uk/cgi-bin/view/software/eliza</title>
      </sec>
      <sec id="sec-7-2">
        <title>7 https://alaginrc.nict.go.jp/nict_jle/index_E.html</title>
        <p>We asked the five evaluators (described in Section 3.1) to
assess each of 20 utterance pairs (identical for all evaluators).
Evaluators were asked to rate the input and response
utterances generated by two methods in three categories:
“grammatical naturalness”, “semantic naturalness” and “motivation
to keep studying as a learner” on a 5-point scale (where 1
indicates unnatural language or lowest motivator of continued
study, and 5 denotes natural language or highest motivator of
continued study).</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Results and Analysis (CoAPM)</title>
      <p>Table 4 shows the average scores of all evaluators for each
criterion for both systems, rated on a 1-5 scale. The inter-rater
agreement of the five evaluators was 0.48 (Kendall’s
coefficient of concordance).</p>
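      <p>For readers unfamiliar with Kendall’s coefficient of concordance (W), the following sketch
shows one standard way to compute it from a table of ratings, using average ranks and the usual
correction for ties. It illustrates the statistic only; how the ratings were arranged to obtain the
value reported above is not specified, and the function names and toy ratings are hypothetical.</p>
      <preformat>
```python
from collections import Counter
from typing import List, Sequence


def average_ranks(scores: Sequence[float]) -> List[float]:
    """Rank scores from 1 to n, giving tied scores the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos != len(order):
        end = pos
        while end + 1 != len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def kendalls_w(ratings: Sequence[Sequence[float]]) -> float:
    """Kendall's W with tie correction; ratings[j][i] is rater j's score for item i.

    Undefined (division by zero) if every rater gives all items the same score.
    """
    m, n = len(ratings), len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    totals = [sum(r[i] for r in ranked) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    ties = sum(c ** 3 - c for row in ratings for c in Counter(row).values())
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)


# Toy ratings: two raters who order three items in exactly opposite ways.
print(kendalls_w([[1, 2, 3], [3, 2, 1]]))  # 0.0
```
      </preformat>
      <p>A value of 1 indicates that all raters rank the items identically, and a value of 0 indicates no agreement in the rank totals.</p>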
      <p>On average, the preliminary version of our proposed
method (CoAPM) scored slightly lower than ELIZA,
although there was no statistically significant difference (p &gt;
0.05) between CoAPM and ELIZA in any of the three evaluation
criteria (Mann-Whitney U-test, p = 0.42 for grammar, p = 0.29
for semantics, p = 0.21 for motivation).</p>
      <p>Figure 5 shows how CoAPM and ELIZA responded to
several input utterances. The lower average scores of CoAPM
compared with ELIZA (especially for grammatical and semantic
naturalness) were mainly caused by insufficient utterance
templates and incorrect POS analysis.</p>
      <p><sup>8</sup>In the evaluations, we used the original utterances without
error corrections as inputs, so the examples may contain erroneous
expressions.</p>
      <p><sup>9</sup>The number of respondents is shown in brackets.</p>
      <p>Among more than 100 types of templates, the POS
restrictions admitted only six templates for 20 utterances of
CoAPM.</p>
      <p>In addition, we presumed that in second-language
acquisition, the questioning or confirming style of ELIZA
frequently surpassed the association-based strategy of CoAPM,
although people preferred Modalin [Higuchi et al., 2008] over
ELIZA during normal chatting with no educational
purpose. This implies that follow-up questions are often more
important than input-related statements in language tutoring
tasks. Because we evaluated each turn (each utterance pair)
separately, we set all templates as interrogatives here.
However, a deployed system should acknowledge as well as
question a user’s utterance.</p>
    </sec>
    <sec id="sec-9">
      <title>CiAPM and RAPM Evaluation</title>
      <p>The following five systems were experimentally evaluated:
Baselines
(I) ELIZA
(II) ALICEBOT
Proposed methods
(IV) CiAPM
(V) RAPM-NOREL (not using relations)
(VI) RAPM-REL (using relations)</p>
      <p>We used the same implementation of ELIZA as in
the previous experiment (Section 3.2). To hold a conversation,
ALICEBOT needs an AIML (Artificial Intelligence Markup
Language) set, which contains the contents of the ALICE
brain written in AIML. Therefore, we adopted the
standard free AIML set, “AIML-en-us-foundation-ALICE”<sup>10</sup>. By
comparing with ELIZA and ALICE, we expected to observe
whether chatting with simple dialogue systems is intrinsically
effective for educational purposes. Although this is
debatable, we believe that ELIZA and ALICE have remained
well known and widely cited among conversational systems because
other rule-based dialogue systems process the scarce context in
a similar way, or are unavailable because they are commercial
or their specifications are undisclosed.</p>
      <p>We created two versions of RAPM, which generate
utterances from different templates. Specifically, RAPM-NOREL
employs templates for any relation, while RAPM-REL
utilizes templates for specific relations.</p>
      <p><sup>10</sup>https://code.google.com/archive/p/aiml-en-us-foundation-alice/</p>
      <p>The user inputs were sentences from English learners’
utterances extracted from the NICT JLE Corpus described in
Section 3.2. Because some utterances were long and contained
many sentences, we used individual sentences here. From test takers’
utterances, we selected sentences including at least one
action phrase comprising a verb in gerund form and a noun.
This condition was set on the assumption that action phrases
carry richer context in sentences and facilitate the generation
of grammatically correct utterances. Under this condition,
6.12% of the examinees’ original sentences were retained as
potential user inputs.</p>
      <p>The evaluators were six male Japanese university students
majoring in science (three undergraduate and three graduate students
in their 20s), who were potential targets of a full-fledged
tutoring system. The subjects were intermediate English
learners with basic knowledge of English grammar and
vocabulary, but with low proficiency, especially in speaking English.</p>
      <p>The six evaluators assessed the utterances generated by all
five systems in response to each of 10 inputs chosen
randomly from the utterances of test takers (the examinees’
utterances were originally separated from the interviewer’s
utterances in the corpus). That is, each evaluator was given
the same 50 utterances from the systems. The participants
received the pairs of utterances in a specific order. In contrast,
in the former CoAPM evaluation (Section 3.2), the utterance
pairs were presented in mixed order.</p>
      <p>The system utterances were rated on a 5-point scale (where
1 means ‘poor’ and 5 represents ‘excellent’) in the following
six categories.
(A) “Will to continue the conversation”
(B) “Semantical naturalness of dialogue”
(C) “Appropriateness in English conversation practice”
(D) “Vocabulary richness”
(E) “Knowledge richness”
(F) “Human-likeness of the system”</p>
      <p>These evaluation criteria were based on the benchmark
used in a related work [Higuchi et al., 2008]. However, by
focusing on the action phrases, the proposed methods are
supposed to ensure a degree of grammatical naturalness in the
utterances. Therefore, the original criterion “grammatical
naturalness of dialogues” was changed to “appropriateness in
English conversation practice”, which is considered to be more
important for evaluating English-teaching dialogue systems.</p>
      <p>In the “vocabulary richness” evaluation, we expected
subjects to rate utterances on a scale from “laconic” to “wordy”.
Some of these criteria could be evaluated by specialists
familiar with English education, or at least by native English
speakers. However, at this stage of our project, we focus
on the user experience of learners who are easily bored with
learning. Therefore we set the criteria in terms of the user
experience, expecting evaluation from the learners’ standpoint.
In the questionnaire, the criteria (without specific
descriptions) were presented to the evaluators in the Japanese
language.</p>
      <p>Example responses of each system to one input:
Input: “And they enjoyed eating delicious food and alcohol.”
ELIZA: “How does that make you feel?”
ALICEBOT: “Who, specifically?”
CiAPM: “Let’s talk about eating delicious food. What do you think about it?”
RAPM-NONREL: “Talking about diminishing your own hunger... What is your opinion on that topic?”
RAPM-REL: “I guess you were diminishing your own hunger - can you tell me how you did that?”</p>
    </sec>
    <sec id="sec-10">
      <title>Results and Analysis (CiAPM / RAPM)</title>
      <p>Table 5 shows the average scores and standard deviations of
all evaluators for each criterion for the five systems (rated from
1 to 5). Kendall’s coefficient of concordance among the
six raters was 0.369. One of the proposed methods,
RAPM-NONREL, with templates not using relations, scored highest
in “vocabulary richness” (criterion D) and scored
second-highest in the other criteria. In all criteria except vocabulary
richness, CiAPM achieved the highest score. The other proposed
method, RAPM-REL, also achieved a high average score in
vocabulary richness. According to the Steel-Dwass test
(evaluated by the asymptotic method), the “vocabulary richness
(D)” scores of our three methods differed significantly from
the ELIZA score (p &lt; 0.05), but no statistically significant
differences were observed in the other criteria. Figure 6 shows
some responses of each method to different input utterances.</p>
      <p>The result suggests that the input-related phrases from
ConceptNet are useful to expand the vocabulary of the
system, and hopefully that of interacting users. For instance, the
input “And a woman is playing piano.” elicited the responses
“Learning something about music... why some people love it
and some don’t?” (RAPM-NONREL) and “Let’s talk about
playing piano. What do you think about it?” (CiAPM). The
retrieved phrase ‘learning something about music’, which is
related to the input phrase ‘playing piano’, appears to
enrich the vocabulary beyond merely repeating the input phrase.
For RAPM-NONREL and CiAPM, the criterion “vocabulary
richness” was rated 4 by 6/6 and 2/6 evaluators, respectively.
This example indicates the potential usefulness of expanding
the variety of expressions with phrases including hypernyms
or hyponyms, based on the relations in ConceptNet.</p>
      <p>However, when the action phrases from a user input are
inserted into the system output, the utterances may sound more
natural, as demonstrated in the following example. The
input “And they enjoyed eating delicious food and alcohol.”
produced the outputs “Let’s talk about eating delicious food.
What do you think about it?” (CiAPM) and “Talking about
diminishing your own hunger... What is your opinion on that
topic?” (RAPM-NONREL). In this case, a discussion about
human needs, prompted by relating ‘diminishing
your own hunger’ to ‘eating delicious food’, could be a good
topic for a deeper conversation. However, the preferred
conversational topic depends on the user’s
interests and English level. We consider this the reason why the
repeating method (CiAPM) scored above the other
methods on average in all criteria except vocabulary richness.</p>
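      <p>The contrast between repeating the user’s phrase and substituting a related one can be sketched in a few lines of Python. The two templates are copied from the example outputs quoted above; everything else (the function names, the dictionary standing in for a ConceptNet lookup, and the fallback to repetition when no related phrase is known) is a hypothetical simplification, not the authors’ implementation.</p>
      <preformat><![CDATA[
```python
# Templates copied from the example outputs in this section; {phrase} is the slot.
CIAPM_TEMPLATE = "Let's talk about {phrase}. What do you think about it?"
RAPM_NONREL_TEMPLATE = "Talking about {phrase}... What is your opinion on that topic?"

# Hypothetical stand-in for a ConceptNet lookup. The real system queries
# ConceptNet; this dictionary only mirrors the single example in the text.
RELATED_PHRASES = {
    "eating delicious food": ["diminishing your own hunger"],
}


def ciapm_response(action_phrase):
    """Repeat the user's own action phrase inside a fixed question frame."""
    return CIAPM_TEMPLATE.format(phrase=action_phrase)


def rapm_nonrel_response(action_phrase):
    """Insert a related phrase if one is known; otherwise repeat the input.

    The fallback to repetition is an assumption made for this sketch.
    """
    related = RELATED_PHRASES.get(action_phrase)
    if not related:
        return ciapm_response(action_phrase)
    return RAPM_NONREL_TEMPLATE.format(phrase=related[0])


print(ciapm_response("eating delicious food"))
print(rapm_nonrel_response("eating delicious food"))
```
]]></preformat>
      <p>Run on the action phrase ‘eating delicious food’, the two functions reproduce the CiAPM and RAPM-NONREL outputs quoted in the example, which makes the trade-off visible: the first stays on the user’s wording, the second introduces new vocabulary at the risk of drifting from the topic.</p>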
      <p>We presume that the related concepts in ConceptNet are
not always compatible with the dialogue context. In such
cases, the responses are unsuited to the user’s need. This
could be partly attributable to random selection of the related
concepts. To avoid wandering away from the subject of the
conversation, the related phrases must be carefully chosen to
suit the context and the individual user, especially when
applying phrases with their relations. In future work, the
random selection must be replaced by a context processing
module, a user profiler, and a language level estimator. A context
processing module could select proper phrases by semantic
analysis. Considering the ambiguity of multi-word
expressions, detecting phrases after applying a topic model such
as latent Dirichlet allocation might be useful for this purpose.
In addition, complete reliance on ConceptNet, which lacks
knowledge of some items and includes dubious entries, is also
problematic.</p>
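      <p>One simple way to replace random selection with a context-sensitive choice is sketched below. It ranks candidate phrases by word overlap (Jaccard similarity) with the recent dialogue, which is a deliberately crude stand-in for the semantic analysis or topic modeling proposed above. The function names, the threshold value, and the candidate list are assumptions made for illustration only.</p>
      <preformat><![CDATA[
```python
import re


def tokens(text):
    """Lowercased word set; a crude proxy for real semantic analysis."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_related_phrase(candidates, context_utterances, min_score=0.05):
    """Return the candidate sharing the most vocabulary with the context.

    Returns None when no candidate reaches min_score (an arbitrary,
    illustrative threshold), so the caller can fall back to repeating
    the user's own phrase instead of wandering off topic.
    """
    context = set()
    for utterance in context_utterances:
        context |= tokens(utterance)
    best, best_score = None, 0.0
    for phrase in candidates:
        score = jaccard(tokens(phrase), context)
        if score > best_score:
            best, best_score = phrase, score
    return best if best_score >= min_score else None


# Made-up candidates; only the first appears in the paper's examples.
candidates = ["learning something about music", "moving furniture"]
context = ["I love music.", "And a woman is playing piano."]
print(select_related_phrase(candidates, context))
```
]]></preformat>
      <p>Returning no phrase at all when nothing fits the context gives the tutor a natural point at which to fall back on repetition, which the evaluators in this study tended to prefer.</p>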
      <p>Erroneous inputs were not corrected because the open source
checker found no mistakes in them. A more
powerful error correction approach may therefore be required. For error detection and
correction suggestions, a promising solution is a
Grammatical Error Correction (GEC) system based on the Neural
Machine Translation (NMT) approach [Yuan, 2017]. In the
experiments of [Yuan, 2017], the NMT-based GEC system
outperformed the SMT (Statistical Machine Translation)-based
system even on the difficult subject-verb agreement problem.</p>
      <p>From these results we can assume that repetition for
confirmation plays an important part in conversation practice by
Japanese learners of English. However, to assess whether
this observation extends to other cultural backgrounds and
individuals, broader experiments with more evaluators are
needed. Furthermore, the evaluated conversations were very
short, limited to one-turn dialogue (a user’s utterance and
the corresponding system utterances). Whether the proposed
methods maintain users’ interest in an actual conversation
cannot be known at this stage. For this purpose, we must
evaluate a fully developed system on multiple turns of free
conversation. In long conversations for second language
acquisition, a system that generates only repetitive utterances
would bore users. The wide vocabulary of RAPM, providing
related topics to user utterances, could potentially mitigate
conversational deadlocks. Thus, combining the two methods
(one with repetitive utterances, the other using related
topics) might be more efficient for language tutoring tasks.</p>
      <sec id="sec-10-1">
        <title>Conclusion and Future Work</title>
        <p>We proposed methods that automatically generate utterances
for an English language tutor, and compared their
performances with those of classic chatbots. Specifically, we
evaluated how the generated expressions were received by
Japanese subjects. Although our small-scale experiment does
not yet allow drawing any conclusions about the stickiness level
of these approaches, we found that ELIZA-like outputs
offer more encouragement to users than Web- or
commonsense-based approaches. This finding contrasts with the
results of [Rzepka et al., 2005], who evaluated non-learning
dialogues. In enriching the vocabulary of the system
utterances, the proposed methods showed their superiority,
which could potentially help improve users’ command
of a foreign language.</p>
        <p>However, using external corpora or crowd-sourced
knowledge sources might incur serious drawbacks. Allowing the
tutor excessive freedom, especially in learning material beyond
the preferences of the user, risks misuse, as has occurred with
Microsoft’s Tay and other chatbots [Michael, 2016]. In our
approach, the adoption of hand-crafted syntactic rules seems to be
the only restriction, but the majority voting in both the
British National Corpus- and ConceptNet-based methods
indirectly helps to avoid semantic strangeness. This does not
mean that corpora guarantee safe communication, and some
topic restrictions might be needed from the outset. However,
blocking slang and offensive words completely can be
problematic, especially when considering more sophisticated
personality modeling, which is required in longer-term
conversational sessions.</p>
        <p>As the next step, we plan to combine our method with
language level estimation and algorithms that support vocabulary
acquisition [Mazur, 2016]. Error correction could be
improved using annotated data (http://www.gsk.or.jp/catalog/gsk2016-b/), taking into account that
Japanese students often make non-word spelling errors
(producing spellings that do not exist) [Nagata and Neubig, 2017].
Although our dialogue system is not yet ready for long-run
conversational sessions, we should experiment on the tutor’s
autonomy level in choosing topics related to the user’s input prior
to larger-scale testing. We plan to analyze which outputs
are potentially harmful, and to determine appropriate
countermeasures against these expressions.</p>
      </sec>
      <sec id="sec-10-2">
        <title>Acknowledgements</title>
        <p>The authors would like to thank DENSO CORPORATION
for funding this work. We are grateful to anonymous
reviewers for their detailed comments and helpful suggestions.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Chen,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Yi-Cheng</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>An empirical examination of factors affecting college students' proactive stickiness with a web-based English learning environment</article-title>
          .
          <source>Computers in Human Behavior</source>
          ,
          <volume>31</volume>
          :
          <fpage>159</fpage>
          -
          <lpage>171</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Cho et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation</article-title>
          .
          <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>1724</fpage>
          -
          <lpage>1734</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Doyon,
          <year>2000</year>
          ]
          <string-name>
            <given-names>Paul</given-names>
            <surname>Doyon</surname>
          </string-name>
          .
          <article-title>Shyness in the Japanese EFL class: Why it is a problem, what it is, what causes it, and what to do about it</article-title>
          .
          <source>The Language Teacher</source>
          ,
          <volume>24</volume>
          (
          <issue>1</issue>
          ):
          <fpage>11</fpage>
          -
          <lpage>16</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Ferschke et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Oliver</given-names>
            <surname>Ferschke</surname>
          </string-name>
          , Johannes Daxenberger, and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <article-title>The People's Web Meets NLP</article-title>
          .
          <source>In Theory and Applications of Natural Language Processing</source>
          , pages
          <fpage>121</fpage>
          -
          <lpage>160</lpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Finkel et al.,
          <year>2005</year>
          ]
          <string-name>
            <given-names>Jenny Rose</given-names>
            <surname>Finkel</surname>
          </string-name>
          , Trond Grenager, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Incorporating non-local information into information extraction systems by Gibbs sampling</article-title>
          .
          <source>In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics - ACL '05</source>
          , pages
          <fpage>363</fpage>
          -
          <lpage>370</lpage>
          , Morristown, NJ, USA,
          <year>2005</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Higuchi et al.,
          <year>2008</year>
          ]
          <string-name>
            <given-names>Shinsuke</given-names>
            <surname>Higuchi</surname>
          </string-name>
          , Rafal Rzepka, and
          <string-name>
            <given-names>Kenji</given-names>
            <surname>Araki</surname>
          </string-name>
          .
          <article-title>A casual conversation system using modality and word associations retrieved from the Web</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08</source>
          , pages
          <fpage>382</fpage>
          -
          <lpage>390</lpage>
          , Stroudsburg, PA, USA,
          <year>2008</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Jia,
          <year>2009</year>
          ]
          <string-name>
            <given-names>Jiyou</given-names>
            <surname>Jia</surname>
          </string-name>
          .
          <article-title>CSIEC: A computer assisted English learning chatbot based on textual knowledge and reasoning</article-title>
          .
          <source>Knowledge-Based Systems</source>
          ,
          <volume>22</volume>
          (
          <issue>4</issue>
          ):
          <fpage>249</fpage>
          -
          <lpage>255</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Lee,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Learning from Tay's introduction</article-title>
          . https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/,
          <year>2016</year>
          . (accessed May 8, 2017).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Lison and Tiedemann,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Lison</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jörg</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          .
          <article-title>OpenSubtitles2016: extracting large parallel corpora from movie and TV subtitles</article-title>
          .
          <source>In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Loewen,
          <year>2007</year>
          ]
          <string-name>
            <given-names>Shawn</given-names>
            <surname>Loewen</surname>
          </string-name>
          .
          <article-title>Error correction in the second language classroom</article-title>
          .
          <source>Clear News</source>
          ,
          <volume>11</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Mazur,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Michal</given-names>
            <surname>Mazur</surname>
          </string-name>
          .
          <article-title>A Study on English Language Tutoring System Using Code-Switching Based Second Language Vocabulary Acquisition Method</article-title>
          .
          <source>PhD thesis</source>
          , Hokkaido University,
          <year>2016</year>
          . https://eprints.lib.hokudai.ac.jp/dspace/handle/2115/61833.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Michael,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Katina</given-names>
            <surname>Michael</surname>
          </string-name>
          .
          <article-title>Science fiction is full of bots that hurt people:... but these bots are here now</article-title>
          .
          <source>IEEE Consumer Electronics Magazine</source>
          ,
          <volume>5</volume>
          (
          <issue>4</issue>
          ):
          <fpage>112</fpage>
          -
          <lpage>117</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Nagata and Neubig,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Ryo</given-names>
            <surname>Nagata</surname>
          </string-name>
          and
          <string-name>
            <given-names>Graham</given-names>
            <surname>Neubig</surname>
          </string-name>
          .
          <article-title>Construction of Japanese EFL learner corpus for a study of spelling mistakes (in Japanese)</article-title>
          .
          <source>In Proceedings of the Twenty-third Annual Meeting of the Association for Natural Language Processing</source>
          , pages
          <fpage>1030</fpage>
          -
          <lpage>1033</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Rzepka et al.,
          <year>2005</year>
          ]
          <string-name>
            <given-names>Rafal</given-names>
            <surname>Rzepka</surname>
          </string-name>
          , Yali Ge, and
          <string-name>
            <given-names>Kenji</given-names>
            <surname>Araki</surname>
          </string-name>
          .
          <article-title>Naturalness of an utterance based on the automatically retrieved commonsense</article-title>
          .
          <source>In Proceedings of IJCAI 2005 - Nineteenth International Joint Conference on Artificial Intelligence</source>
          , Edinburgh, Scotland, pages
          <fpage>996</fpage>
          -
          <lpage>998</lpage>
          ,
          August <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Speer and Havasi,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Speer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Catherine</given-names>
            <surname>Havasi</surname>
          </string-name>
          .
          <article-title>Representing general relational knowledge in ConceptNet 5</article-title>
          .
          <source>In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)</source>
          , Istanbul, Turkey, May
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Tedick,
          <year>1986</year>
          ]
          <string-name>
            <given-names>Diane J.</given-names>
            <surname>Tedick</surname>
          </string-name>
          .
          <article-title>Research on error correction and implications for classroom</article-title>
          .
          <source>ACIE Newsletter</source>
          ,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Tiedemann,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Jörg</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          .
          <article-title>Parallel data, tools and interfaces in OPUS</article-title>
          .
          <source>In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'2012)</source>
          , pages
          <fpage>2214</fpage>
          -
          <lpage>2218</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Toutanova and Manning,
          <year>2000</year>
          ]
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Enriching the knowledge sources used in a maximum entropy part-of-speech tagger</article-title>
          .
          <source>In Proceedings of the 2000 Joint SIGDAT conference on Empirical Methods in Natural Language Processing and very large corpora held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics</source>
          , volume
          <volume>13</volume>
          , pages
          <fpage>63</fpage>
          -
          <lpage>70</lpage>
          , Morristown, NJ, USA,
          <year>2000</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Toutanova et al.,
          <year>2003</year>
          ]
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Dan Klein,
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yoram</given-names>
            <surname>Singer</surname>
          </string-name>
          .
          <article-title>Feature-rich part-of-speech tagging with a cyclic dependency network</article-title>
          .
          <source>In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - NAACL '03</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>173</fpage>
          -
          <lpage>180</lpage>
          , Morristown, NJ, USA,
          <year>2003</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Vinyals and Le,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          and
          <string-name>
            <given-names>Quoc</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <article-title>A Neural Conversational Model</article-title>
          .
          <source>arXiv preprint arXiv:1506.05869</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [Weizenbaum,
          <year>1966</year>
          ]
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Weizenbaum</surname>
          </string-name>
          .
          <article-title>ELIZA-a computer program for the study of natural language communication between man and machine</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>36</fpage>
          -
          <lpage>45</lpage>
          ,
          <year>1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [Yuan,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Zheng</given-names>
            <surname>Yuan</surname>
          </string-name>
          .
          <article-title>Grammatical error correction in non-native English</article-title>
          .
          <source>Technical report</source>
          , University of Cambridge, Computer Laboratory,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>