Be More Eloquent, Professor ELIZA – Comparison of Utterance Generation Methods for an Artificial Second Language Tutor

Taku Nakamura (1), Rafal Rzepka (2), Kenji Araki (2), and Kentaro Inui (1)
(1) Graduate School of Information Science, Tohoku University — {tnakamura, inui}@ecei.tohoku.ac.jp
(2) Graduate School of Information Science and Technology, Hokkaido University — {rzepka, araki}@ist.hokudai.ac.jp

Abstract

This paper presents utterance generation methods for artificial foreign language tutors and discusses some problems of more autonomous educational tools. To tackle the problem of keeping learners interested, we propose a hybrid approach, half automatic (for semantics) and half rule-based (for syntax), that utilizes topic expansion by retrieving conversational subjects related to users' utterances. We compared the utterances generated by our methods with those of other dialogue systems. The evaluation results show that topic expansion enriches the vocabulary of the utterances. On the other hand, ELIZA-like confirmations and follow-ups were preferred by Japanese subjects when the goal was practicing conversational English. Although our project is in its initial stage, we have decided to share our findings and thoughts on autonomy resulting from various trials, and thereby spark a discussion on the pros and cons of the next generation of teaching applications.

1 Introduction

Applications supporting second language acquisition have evolved from simple flashcards for memorizing words to more sophisticated tools using gamification, voice analysis, etc. Computer applications and socializing online help to improve stickiness [Chen, 2014], the lack of which is often one of the biggest obstacles to mastering a given subject. This problem is visible in software solutions that demand no involvement from tutors and other peers (which is the majority of self-study mobile applications), but software-led teaching is preferable in scenarios where learners wish to improve their skills without feeling ashamed. Our task, helping Japanese speakers practice their communication skills in English, is an example of such a scenario. Japanese students are not eager to use the language in everyday life for social and cultural reasons [Doyon, 2000], although they are often interested in foreign languages and possess wide knowledge of grammar and vocabulary. An artificial tutor is one possible solution, and we decided to start a project aimed at creating a chat system that could be not only a conversational partner, but also a second language acquisition supporter that learns user preferences (from language level to hobbies).

However, there are difficult problems to be solved. First, the system must be linguistically correct. Second, the autonomy level of the software is important when external corpora are used as its world knowledge. Presumably, conversation controlled by artificial templates must be balanced with the learning of a user's preferences and topic retrieval from big textual data, which is more interesting but can be dangerous when left completely uncontrolled (as in the case of Microsoft's Tay bot [Lee, 2016]).

The present paper introduces our prototype methods, which focus on providing responses affected by learners' utterances, based on rules and comparatively reliable knowledge resources. It does not necessarily extend the state-of-the-art techniques in the language generation domain per se, but we believe it is more efficient for this specific educational purpose. We compare various utterance generation methods, present experimental results, and discuss other findings, including user preferences for error corrections. The paper concludes with ideas for measures that could be implemented to maintain a balance between interesting and potentially dangerous Web-based tutors.

1.1 Traditional vs. Web-based Dialogue Systems

Well-known chatbots are ELIZA [Weizenbaum, 1966] and ALICEBOT (http://www.alicebot.org). ELIZA can respond to any input, but never provides new topics related to the user's utterances. ALICEBOT responds based on manually created databases. Although creating or extending such databases expands the conversational topics, it is costly and nearly impossible to build a database that covers many fields and a broad range of users' interests.

Modalin [Higuchi et al., 2008] is a Japanese text-based dialogue system that uses word associations retrieved from the Web and randomly adds modality to generated utterances. To sustain motivated conversation with users, Modalin generates input-related utterances using word associations. Presuming that a similar approach could enhance the conversation opportunities for English learners, we adopt the idea of word associations in our proposed system.

1.2 System for Language Learning

Jia [Jia, 2009] developed the CSIEC (Computer Simulation in Educational Communication) system with multiple functions for English learning, including a chatbot as a conversational partner. The CSIEC system has a free conversation function based on textual knowledge and reasoning, aiming to overcome the problem of ELIZA-like systems, which require numerous predefined patterns fitted to the various utterances of users. The author suggested that databases for the system responses
can be enriched by users' inputs, although such databases need to be created beforehand. The CSIEC system still had insurmountable content shortcomings, and the project has been discontinued.

2 System Overview

2.1 CoAPM

Figure 1 outlines our first proposed method, the Co-occurring Action Phrases-based Method (CoAPM). CoAPM adopts the word associations utilized in Modalin [Higuchi et al., 2008], on the hypothesis that input-related utterances could maintain users' interest in the conversation.

Figure 1: Overview of the proposed Co-occurring Action Phrases-based Method (CoAPM) and utterance generation examples. [The diagram shows the pipeline for the input "I will buy a ticket.": keyword extraction ("buy", "ticket"), association word extraction from the British National Corpus, candidate word generation (e.g., "go", "trains"), and utterance generation with templates based on the movie subtitles corpus, producing "Are we talking about trains?"]

The present research applies this idea to English by replacing the Web with the British National Corpus (BNC; version 3, BNC XML Edition, 2007, distributed by Oxford University Computing Services on behalf of the BNC Consortium, http://www.natcorp.ox.ac.uk/). The BNC was chosen primarily because Web search engines restrict the number of searches, and because the BNC (being taken from trusted sources such as newspapers and books) is expected to contain more correct English than other Web-based corpora. Therefore, the English in the BNC was deemed suitable for educational purposes. Learners of English as a second language, who will mainly use common English, need not necessarily be familiar with native standard English, especially with natural expressions that rarely appear in textbooks. Nevertheless, resources with more input from non-native contributors might contain dialects proper to specific regions, which could baffle some learners, whereas the BNC maintains a more unified style with less potential for confusion. Thus, we assume that a standard English corpus such as the BNC is still useful for realizing a system acting as a widely acceptable English teacher.

Extracting Keywords and Word Associations

In the first step, the method analyzes users' utterances with the Stanford Log-linear Part-Of-Speech Tagger (POS Tagger) [Toutanova and Manning, 2000; Toutanova et al., 2003] to spot query keywords for extracting word association lists. As the query keywords, we selected nouns and verbs (excluding some stop words) because they constitute the core semantic elements of English sentence structures and, to some extent, describe the context of the utterances. This concentration also helps to reduce the cost of exact co-occurrence matching when searching for words of interest. Nouns identified as proper nouns by the POS Tagger are further analyzed with the Stanford Named Entity Recognizer (NER) [Finkel et al., 2005] and are assigned labels such as "PERSON", "LOCATION", and "ORGANIZATION". In the next step, the method searches the BNC using these keywords (nouns or named entities, and verbs) as queries and extracts the sentences containing them. Nouns and verbs in the extracted sentences are listed and sorted in frequency order as word associations. This process is exemplified in Table 1.

Table 1: Keywords and association words extracted from the user utterance "I drink a glass of water in the morning."
  keywords:           'drink', 'glass', 'water', 'morning'
  association verbs:  'rising', 'braised', 'cooked', 'fried', 'chopped'
  association nouns:  'fruit', 'juice', 'glasses', 'piece', 'salad'

Generation of Word Candidates for Utterances

Using the sorted lists of extracted nouns and verbs related to the input keywords, the method generates a single verb and noun pair from the most frequent words in the lists. This verb-noun pair can be a candidate for utterance generation. To verify the existence of the verb-noun combination, the method then checks for co-occurrences of the given pair in the BNC. That is, the method first selects the top noun and the top verb from the word associations, and then searches for their co-occurrence within single sentences of the BNC using exact matching. Even if only one pair is found in the BNC, the verb-noun combination is regarded as possible in English. If the noun and verb are not found in the same sentence of the corpus, the method tests another verb-noun pair (the second most frequent verb and the top noun in the list). The method repeats this process up to the three most frequent verbs and nouns, advancing to the next verb in stepwise fashion until a proper combination is found. We prioritize nouns because of the assumption that nouns describe the context of an utterance more specifically than verbs, which influence a topic shift more often. However, this assumption must be confirmed empirically in the future.
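The stepwise pair selection described above can be sketched as follows. This is a minimal sketch under our own naming (`choose_pair`, `make_cooccurs` are illustrative); the in-memory sentence scan stands in for the exact-match search over the full BNC index.

```python
from itertools import product

def choose_pair(assoc_nouns, assoc_verbs, cooccurs):
    """Pick the first verb-noun pair that co-occurs in at least one
    corpus sentence. Nouns are prioritized: for each of the top three
    nouns, the top three verbs are tried in stepwise fashion."""
    for noun, verb in product(assoc_nouns[:3], assoc_verbs[:3]):
        if cooccurs(verb, noun):  # at least one sentence contains both
            return verb, noun
    return None  # no acceptable combination among the top candidates

def make_cooccurs(sentences):
    """Naive stand-in for the BNC co-occurrence check: scan tokenized
    sentences for both words."""
    def cooccurs(verb, noun):
        return any(verb in s and noun in s for s in sentences)
    return cooccurs
```

The association lists are assumed to be already sorted by corpus frequency, so the first successful pair is also the most frequent plausible one.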
Utterance Generation

A CoAPM response is generated by applying the proposed verb-noun pair, or one of its members, to a template. We prepared the utterance templates half-manually, based on the most frequent sentences in English movie subtitles retrieved from the OPUS corpus (http://opus.lingfil.uu.se/OpenSubtitles.php) [Tiedemann, 2012; Lison and Tiedemann, 2016]. The sentences were automatically abstracted using POS tagging and NER, then ranked by frequency. Movie subtitles were selected for their adequately large corpus size and their potential suitability for conversational templates. Examples of the templates are shown in Figure 2. Using POS tag analysis, the method selects the templates that fit the proposed words or words in the user's input. It then randomly selects a template and applies the previously chosen candidate words or input words. To confirm the correctness of the expression in an applied template, the method searches for the core phrase of the given template (such as "visit* Tokyo" for "Would you like to visit Tokyo?", where * is a wildcard matching various forms of a verb; in this case visited, visits, or visiting) in the corpus by exact matching. If more than five matches occur in the BNC, the method outputs that template filled with the retrieved words or input words. The number of matches was set experimentally, accounting for the processing time and the validity of the output. If no template satisfies the condition, CoAPM tries another combination of candidate words.

Figure 2: Examples of CoAPM templates
  Speaking of (noun from user utterance), do you (retrieved verb)?
  Would you like to visit LOCATION?
  What do you think about (retrieved noun)s?
  Everybody (retrieved verb), right?
  Does (noun from user utterance) belong to ORGANIZATION?

2.2 CiAPM and RAPM

The BNC used in CoAPM contains formal and reliable English, which could be suitable for learners of English. However, the corpus covers few expressions of the latest events or trends. In our next models, we relied on a more up-to-date resource, the ConceptNet ontology (http://conceptnet.io/), enabling responses to ongoing topics. Based on the outcome and analysis of the first method's evaluation, which we describe in Section 3.3, we developed two variations of our second method, named the "Cited Action Phrases-based Method (CiAPM)" and the "Related Action Phrases-based Method (RAPM)". CiAPM uses phrases cited from user utterances without replacing the relevant text. RAPM retrieves input-related concepts using the semantic network ConceptNet, which contains natural language phrases. The method is outlined in Figure 3.

Figure 3: Overview of the proposed Related Action Phrases-based Method (RAPM) and utterance generation examples. [The diagram shows the pipeline for the input "Getting good grades is hard for me.": key phrase extraction ("getting_grade"), related phrase extraction from ConceptNet ("attending class" [UsedFor], "taking finals" [Causes]), and utterance generation with templates using related phrases, producing "What else can I use for getting good grades except attending class?"]

ConceptNet

ConceptNet is a large-scale semantic network providing general human knowledge [Speer and Havasi, 2012] expressed in natural language. It includes words, common phrases, and the relations between them.

In the course of our study towards better system utterances, we considered employing the sequence-to-sequence model introduced in [Cho et al., 2014]. Inspired by [Vinyals and Le, 2015], we tried to apply this model to build a conversational system. However, it was difficult to find a training corpus of sufficient size and compatibility with our objective: conversational practice for language learning. Mainly because of this difficulty, we abandoned this attempt after a few trials.

At that time, the latest iteration of ConceptNet was announced, which can be regarded as reliable, up-to-date, and one of the biggest freely available common sense knowledge resources. Commonsensical utterances are known to be a factor enriching the naturalness of system responses; consequently, they enhance users' will to continue conversations [Rzepka et al., 2005]. Therefore, we adopted ConceptNet, which includes knowledge from ConceptNet 5 (http://conceptnet5.media.mit.edu) and many different sources, in our methods.

Extracting the Key Phrase and Related Phrases

CoAPM identifies single words, so it cannot handle phrasal expressions. CiAPM and RAPM, which detect phrases consisting of a gerund and a noun, can handle multi-word expressions in a limited syntactic form, but they do not cover inflections of the phrase or other syntactic forms. For example, CiAPM and RAPM will detect "making a mistake", but ignore variations such as "made a mistake" or the phrasal verb "break down".

In the first step, the method parses the input utterances with the Stanford POS tagger to detect action phrases consisting of the -ing (gerund) form of a verb and a noun. Articles and adjectives between the verb and the noun are also captured. This form of action phrase is selected as the key phrase because such phrases play various grammatical roles in English sentences without inflection and, to a certain degree, represent the semantic essence of utterances. At this stage, we detect only action phrases in gerund form, without lemmatization, which facilitates the maintenance of grammatical validity; however, a fully developed system should respond to any utterance, requiring a more flexible method. If there is more than one action phrase in the input, the method selects the first one, based on the assumption that in English the first phrase has priority over the other action phrases in the utterance context. The extracted phrase is transformed into a query phrase for the ConceptNet API. Next, RAPM searches ConceptNet using this key phrase as a query. Finally, the method extracts the related action phrases, in natural language form, from the results. The phrase-extraction process is demonstrated in Table 2.

Table 2: Example of key phrase and related phrases extraction.
  User utterance:  "I was reading a newspaper, listening to music."
  Key phrase:      "reading_newspaper"
  Related phrases  (HasSubevent: "learning about current events")
  and relations:   (HasPrerequisite: "getting a newspaper")
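A lookup of this kind might be sketched as below, assuming ConceptNet 5's public HTTP API (whose exact endpoint and JSON layout may differ across versions); the helper names are ours, and the relation/phrase extraction is shown against a response dictionary rather than a live call.

```python
import urllib.parse

API = "http://api.conceptnet.io/query"  # public ConceptNet 5 endpoint (assumed)

def query_url(key_phrase, lang="en", limit=20):
    """Build a ConceptNet query URL for a key phrase such as
    'reading_newspaper' (spaces become underscores in concept URIs)."""
    node = "/c/%s/%s" % (lang, key_phrase.replace(" ", "_"))
    return API + "?" + urllib.parse.urlencode({"node": node, "limit": limit})

def related_phrases(response_json, key_phrase):
    """Extract (relation, phrase) pairs from a ConceptNet-style response,
    keeping the side of each edge that is not the queried concept."""
    key = key_phrase.replace(" ", "_").lower()
    pairs = []
    for edge in response_json.get("edges", []):
        rel = edge["rel"]["label"]
        start, end = edge["start"]["label"], edge["end"]["label"]
        other = end if key in start.replace(" ", "_").lower() else start
        pairs.append((rel, other))
    return pairs
```

The returned natural-language labels (e.g., "learning about current events") can then be slotted directly into the RAPM templates.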
Utterance Generation (RAPM / CiAPM)

To generate responses with the proposed methods (RAPM and CiAPM), a related phrase or a phrase cited from the input is applied to a template. The related phrase and the template are selected randomly. The templates were manually prepared based on the analysis of the first method (Section 3.3). They were divided into two types: templates for any relation and templates for specific relations. Referring to the statistics of common relations in ConceptNet 5 [Ferschke et al., 2013], we selected 11 relations, namely IsA, PartOf, RelatedTo, HasProperty, UsedFor, DerivedFrom, Causes, CapableOf, MotivatedByGoal, HasSubevent, and HasPrerequisite. CiAPM applies phrases to the former type of templates, without using relations. In the template examples of Figure 4, 'V-ing N' denotes an action phrase comprising a verb in gerund form and a noun.

Figure 4: Examples of templates for RAPM / CiAPM.
  Templates for any relation:
    Talking about [V-ing N (related phrase)]... What is your opinion on that topic?
    Speaking of that, what do you think about [V-ing N (related phrase)]?
  Templates for specific relations:
    relation RelatedTo:
      Often [V'-ing N' (action phrase from input)] and [V-ing N (related phrase)] are a good combination. What do you think?
    relation HasProperty:
      What about [V-ing N (related phrase)] while [V'-ing N' (action phrase from input)]?

Error Correction

To improve the tutoring ability of our method, we aim at detecting spelling and grammatical mistakes in users' utterances. We integrate LanguageTool, an open-source writing style (including spelling) and grammar checker, calling it as a service via its HTTP API. Our method indicates errors in English usage by presenting a candidate correction together with the error description message returned by LanguageTool. The correction candidate is taken from the top of the suggestion list generated by LanguageTool, in "Recast" form, which was preferred in the preliminary survey described in Section 3.1, and is displayed before the method's utterance.

3 Experiments and Results

3.1 Survey on Error Correction Methods

Since we plan to equip our system with a function that detects mistakes in users' utterances and conveys these mistakes to the users in the dialogue, we conducted a questionnaire about how people prefer to be corrected. Five evaluators (four male students in their early 20s and one male in his early 30s), selected from among the potential users of an automated tutor, chose their preference as learners from three error correction methods: "Explicit correction", "Recast", and "Prompt" (or "Elicitation") (see Table 3). These options were based on previous studies of error correction in second language classrooms [Loewen, 2007; Tedick, 1986]. "Explicit correction" refers to the direct indication and correction of mistakes. "Recast" is an implicit reformulation of errors into the correct form. "Prompt" induces self-correction instead of providing the corrected form. Among the many types of error correction, these three methods were selected for their efficiency and applicability to automatic dialogue generation methods.

This survey and the evaluation experiment of CoAPM in Section 3.2 were conducted online in a bundle. The survey presented participants with an erroneous utterance and its corrections by each method. The majority of evaluators answered that "Explicit correction" (40%) or "Recast" (40%) is preferable for learners, while the remaining 20% supported "Prompt" (Table 3). According to this result, "Explicit correction" and "Recast" were considered more suitable than "Prompt" for correcting errors in utterances, although a broader survey is needed to reach a more definite conclusion. The lower score for prompting might be related to the fact that we are not willing to keep people waiting and feel embarrassed when we are not sure what the correct form is. However, replacing a human teacher with a patient machine might significantly alter these results. This possibility requires evaluation in a future study.

3.2 CoAPM Evaluation

To see how learners react to generated utterances, we compared CoAPM with ELIZA [Weizenbaum, 1966]. A possible benchmark, CSIEC [Jia, 2009], mentioned in Section 1.2, utilizes the conversational history. Because we evaluated only one-turn utterance exchanges this time, we instead used ELIZA as a baseline, which is independent of the preceding conversation and whose utterance rules are freely available. We employed the Python implementation of ELIZA by Jez Higgins (http://www.jezuk.co.uk/cgi-bin/view/software/eliza).

As the user inputs, we used the utterances of English learners in the NICT JLE (Japanese Learner English) Corpus (https://alaginrc.nict.go.jp/nict_jle/index_E.html). This corpus comprises transcriptions of English oral proficiency interview tests taken by native Japanese speakers. The utterances include errors in English, some of which are tagged. Among the error-tagged data, we chose test takers' utterances including at least a verb and a noun that appear more than five times in the BNC, and applied them as the input data (to ensure that the utterances convey a rich meaning, the 10 most frequent verbs in the BNC, expected to include auxiliary and delexical verbs, were excluded from the condition). Under these restrictions, 19.6% of the examinees' utterances were used as potential inputs. We used error-tagged utterances for the convenience of evaluation when introducing the error suggestion function into our system. (In the evaluations, we used the original utterances without error corrections as inputs, so the examples may contain erroneous expressions.)
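The error-suggestion step described in the Error Correction subsection might look like the following sketch, assuming LanguageTool's standard /v2/check HTTP endpoint (`text` and `language` form fields; a `matches` list with `offset`, `length`, and `replacements` in the response). The `recast` helper is our illustrative formatting of the "Recast" style.

```python
import json
import urllib.parse
import urllib.request

def check(text, endpoint="https://api.languagetool.org/v2/check"):
    """Send text to a LanguageTool server (public or self-hosted)
    and return its list of error matches."""
    data = urllib.parse.urlencode({"text": text, "language": "en-US"}).encode()
    with urllib.request.urlopen(endpoint, data) as resp:
        return json.load(resp)["matches"]

def recast(text, matches):
    """Produce a 'Recast'-style correction: replace each flagged span
    with the top suggestion, working right to left so that earlier
    offsets remain valid after each substitution."""
    for m in sorted(matches, key=lambda m: m["offset"], reverse=True):
        if m.get("replacements"):
            top = m["replacements"][0]["value"]
            text = text[:m["offset"]] + top + text[m["offset"] + m["length"]:]
    return text
```

The recast string, together with LanguageTool's `message` field, would then be displayed before the system's own utterance, as described above.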
As mentioned above, the 10 most frequent verbs were excluded because they include verbs with low semantic content, such as auxiliary and delexical verbs, although a more principled approach could be taken.

We asked the five evaluators (described in Section 3.1) to assess each of 20 utterance pairs (identical for all evaluators). The evaluators were asked to rate the input and response utterances generated by the two methods in three categories, "grammatical naturalness", "semantic naturalness", and "motivation to keep studying as a learner", on a 5-point scale (where 1 indicates unnatural language or the lowest motivation to continue studying, and 5 denotes natural language or the highest motivation).

Table 3: Examples of error correction methods for the user utterance "I spend time listening music" and the survey results (the number of respondents is shown in parentheses).
  Method               Example             Respondents
  Explicit correction  "No, listening to"  40% (2)
  Recast               "listening to"      40% (2)
  Prompt               "listening..."      20% (1)

3.3 Results and Analysis (CoAPM)

Table 4 shows the average scores of all evaluators in each criterion for both systems, rated on a 1-5 scale. The inter-rater agreement of the five evaluators was 0.48 (Kendall's coefficient of concordance).

Table 4: Average scores in the three evaluation criteria (standard deviations are shown in parentheses).
                               CoAPM        ELIZA
  Grammatical naturalness      3.50 (1.25)  3.74 (1.45)
  Semantic naturalness         2.20 (1.43)  2.25 (1.49)
  Motivation to keep studying  2.17 (1.37)  2.39 (1.46)

On average, the preliminary version of our proposed method (CoAPM) scored slightly lower than ELIZA, although there was no statistically significant difference (p > 0.05) between CoAPM and ELIZA in any of the three evaluation criteria (Mann-Whitney U test: p = 0.42 for grammar, p = 0.29 for semantics, p = 0.21 for motivation).

Figure 5 shows how CoAPM and ELIZA responded to several input utterances.

Figure 5: Examples of CoAPM and ELIZA outputs
  Input  "In free time, I like to read books."
  CoAPM  "Does chapter read?"
  ELIZA  "Very interesting."

  Input  "What did you watch?"
  CoAPM  "Are we talking about watch?"
  ELIZA  "Please consider whether you can answer your own question."

The lower average scores of CoAPM (especially for grammatical and semantic naturalness) were mainly caused by insufficient utterance templates and incorrect POS analysis. Among more than 100 types of templates, the POS restrictions admitted only six templates for the 20 CoAPM utterances.

In addition, we presume that in second-language acquisition, the questioning or confirming style of ELIZA frequently surpassed the association-based strategy of CoAPM, although people preferred Modalin [Higuchi et al., 2008] over ELIZA during normal chatting with no educational inclination. This implies that follow-up questions are often more important than input-related statements in language tutoring tasks. Because we evaluate each turn (each utterance pair) separately, we set all templates as interrogatives here. However, a deployed system should acknowledge, as well as question, a user's utterance.

3.4 CiAPM and RAPM Evaluation

The following five systems were experimentally evaluated:

  Baselines:
    (I) ELIZA
    (II) ALICEBOT
  Proposed methods:
    (III) CiAPM
    (IV) RAPM-NOREL (not using relations)
    (V) RAPM-REL (using relations)

We used the same implementation of ELIZA as in the previous experiment (Section 3.2). For conversation, ALICEBOT needs an AIML (Artificial Intelligence Markup Language) set, which contains the contents of the ALICE brain written in AIML; we adopted the standard free AIML set "AIML-en-us-foundation-ALICE" (https://code.google.com/archive/p/aiml-en-us-foundation-alice/). By comparing with ELIZA and ALICEBOT, we expected to observe whether chatting with simple dialogue systems is intrinsically efficient for educational purposes. Although this choice is debatable, we believe that among conversational systems, ELIZA and ALICEBOT have been well known and widely cited because other rule-based dialogue systems adopt similar processing of scarce context, or are commercially unavailable or have undisclosed specifications.

We created two versions of RAPM, which generate utterances from different templates. Specifically, RAPM-NOREL employs the templates for any relation, while RAPM-REL utilizes the templates for specific relations.

The user inputs were sentences from English learners' utterances extracted from the NICT JLE Corpus described in Section 3.2. Considering that there were long utterances consisting of many sentences, we used single sentences here. From the test takers' utterances, we selected sentences including at least one action phrase comprising a verb in gerund form and a noun. This condition was set on the assumption that action phrases give sentences a richer context and facilitate the generation of grammatically correct utterances. Under this condition, 6.12% of the examinees' original sentences were retained as potential user inputs.

Table 5: Average scores in the six evaluation criteria (A - F) and standard deviations (in parentheses). CiAPM scored highest in every criterion except (D), where RAPM-NOREL scored highest; asterisks mark a statistically significant difference between the model and ELIZA scores.
              (A)          (B)          (C)          (D)           (E)          (F)
  ELIZA       2.35 (1.19)  2.80 (1.28)  2.57 (1.21)  2.32 (1.16)   2.50 (1.16)  2.72 (1.21)
  ALICEBOT    2.78 (1.21)  2.67 (1.29)  2.90 (1.31)  2.65 (1.14)   2.80 (1.09)  2.88 (1.32)
  CiAPM       3.13 (1.09)  3.15 (1.11)  3.37 (1.09)  3.02* (0.94)  3.20 (0.94)  3.17 (0.95)
  RAPM-NOREL  3.00 (1.12)  2.82 (1.09)  3.10 (1.14)  3.23* (0.91)  3.03 (0.95)  2.92 (1.08)
  RAPM-REL    2.97 (1.21)  2.62 (1.17)  3.03 (1.20)  3.18* (0.89)  2.88 (1.05)  2.90 (1.23)
The evaluators were six male Japanese university students majoring in science (three undergraduates and three graduate students, all in their 20s), who were potential targets of a full-fledged tutoring system. The subjects were intermediate English learners with basic knowledge of English grammar and vocabulary, but with low proficiency, especially in speaking.

The six evaluators assessed the utterances generated by all five systems in response to each of 10 inputs chosen randomly from the test takers' utterances. (The examinees' utterances were originally separated from the interviewer's utterances in the corpus.) That is, each evaluator was given the same 50 utterances from the systems. The participants received the pairs of utterances in a fixed order; in contrast, in the earlier CoAPM evaluation (Section 3.2), the utterance pairs were presented in mixed order.

The system utterances were rated on a 5-point scale (where 1 means 'poor' and 5 represents 'excellent') in the following six categories:

  (A) Will to continue the conversation
  (B) Semantic naturalness of the dialogue
  (C) Appropriateness for English conversation practice
  (D) Vocabulary richness
  (E) Knowledge richness
  (F) Human-likeness of the system

These evaluation criteria were based on the benchmark used in related work [Higuchi et al., 2008]. However, by focusing on action phrases, the proposed methods are supposed to ensure a degree of grammatical naturalness in the utterances. Therefore, the original criterion "grammatical naturalness of dialogues" was changed to "appropriateness for English conversation practice", which is considered more important for evaluating English-teaching dialogue systems. In the "vocabulary richness" evaluation, we expected subjects to rate utterances on a scale from "laconic" to "wordy". Some of these criteria could be evaluated by specialists familiar with English education, or at least by native English speakers. However, at this stage of our project, we focus on the user experience of learners, who are easily bored with learning. Therefore, we set the criteria in terms of the user experience, expecting evaluation from the learners' standpoint. In the questionnaire, the criteria (without specific descriptions) were presented to the evaluators in Japanese.

Figure 6: Examples of outputs in the CiAPM / RAPM evaluation
  Input       "And a woman is playing piano."
  ELIZA       "Let's change focus a bit... Tell me about your family."
  ALICEBOT    "Aren't all woman playing piano?"
  CiAPM       "Let's talk about playing piano. What do you think about it?"
  RAPM-NOREL  "Learning something about music... why some people love it and some don't?"
  RAPM-REL    "I guess you were learning something about music - can you tell me how you did that?"

  Input       "And they enjoyed eating delicious food and alcohol."
  ELIZA       "How does that make you feel?"
  ALICEBOT    "Who, specifically?"
  CiAPM       "Let's talk about eating delicious food. What do you think about it?"
  RAPM-NOREL  "Talking about diminishing your own hunger... What is your opinion on that topic?"
  RAPM-REL    "I guess you were diminishing your own hunger - can you tell me how you did that?"

3.5 Results and Analysis (CiAPM / RAPM)

Table 5 shows the average scores and standard deviations of all evaluators in each criterion for the five systems (rated from 1 to 5). The Kendall's coefficient of concordance among the six raters was 0.369. One of the proposed methods, RAPM-NOREL, with templates not using relations, scored highest in "vocabulary richness" (criterion D) and second-highest in the other criteria. In all criteria except vocabulary richness, CiAPM achieved the highest score. The other proposed method, RAPM-REL, also achieved a high average score in vocabulary richness. According to the Steel-Dwass test (evaluated by the asymptotic method), the "vocabulary richness" (D) scores of our three methods significantly differed from the ELIZA score (p < 0.05), but no statistically significant differences were observed in the other criteria. Figure 6 shows some responses of each method to different input utterances.
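Both experiments report inter-rater agreement as Kendall's coefficient of concordance (0.48 and 0.369). A minimal computation, under the basic formula W = 12S / (m^2 (n^3 - n)) and without the tie correction that tied 5-point scores strictly call for, might look like this (function names are ours):

```python
def ranks(scores):
    """Convert one rater's scores into ranks, averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    r = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of tied rank positions
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def kendalls_w(ratings):
    """Kendall's W for m raters scoring the same n items.
    `ratings` is a list of m score lists, one per rater."""
    m, n = len(ratings), len(ratings[0])
    rank_matrix = [ranks(r) for r in ratings]
    totals = [sum(row[i] for row in rank_matrix) for i in range(n)]
    mean = sum(totals) / n
    s = sum((t - mean) ** 2 for t in totals)  # spread of rank sums
    return 12.0 * s / (m * m * (n ** 3 - n))
```

W ranges from 0 (no agreement) to 1 (perfect agreement); with many ties, as here, the uncorrected value slightly underestimates agreement.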
(D)” score of our three methods significantly differed from In the questionnaire, the criteria (without specific descrip- the ELIZA score (p < 0.05), but no statistically significant dif- tions) were presented to the evaluators in the Japanese lan- ferences were observed in the other criteria. Figure 6 shows guage. some responses of each method to different input utterances. 39 The result suggests that the input-related phrases from this observation extends to other cultural backgrounds and ConceptNet are useful to expand the vocabulary of the sys- individuals, broader experiments with more evaluators are tem, and hopefully that of interacting users. For instance, the needed. Furthermore, the evaluated conversations were very input “And a woman is playing piano.” elicited the responses short, limited to one-turn dialogue (a user’s utterance and “Learning something about music... why some people love it the corresponding system utterances). Whether the proposed and some don’t?” (RAPM-NONREL) and “Let’s talk about methods maintain users’ interest in an actual conversation playing piano. What do you think about it?” (CiAPM). The cannot be known at this stage. For this purpose, we must retrieved phrase ‘learning something about music’, which had evaluate a fully developed system on multiple turns of free a relation to the input phrase ‘playing piano’, and appears to conversation. In long conversations for second language ac- enrich the vocabulary over merely repeating the input phrase. quisition, a system that generates only repetitive utterances In RAPM-NONREL and CiAPM, the criterion “vocabulary would bore users. The wide vocabulary of RAPM, providing richness” was rated 4 by 6/6 and 2/6 evaluators, respectively. related topics to user utterances, could potentially mitigate This example indicates the potential usefulness of expanding conversational deadlocks. 
Thus, combining the two methods (one producing repetitive utterances, the other introducing related topics) might be more efficient for language tutoring tasks.

However, when the action phrases from a user input are inserted into the system output, the utterances may sound more natural, as demonstrated in the following example. The input “And they enjoyed eating delicious food and alcohol.” brought the outputs “Let’s talk about eating delicious food. What do you think about it?” (CiAPM) and “Talking about diminishing your own hunger... What is your opinion on that topic?” (RAPM-NONREL). In this case, a discussion about human needs, suggested by relating ‘diminishing your own hunger’ to ‘eating delicious food’, would be a good topic for a deeper conversation. However, the preferred conversational topic depends on the user, his or her interests, and his or her English level. For this reason, the repeating method (CiAPM) is considered to score above the other methods on average in all criteria except vocabulary richness.

We presume that the related concepts in ConceptNet are not always compatible with the dialogue context. In such cases, the responses are unsuited to the user’s needs. This could be partly attributable to the random selection of the related concepts. To avoid wandering away from the subject of the conversation, the related phrases must be carefully chosen to suit the context and the individual user, especially when applying phrases together with their relations. In future work, the random selection must be replaced by a context processing module, a user profiler, and a language level estimator. A context processing module could select proper phrases by semantic analysis. Considering the ambiguity of multi-word expressions, detecting phrases after applying a topic model such as latent Dirichlet allocation might be useful for this purpose. In addition, complete reliance on ConceptNet, which lacks knowledge of some items and includes dubious entries, is also problematic.

4 Conclusion and Future Works

We proposed methods that automatically generate utterances for an English language tutor, and compared their performance with that of classic chatbots. Specifically, we evaluated how the generated expressions were received by Japanese subjects. Although our small-scale experiment does not yet allow drawing conclusions about the stickiness level of these approaches, we found that ELIZA-like outputs offer more encouragement to users than Web- or commonsense-based approaches. These findings oppose those of [Rzepka et al., 2005], who evaluated non-learning dialogues. In enriching the vocabulary of the system utterances, the proposed methods showed their superiority, which could potentially help improve users’ command of a foreign language.

However, using external corpora or crowd-sourced knowledge sources might incur serious drawbacks. Allowing the tutor excessive freedom, especially in learning material beyond the preferences of the user, risks misuse, as has occurred with Microsoft’s Tay and other chatbots [Michael, 2016]. In our approach, the adoption of hand-crafted syntactic rules seems to be the only restriction, but through majority voting in both the British National Corpus-based and the ConceptNet-based methods we indirectly try to avoid semantic strangeness. This does not mean that corpora guarantee safe communication, and some topic restrictions might be needed from the outset. However, completely blocking slang and offensive words can be problematic, especially when considering the more sophisticated personality modeling required in longer-term conversational sessions.
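The two generation strategies compared above (CiAPM repeating the user’s own action phrase, RAPM-NONREL substituting a phrase retrieved from ConceptNet) can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the hard-coded dictionary stands in for ConceptNet retrieval, and the function names and the fallback to repetition are our assumptions.

```python
import random

# Stand-in for ConceptNet retrieval: phrases connected to the
# input action phrase by some relation (illustrative entries only).
RELATED = {
    "playing piano": ["learning something about music"],
    "eating delicious food": ["diminishing your own hunger"],
}

# Template shapes follow the example outputs quoted in the analysis.
CIAPM_TEMPLATE = "Let's talk about {phrase}. What do you think about it?"
RAPM_TEMPLATE = "Talking about {phrase}... What is your opinion on that topic?"

def ciapm(action_phrase):
    # Repetition-based response: reuse the user's own action phrase.
    return CIAPM_TEMPLATE.format(phrase=action_phrase)

def rapm_nonrel(action_phrase):
    # Topic-expanding response: substitute a retrieved related phrase,
    # currently chosen at random (the limitation discussed above).
    candidates = RELATED.get(action_phrase)
    if not candidates:
        return ciapm(action_phrase)  # assumed fallback to repetition
    return RAPM_TEMPLATE.format(phrase=random.choice(candidates))

print(ciapm("eating delicious food"))
print(rapm_nonrel("eating delicious food"))
```

In the real system the candidate list would come from ConceptNet relations, and, as argued above, the random choice should eventually give way to context-aware selection.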
As wrong inputs were not corrected, the open-source checker found no mistakes. We might require a more powerful error correction approach. For error detection and correction suggestions, a promising solution is a Grammatical Error Correction (GEC) system based on the Neural Machine Translation (NMT) approach [Yuan, 2017]. In the experiments of [Yuan, 2017], the NMT-based GEC system outperformed the SMT (Statistical Machine Translation)-based system even on the difficult subject-verb agreement problem.

From these results we can assume that repetition for confirmation plays an important part in conversation practice by Japanese learners of English.

As the next step, we plan to combine our method with language level estimation and vocabulary acquisition support algorithms [Mazur, 2016]. Error corrections could be improved with annotated data11, taking into account that Japanese students often make non-word spelling errors (producing non-existent spellings) [Nagata and Neubig, 2017]. Although our dialogue system is not yet ready for long-run conversational sessions, we should experiment on the tutor’s autonomy level in choosing topics related to the user’s input prior to larger-scale testing. We plan to analyze which outputs are potentially harmful, and to determine appropriate countermeasures against such expressions.

11 http://www.gsk.or.jp/catalog/gsk2016-b/

Acknowledgements

The authors would like to thank DENSO CORPORATION for funding this work. We are grateful to the anonymous reviewers for their detailed comments and helpful suggestions.
References

[Chen, 2014] Yi-Cheng Chen. An empirical examination of factors affecting college students’ proactive stickiness with a web-based English learning environment. Computers in Human Behavior, 31:159–171, 2014.

[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.

[Doyon, 2000] Paul Doyon. Shyness in the Japanese EFL class: Why it is a problem, what it is, what causes it, and what to do about it. The Language Teacher, 24(1):11–16, 2000.

[Ferschke et al., 2013] Oliver Ferschke, Johannes Daxenberger, and Iryna Gurevych. The People’s Web meets NLP. In Theory and Applications of Natural Language Processing, pages 121–160. Springer, 2013.

[Finkel et al., 2005] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL ’05), pages 363–370, Morristown, NJ, USA, 2005. Association for Computational Linguistics.

[Higuchi et al., 2008] Shinsuke Higuchi, Rafal Rzepka, and Kenji Araki. A casual conversation system using modality and word associations retrieved from the Web. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’08), pages 382–390, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.

[Jia, 2009] Jiyou Jia. CSIEC: A computer assisted English learning chatbot based on textual knowledge and reasoning. Knowledge-Based Systems, 22(4):249–255, 2009.

[Lee, 2016] Peter Lee. Learning from Tay’s introduction. https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/, 2016. (accessed May 8, 2017).

[Lison and Tiedemann, 2016] Pierre Lison and Jörg Tiedemann. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), 2016.

[Loewen, 2007] Shawn Loewen. Error correction in the second language classroom. Clear News, 11(12):1–7, 2007.

[Mazur, 2016] Michal Mazur. A Study on English Language Tutoring System Using Code-Switching Based Second Language Vocabulary Acquisition Method. PhD thesis, Hokkaido University, 2016. https://eprints.lib.hokudai.ac.jp/dspace/handle/2115/61833.

[Michael, 2016] Katina Michael. Science fiction is full of bots that hurt people: ... but these bots are here now. IEEE Consumer Electronics Magazine, 5(4):112–117, 2016.

[Nagata and Neubig, 2017] Ryo Nagata and Graham Neubig. Construction of a Japanese EFL learner corpus for a study of spelling mistakes (in Japanese). In Proceedings of the Twenty-third Annual Meeting of the Association for Natural Language Processing, pages 1030–1033, 2017.

[Rzepka et al., 2005] Rafal Rzepka, Yali Ge, and Kenji Araki. Naturalness of an utterance based on the automatically retrieved commonsense. In Proceedings of IJCAI 2005, the Nineteenth International Joint Conference on Artificial Intelligence, pages 996–998, Edinburgh, Scotland, August 2005.

[Speer and Havasi, 2012] Robert Speer and Catherine Havasi. Representing general relational knowledge in ConceptNet 5. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May 2012.

[Tedick, 1986] Diane J. Tedick. Research on error correction and implications for classroom. ACIE Newsletter, 1986.

[Tiedemann, 2012] Jörg Tiedemann. Parallel data, tools and interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), pages 2214–2218, 2012.

[Toutanova and Manning, 2000] Kristina Toutanova and Christopher D. Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, volume 13, pages 63–70, Morristown, NJ, USA, 2000. Association for Computational Linguistics.

[Toutanova et al., 2003] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL ’03), volume 1, pages 173–180, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[Vinyals and Le, 2015] Oriol Vinyals and Quoc Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.

[Weizenbaum, 1966] Joseph Weizenbaum. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45, 1966.

[Yuan, 2017] Zheng Yuan. Grammatical error correction in non-native English. Technical report, University of Cambridge, Computer Laboratory, 2017.