Be More Eloquent, Professor ELIZA – Comparison of Utterance Generation Methods for an Artificial Second Language Tutor

Taku Nakamura (1), Rafal Rzepka (2), Kenji Araki (2), and Kentaro Inui (1)
(1) Graduate School of Information Science, Tohoku University — {tnakamura, inui}@ecei.tohoku.ac.jp
(2) Graduate School of Information Science and Technology, Hokkaido University — {rzepka, araki}@ist.hokudai.ac.jp

Abstract

This paper presents utterance generation methods for artificial foreign language tutors and discusses some problems of more autonomous educational tools. To tackle the problem of keeping learners interested, we propose a hybrid approach, half automatic (for semantics) and half rule-based (for syntax), that utilizes topic expansion by retrieving conversational subjects related to users' utterances. We compared the utterances generated by our methods with those of other dialogue systems. The evaluation results show that topic expansion enriches the vocabulary of the utterances. On the other hand, ELIZA-like confirmations and follow-ups were preferred by Japanese subjects when the goal was practicing conversational English. Although our project is in its initial stage, we have decided to share our findings and thoughts on autonomy resulting from various trials, and thereby spark a discussion on the pros and cons of the next generation of teaching applications.

1 Introduction

Applications supporting second language acquisition have evolved from simple flashcards for memorizing words to more sophisticated tools using gamification, voice analysis, etc. Computer applications and socializing online help to improve stickiness [Chen, 2014], the lack of which is often one of the biggest obstacles to mastering a given subject. This problem is visible in software solutions that demand no involvement from tutors and other peers (which is the majority of self-study mobile applications), but software-led teaching is preferable in scenarios where learners wish to improve their skills without feeling ashamed. Our task, helping Japanese speakers practice their communication skills in English, is an example of such a scenario. Japanese students are not eager to use the language in everyday life for social and cultural reasons [Doyon, 2000], although they are often interested in foreign languages and possess wide knowledge of grammar and vocabulary. An artificial tutor is one possible solution, and we decided to start a project aimed at creating a chat system that could be not only a conversational partner, but also a second language acquisition supporter that learns user preferences (from language level to hobbies).

However, there are difficult problems to be solved. First, the system must be linguistically correct. Second, the autonomy level of the software is important when external corpora are used as its world knowledge. Presumably, conversation controlled by artificial templates must be balanced with the learning of a user's preferences and topic retrieval from big textual data, which is more interesting but can be dangerous when left completely uncontrolled (as in the case of Microsoft's Tay bot [Lee, 2016]).

The present paper introduces our prototype methods, which focus on providing responses affected by learners' utterances, based on rules and comparatively reliable knowledge resources. It does not necessarily extend the state-of-the-art techniques in the language generation domain per se, but we believe it is more efficient for this specific educational purpose. We compare various utterance generation methods, present experimental results, and discuss other findings, including user preferences for error corrections. The paper concludes with ideas for measures that could be implemented to maintain a balance between interesting and potentially dangerous Web-based tutors.

1.1 Traditional vs. Web-based Dialogue Systems

Well-known chatbots are ELIZA [Weizenbaum, 1966] and ALICEBOT (http://www.alicebot.org). ELIZA can respond to any input, but never provides new topics related to the user's utterances. ALICEBOT responds based on manually created databases. Although creating or extending such databases expands the conversational topics, it is costly and nearly impossible to build a database that covers many fields and a broad range of users' interests.

Modalin [Higuchi et al., 2008] is a Japanese text-based dialogue system that uses word associations retrieved from the Web and randomly adds modality to generated utterances. To sustain motivated conversation with users, Modalin generates input-related utterances using word associations. Presuming that a similar approach could enhance the conversation opportunities for English learners, we adopt the idea of word associations in our proposed system.

1.2 System for Language Learning

Jia [Jia, 2009] developed the CSIEC (Computer Simulation in Educational Communication) system with multiple functions for English learning, including a chatbot as a conversational partner. The CSIEC system has a free conversation function based on textual knowledge and reasoning, aiming to overcome the problem of ELIZA-like systems, which require numerous predefined patterns fitted to the various utterances of users. The author suggested that databases for the system responses
can be enriched by users' inputs, although such databases need to be created beforehand. The CSIEC system still had insurmountable content shortcomings, and the project has been discontinued.

2 System Overview

2.1 CoAPM

Figure 1 outlines our first proposed method, the Co-occurring Action Phrases-based Method (CoAPM). CoAPM adopts the word associations utilized in Modalin [Higuchi et al., 2008], on the hypothesis that input-related utterances could maintain users' interest in the conversation.

Figure 1: Overview of the proposed Co-occurring Action Phrases-based Method (CoAPM) and utterance generation examples. [The diagram shows the pipeline for the input "I will buy a ticket.": keyword extraction ("buy", "ticket"), association word extraction from the British National Corpus, candidate word generation (e.g., "go", "trains"), and utterance generation with templates based on the movie subtitles corpus, producing "Are we talking about trains?"]

The present research applies this idea to English by replacing the Web with the British National Corpus (BNC; version 3, BNC XML Edition, 2007, distributed by Oxford University Computing Services on behalf of the BNC Consortium, http://www.natcorp.ox.ac.uk/). The BNC was chosen primarily because Web search engines restrict the number of searches, and because the BNC (being taken from trusted sources such as newspapers and books) is expected to contain more correct English than other Web-based corpora. Therefore, the English in the BNC was deemed suitable for educational purposes. Learners of English as a second language, who will mainly use common English, need not necessarily be familiar with native standard English, especially with natural expressions that rarely appear in textbooks. Nevertheless, resources with more input from non-native contributors might contain dialects proper to specific regions, which could baffle some learners, whereas the BNC maintains a more unified style with less potential for confusion. Thus, we assume that a standard English corpus such as the BNC is still useful for realizing a system acting as a widely acceptable English teacher.

Extracting Keywords and Word Associations

In the first step, the method analyzes users' utterances with the Stanford Log-linear Part-Of-Speech Tagger (POS Tagger) [Toutanova and Manning, 2000; Toutanova et al., 2003] to spot query keywords for extracting word association lists. As the query keywords, we selected nouns and verbs (excluding some stop words) because they constitute the core semantic elements of English sentence structures and, to some extent, describe the context of the utterances. This concentration also helps to reduce the cost of exact co-occurrence matching when searching for words of interest. Nouns identified as proper nouns by the POS Tagger are further analyzed with the Stanford Named Entity Recognizer (NER) [Finkel et al., 2005] and are assigned labels such as "PERSON", "LOCATION", and "ORGANIZATION". In the next step, the method searches the BNC using these keywords (nouns or named entities, and verbs) as queries and extracts the sentences containing them. Nouns and verbs in the extracted sentences are listed and sorted in frequency order as word associations. This process is exemplified in Table 1.

Table 1: Keywords and association words extracted from the user utterance "I drink a glass of water in the morning."
  keywords:           'drink', 'glass', 'water', 'morning'
  association verbs:  'rising', 'braised', 'cooked', 'fried', 'chopped'
  association nouns:  'fruit', 'juice', 'glasses', 'piece', 'salad'

Generation of Word Candidates for Utterances

Using the sorted lists of extracted nouns and verbs related to the input keywords, the method generates a single verb and noun pair from the most frequent words in the lists. This verb-noun pair can be a candidate for utterance generation. To verify the existence of the verb-noun combination, the method then checks for co-occurrences of the given pair in the BNC. That is, the method first selects the top noun and the top verb from the word associations, and then searches for their co-occurrence within single sentences of the BNC using exact matching. Even if only one pair is found in the BNC, the verb-noun combination is regarded as possible in English. If the noun and verb are not found in the same sentence of the corpus, the method tests another verb-noun pair (the second most frequent verb and the top noun in the list). The method repeats this process up to the three most frequent verbs and nouns, advancing to the next verb in stepwise fashion until a proper combination is found. We prioritize nouns because of the assumption that nouns describe the context of an utterance more specifically than verbs, which influence a topic shift more often. However, this assumption must be confirmed empirically in the future.
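The stepwise pair selection described above can be sketched as follows. This is a minimal sketch under our own naming (`choose_pair`, `make_cooccurs` are illustrative); the in-memory sentence scan stands in for the exact-match search over the full BNC index.

```python
from itertools import product

def choose_pair(assoc_nouns, assoc_verbs, cooccurs):
    """Pick the first verb-noun pair that co-occurs in at least one
    corpus sentence. Nouns are prioritized: for each of the top three
    nouns, the top three verbs are tried in stepwise fashion."""
    for noun, verb in product(assoc_nouns[:3], assoc_verbs[:3]):
        if cooccurs(verb, noun):  # at least one sentence contains both
            return verb, noun
    return None  # no acceptable combination among the top candidates

def make_cooccurs(sentences):
    """Naive stand-in for the BNC co-occurrence check: scan tokenized
    sentences for both words."""
    def cooccurs(verb, noun):
        return any(verb in s and noun in s for s in sentences)
    return cooccurs
```

The association lists are assumed to be already sorted by corpus frequency, so the first successful pair is also the most frequent plausible one.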
Utterance Generation

A CoAPM response is generated by applying the proposed verb-noun pair, or one of its members, to a template. We prepared the utterance templates half-manually, based on the most frequent sentences in English movie subtitles retrieved from the OPUS corpus (http://opus.lingfil.uu.se/OpenSubtitles.php) [Tiedemann, 2012; Lison and Tiedemann, 2016]. The sentences were automatically abstracted using POS tagging and NER, then ranked by frequency. Movie subtitles were selected for their adequately large corpus size and their potential suitability for conversational templates. Examples of the templates are shown in Figure 2. Using POS tag analysis, the method selects the templates that fit the proposed words or words in the user's input. It then randomly selects a template and applies the previously chosen candidate words or input words. To confirm the correctness of the expression in an applied template, the method searches for the core phrase of the given template (such as "visit* Tokyo" for "Would you like to visit Tokyo?", where * is a wildcard matching various forms of a verb; in this case visited, visits, or visiting) in the corpus by exact matching. If more than five matches occur in the BNC, the method outputs that template filled with the retrieved words or input words. The number of matches was set experimentally, accounting for the processing time and the validity of the output. If no template satisfies the condition, CoAPM tries another combination of candidate words.

Figure 2: Examples of CoAPM templates
  Speaking of (noun from user utterance), do you (retrieved verb)?
  Would you like to visit LOCATION?
  What do you think about (retrieved noun)s?
  Everybody (retrieved verb), right?
  Does (noun from user utterance) belong to ORGANIZATION?

2.2 CiAPM and RAPM

The BNC used in CoAPM contains formal and reliable English, which could be suitable for learners of English. However, the corpus covers few expressions of the latest events or trends. In our next models, we relied on a more up-to-date resource, the ConceptNet ontology (http://conceptnet.io/), enabling responses to ongoing topics. Based on the outcome and analysis of the first method's evaluation, which we describe in Section 3.3, we developed two variations of our second method, named the "Cited Action Phrases-based Method (CiAPM)" and the "Related Action Phrases-based Method (RAPM)". CiAPM uses phrases cited from user utterances without replacing the relevant text. RAPM retrieves input-related concepts using the semantic network ConceptNet, which contains natural language phrases. The method is outlined in Figure 3.

Figure 3: Overview of the proposed Related Action Phrases-based Method (RAPM) and utterance generation examples. [The diagram shows the pipeline for the input "Getting good grades is hard for me.": key phrase extraction ("getting_grade"), related phrase extraction from ConceptNet ("attending class" [UsedFor], "taking finals" [Causes]), and utterance generation with templates using related phrases, producing "What else can I use for getting good grades except attending class?"]

ConceptNet

ConceptNet is a large-scale semantic network providing general human knowledge [Speer and Havasi, 2012] expressed in natural language. It includes words, common phrases, and the relations between them.

In the course of our study towards better system utterances, we considered employing the sequence-to-sequence model introduced in [Cho et al., 2014]. Inspired by [Vinyals and Le, 2015], we tried to apply this model to build a conversational system. However, it was difficult to find a training corpus of sufficient size and compatibility with our objective: conversational practice for language learning. Mainly because of this difficulty, we abandoned this attempt after a few trials.

At that time, the latest iteration of ConceptNet was announced, which can be regarded as reliable, up-to-date, and one of the biggest freely available common sense knowledge resources. Commonsensical utterances are known to be a factor enriching the naturalness of system responses; consequently, they enhance users' will to continue conversations [Rzepka et al., 2005]. Therefore, we adopted ConceptNet, which includes knowledge from ConceptNet 5 (http://conceptnet5.media.mit.edu) and many different sources, in our methods.

Extracting the Key Phrase and Related Phrases

CoAPM identifies single words, so it cannot handle phrasal expressions. CiAPM and RAPM, which detect phrases consisting of a gerund and a noun, can handle multi-word expressions in a limited syntactic form, but they do not cover inflections of the phrase or other syntactic forms. For example, CiAPM and RAPM will detect "making a mistake", but ignore variations such as "made a mistake" or the phrasal verb "break down".

In the first step, the method parses the input utterances with the Stanford POS tagger to detect action phrases consisting of the -ing (gerund) form of a verb and a noun. Articles and adjectives between the verb and the noun are also captured. This form of action phrase is selected as the key phrase because such phrases play various grammatical roles in English sentences without inflection and, to a certain degree, represent the semantic essence of utterances. At this stage, we detect only action phrases in gerund form, without lemmatization, which facilitates the maintenance of grammatical validity; however, a fully developed system should respond to any utterance, requiring a more flexible method. If there is more than one action phrase in the input, the method selects the first one, based on the assumption that in English the first phrase has priority over the other action phrases in the utterance context. The extracted phrase is transformed into a query phrase for the ConceptNet API. Next, RAPM searches ConceptNet using this key phrase as a query. Finally, the method extracts the related action phrases, in natural language form, from the results. The phrase-extraction process is demonstrated in Table 2.

Table 2: Example of key phrase and related phrases extraction.
  User utterance:  "I was reading a newspaper, listening to music."
  Key phrase:      "reading_newspaper"
  Related phrases  (HasSubevent: "learning about current events")
  and relations:   (HasPrerequisite: "getting a newspaper")
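A lookup of this kind might be sketched as below, assuming ConceptNet 5's public HTTP API (whose exact endpoint and JSON layout may differ across versions); the helper names are ours, and the relation/phrase extraction is shown against a response dictionary rather than a live call.

```python
import urllib.parse

API = "http://api.conceptnet.io/query"  # public ConceptNet 5 endpoint (assumed)

def query_url(key_phrase, lang="en", limit=20):
    """Build a ConceptNet query URL for a key phrase such as
    'reading_newspaper' (spaces become underscores in concept URIs)."""
    node = "/c/%s/%s" % (lang, key_phrase.replace(" ", "_"))
    return API + "?" + urllib.parse.urlencode({"node": node, "limit": limit})

def related_phrases(response_json, key_phrase):
    """Extract (relation, phrase) pairs from a ConceptNet-style response,
    keeping the side of each edge that is not the queried concept."""
    key = key_phrase.replace(" ", "_").lower()
    pairs = []
    for edge in response_json.get("edges", []):
        rel = edge["rel"]["label"]
        start, end = edge["start"]["label"], edge["end"]["label"]
        other = end if key in start.replace(" ", "_").lower() else start
        pairs.append((rel, other))
    return pairs
```

The returned natural-language labels (e.g., "learning about current events") can then be slotted directly into the RAPM templates.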
Utterance Generation (RAPM / CiAPM)

To generate responses with the proposed methods (RAPM and CiAPM), a related phrase or a phrase cited from the input is applied to a template. The related phrase and the template are selected randomly. The templates were manually prepared based on the analysis of the first method (Section 3.3). They were divided into two types: templates for any relation and templates for specific relations. Referring to the statistics of common relations in ConceptNet 5 [Ferschke et al., 2013], we selected 11 relations, namely IsA, PartOf, RelatedTo, HasProperty, UsedFor, DerivedFrom, Causes, CapableOf, MotivatedByGoal, HasSubevent, and HasPrerequisite. CiAPM applies phrases to the former type of templates, without using relations. In the template examples of Figure 4, 'V-ing N' denotes an action phrase comprising a verb in gerund form and a noun.

Figure 4: Examples of templates for RAPM / CiAPM.
  Templates for any relation:
    Talking about [V-ing N (related phrase)]... What is your opinion on that topic?
    Speaking of that, what do you think about [V-ing N (related phrase)]?
  Templates for specific relations:
    relation RelatedTo:
      Often [V'-ing N' (action phrase from input)] and [V-ing N (related phrase)] are a good combination. What do you think?
    relation HasProperty:
      What about [V-ing N (related phrase)] while [V'-ing N' (action phrase from input)]?

Error Correction

To improve the tutoring ability of our method, we aim at detecting spelling and grammatical mistakes in users' utterances. We integrate LanguageTool, an open-source writing style (including spelling) and grammar checker, calling it as a service via its HTTP API. Our method indicates errors in English usage by presenting a candidate correction together with the error description message returned by LanguageTool. The correction candidate is taken from the top of the suggestion list generated by LanguageTool, in "Recast" form, which was preferred in the preliminary survey described in Section 3.1, and is displayed before the method's utterance.

3 Experiments and Results

3.1 Survey on Error Correction Methods

Since we plan to equip our system with a function that detects mistakes in users' utterances and conveys these mistakes to the users in the dialogue, we conducted a questionnaire about how people prefer to be corrected. Five evaluators (four male students in their early 20s and one male in his early 30s), selected from among the potential users of an automated tutor, chose their preference as learners from three error correction methods: "Explicit correction", "Recast", and "Prompt" (or "Elicitation") (see Table 3). These options were based on previous studies of error correction in second language classrooms [Loewen, 2007; Tedick, 1986]. "Explicit correction" refers to the direct indication and correction of mistakes. "Recast" is an implicit reformulation of errors into the correct form. "Prompt" induces self-correction instead of providing the corrected form. Among the many types of error correction, these three methods were selected for their efficiency and applicability to automatic dialogue generation methods.

This survey and the evaluation experiment of CoAPM in Section 3.2 were conducted online in a bundle. The survey presented participants with an erroneous utterance and its corrections by each method. The majority of evaluators answered that "Explicit correction" (40%) or "Recast" (40%) is preferable for learners, while the remaining 20% supported "Prompt" (Table 3). According to this result, "Explicit correction" and "Recast" were considered more suitable than "Prompt" for correcting errors in utterances, although a broader survey is needed to reach a more definite conclusion. The lower score for prompting might be related to the fact that we are not willing to keep people waiting and feel embarrassed when we are not sure what the correct form is. However, replacing a human teacher with a patient machine might significantly alter these results. This possibility requires evaluation in a future study.

3.2 CoAPM Evaluation

To see how learners react to generated utterances, we compared CoAPM with ELIZA [Weizenbaum, 1966]. A possible benchmark, CSIEC [Jia, 2009], mentioned in Section 1.2, utilizes the conversational history. Because we evaluated only one-turn utterance exchanges this time, we instead used ELIZA as a baseline, which is independent of the preceding conversation and whose utterance rules are freely available. We employed the Python implementation of ELIZA by Jez Higgins (http://www.jezuk.co.uk/cgi-bin/view/software/eliza).

As the user inputs, we used the utterances of English learners in the NICT JLE (Japanese Learner English) Corpus (https://alaginrc.nict.go.jp/nict_jle/index_E.html). This corpus comprises transcriptions of English oral proficiency interview tests taken by native Japanese speakers. The utterances include errors in English, some of which are tagged. Among the error-tagged data, we chose test takers' utterances including at least a verb and a noun that appear more than five times in the BNC, and applied them as the input data (to ensure that the utterances convey a rich meaning, the 10 most frequent verbs in the BNC, expected to include auxiliary and delexical verbs, were excluded from the condition). Under these restrictions, 19.6% of the examinees' utterances were used as potential inputs. We used error-tagged utterances for the convenience of evaluation when introducing the error suggestion function into our system. (In the evaluations, we used the original utterances without error corrections as inputs, so the examples may contain erroneous expressions.)
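The error-suggestion step described in the Error Correction subsection might look like the following sketch, assuming LanguageTool's standard /v2/check HTTP endpoint (`text` and `language` form fields; a `matches` list with `offset`, `length`, and `replacements` in the response). The `recast` helper is our illustrative formatting of the "Recast" style.

```python
import json
import urllib.parse
import urllib.request

def check(text, endpoint="https://api.languagetool.org/v2/check"):
    """Send text to a LanguageTool server (public or self-hosted)
    and return its list of error matches."""
    data = urllib.parse.urlencode({"text": text, "language": "en-US"}).encode()
    with urllib.request.urlopen(endpoint, data) as resp:
        return json.load(resp)["matches"]

def recast(text, matches):
    """Produce a 'Recast'-style correction: replace each flagged span
    with the top suggestion, working right to left so that earlier
    offsets remain valid after each substitution."""
    for m in sorted(matches, key=lambda m: m["offset"], reverse=True):
        if m.get("replacements"):
            top = m["replacements"][0]["value"]
            text = text[:m["offset"]] + top + text[m["offset"] + m["length"]:]
    return text
```

The recast string, together with LanguageTool's `message` field, would then be displayed before the system's own utterance, as described above.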
As mentioned above, the 10 most frequent verbs were excluded because they include verbs with low semantic content, such as auxiliary and delexical verbs, although a more principled approach could be taken.

We asked the five evaluators (described in Section 3.1) to assess each of 20 utterance pairs (identical for all evaluators). The evaluators were asked to rate the input and response utterances generated by the two methods in three categories, "grammatical naturalness", "semantic naturalness", and "motivation to keep studying as a learner", on a 5-point scale (where 1 indicates unnatural language or the lowest motivation to continue studying, and 5 denotes natural language or the highest motivation).

Table 3: Examples of error correction methods for the user utterance "I spend time listening music" and the survey results (the number of respondents is shown in parentheses).
  Method               Example             Respondents
  Explicit correction  "No, listening to"  40% (2)
  Recast               "listening to"      40% (2)
  Prompt               "listening..."      20% (1)

3.3 Results and Analysis (CoAPM)

Table 4 shows the average scores of all evaluators in each criterion for both systems, rated on a 1-5 scale. The inter-rater agreement of the five evaluators was 0.48 (Kendall's coefficient of concordance).

Table 4: Average scores in the three evaluation criteria (standard deviations are shown in parentheses).
                               CoAPM        ELIZA
  Grammatical naturalness      3.50 (1.25)  3.74 (1.45)
  Semantic naturalness         2.20 (1.43)  2.25 (1.49)
  Motivation to keep studying  2.17 (1.37)  2.39 (1.46)

On average, the preliminary version of our proposed method (CoAPM) scored slightly lower than ELIZA, although there was no statistically significant difference (p > 0.05) between CoAPM and ELIZA in any of the three evaluation criteria (Mann-Whitney U test: p = 0.42 for grammar, p = 0.29 for semantics, p = 0.21 for motivation).

Figure 5 shows how CoAPM and ELIZA responded to several input utterances.

Figure 5: Examples of CoAPM and ELIZA outputs
  Input  "In free time, I like to read books."
  CoAPM  "Does chapter read?"
  ELIZA  "Very interesting."

  Input  "What did you watch?"
  CoAPM  "Are we talking about watch?"
  ELIZA  "Please consider whether you can answer your own question."

The lower average scores of CoAPM (especially for grammatical and semantic naturalness) were mainly caused by insufficient utterance templates and incorrect POS analysis. Among more than 100 types of templates, the POS restrictions admitted only six templates for the 20 CoAPM utterances.

In addition, we presume that in second-language acquisition, the questioning or confirming style of ELIZA frequently surpassed the association-based strategy of CoAPM, although people preferred Modalin [Higuchi et al., 2008] over ELIZA during normal chatting with no educational inclination. This implies that follow-up questions are often more important than input-related statements in language tutoring tasks. Because we evaluate each turn (each utterance pair) separately, we set all templates as interrogatives here. However, a deployed system should acknowledge, as well as question, a user's utterance.

3.4 CiAPM and RAPM Evaluation

The following five systems were experimentally evaluated:

  Baselines:
    (I) ELIZA
    (II) ALICEBOT
  Proposed methods:
    (III) CiAPM
    (IV) RAPM-NOREL (not using relations)
    (V) RAPM-REL (using relations)

We used the same implementation of ELIZA as in the previous experiment (Section 3.2). For conversation, ALICEBOT needs an AIML (Artificial Intelligence Markup Language) set, which contains the contents of the ALICE brain written in AIML; we adopted the standard free AIML set "AIML-en-us-foundation-ALICE" (https://code.google.com/archive/p/aiml-en-us-foundation-alice/). By comparing with ELIZA and ALICEBOT, we expected to observe whether chatting with simple dialogue systems is intrinsically efficient for educational purposes. Although this choice is debatable, we believe that among conversational systems, ELIZA and ALICEBOT have been well known and widely cited because other rule-based dialogue systems adopt similar processing of scarce context, or are commercially unavailable or have undisclosed specifications.

We created two versions of RAPM, which generate utterances from different templates. Specifically, RAPM-NOREL employs the templates for any relation, while RAPM-REL utilizes the templates for specific relations.

The user inputs were sentences from English learners' utterances extracted from the NICT JLE Corpus described in Section 3.2. Considering that there were long utterances consisting of many sentences, we used single sentences here. From the test takers' utterances, we selected sentences including at least one action phrase comprising a verb in gerund form and a noun. This condition was set on the assumption that action phrases give sentences a richer context and facilitate the generation of grammatically correct utterances. Under this condition, 6.12% of the examinees' original sentences were retained as potential user inputs.

Table 5: Average scores in the six evaluation criteria (A - F) and standard deviations (in parentheses). CiAPM scored highest in every criterion except (D), where RAPM-NOREL scored highest; asterisks mark a statistically significant difference between the model and ELIZA scores.
              (A)          (B)          (C)          (D)           (E)          (F)
  ELIZA       2.35 (1.19)  2.80 (1.28)  2.57 (1.21)  2.32 (1.16)   2.50 (1.16)  2.72 (1.21)
  ALICEBOT    2.78 (1.21)  2.67 (1.29)  2.90 (1.31)  2.65 (1.14)   2.80 (1.09)  2.88 (1.32)
  CiAPM       3.13 (1.09)  3.15 (1.11)  3.37 (1.09)  3.02* (0.94)  3.20 (0.94)  3.17 (0.95)
  RAPM-NOREL  3.00 (1.12)  2.82 (1.09)  3.10 (1.14)  3.23* (0.91)  3.03 (0.95)  2.92 (1.08)
  RAPM-REL    2.97 (1.21)  2.62 (1.17)  3.03 (1.20)  3.18* (0.89)  2.88 (1.05)  2.90 (1.23)
The evaluators were six male Japanese university students majoring in science (three undergraduates and three graduate students, all in their 20s), who were potential targets of a full-fledged tutoring system. The subjects were intermediate English learners with basic knowledge of English grammar and vocabulary, but with low proficiency, especially in speaking.

The six evaluators assessed the utterances generated by all five systems in response to each of 10 inputs chosen randomly from the test takers' utterances. (The examinees' utterances were originally separated from the interviewer's utterances in the corpus.) That is, each evaluator was given the same 50 utterances from the systems. The participants received the pairs of utterances in a fixed order; in contrast, in the earlier CoAPM evaluation (Section 3.2), the utterance pairs were presented in mixed order.

The system utterances were rated on a 5-point scale (where 1 means 'poor' and 5 represents 'excellent') in the following six categories:

  (A) Will to continue the conversation
  (B) Semantic naturalness of the dialogue
  (C) Appropriateness for English conversation practice
  (D) Vocabulary richness
  (E) Knowledge richness
  (F) Human-likeness of the system

These evaluation criteria were based on the benchmark used in related work [Higuchi et al., 2008]. However, by focusing on action phrases, the proposed methods are supposed to ensure a degree of grammatical naturalness in the utterances. Therefore, the original criterion "grammatical naturalness of dialogues" was changed to "appropriateness for English conversation practice", which is considered more important for evaluating English-teaching dialogue systems. In the "vocabulary richness" evaluation, we expected subjects to rate utterances on a scale from "laconic" to "wordy". Some of these criteria could be evaluated by specialists familiar with English education, or at least by native English speakers. However, at this stage of our project, we focus on the user experience of learners, who are easily bored with learning. Therefore, we set the criteria in terms of the user experience, expecting evaluation from the learners' standpoint. In the questionnaire, the criteria (without specific descriptions) were presented to the evaluators in Japanese.

Figure 6: Examples of outputs in the CiAPM / RAPM evaluation
  Input       "And a woman is playing piano."
  ELIZA       "Let's change focus a bit... Tell me about your family."
  ALICEBOT    "Aren't all woman playing piano?"
  CiAPM       "Let's talk about playing piano. What do you think about it?"
  RAPM-NOREL  "Learning something about music... why some people love it and some don't?"
  RAPM-REL    "I guess you were learning something about music - can you tell me how you did that?"

  Input       "And they enjoyed eating delicious food and alcohol."
  ELIZA       "How does that make you feel?"
  ALICEBOT    "Who, specifically?"
  CiAPM       "Let's talk about eating delicious food. What do you think about it?"
  RAPM-NOREL  "Talking about diminishing your own hunger... What is your opinion on that topic?"
  RAPM-REL    "I guess you were diminishing your own hunger - can you tell me how you did that?"

3.5 Results and Analysis (CiAPM / RAPM)

Table 5 shows the average scores and standard deviations of all evaluators in each criterion for the five systems (rated from 1 to 5). The Kendall's coefficient of concordance among the six raters was 0.369. One of the proposed methods, RAPM-NOREL, with templates not using relations, scored highest in "vocabulary richness" (criterion D) and second-highest in the other criteria. In all criteria except vocabulary richness, CiAPM achieved the highest score. The other proposed method, RAPM-REL, also achieved a high average score in vocabulary richness. According to the Steel-Dwass test (evaluated by the asymptotic method), the "vocabulary richness" (D) scores of our three methods significantly differed from the ELIZA score (p < 0.05), but no statistically significant differences were observed in the other criteria. Figure 6 shows some responses of each method to different input utterances.
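Both experiments report inter-rater agreement as Kendall's coefficient of concordance (0.48 and 0.369). A minimal computation, under the basic formula W = 12S / (m^2 (n^3 - n)) and without the tie correction that tied 5-point scores strictly call for, might look like this (function names are ours):

```python
def ranks(scores):
    """Convert one rater's scores into ranks, averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    r = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of tied rank positions
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def kendalls_w(ratings):
    """Kendall's W for m raters scoring the same n items.
    `ratings` is a list of m score lists, one per rater."""
    m, n = len(ratings), len(ratings[0])
    rank_matrix = [ranks(r) for r in ratings]
    totals = [sum(row[i] for row in rank_matrix) for i in range(n)]
    mean = sum(totals) / n
    s = sum((t - mean) ** 2 for t in totals)  # spread of rank sums
    return 12.0 * s / (m * m * (n ** 3 - n))
```

W ranges from 0 (no agreement) to 1 (perfect agreement); with many ties, as here, the uncorrected value slightly underestimates agreement.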
(D)” score of our three methods significantly differed from In the questionnaire, the criteria (without specific descrip- the ELIZA score (p < 0.05), but no statistically significant dif- tions) were presented to the evaluators in the Japanese lan- ferences were observed in the other criteria. Figure 6 shows guage. some responses of each method to different input utterances. 39 The result suggests that the input-related phrases from this observation extends to other cultural backgrounds and ConceptNet are useful to expand the vocabulary of the sys- individuals, broader experiments with more evaluators are tem, and hopefully that of interacting users. For instance, the needed. Furthermore, the evaluated conversations were very input “And a woman is playing piano.” elicited the responses short, limited to one-turn dialogue (a user’s utterance and “Learning something about music... why some people love it the corresponding system utterances). Whether the proposed and some don’t?” (RAPM-NONREL) and “Let’s talk about methods maintain users’ interest in an actual conversation playing piano. What do you think about it?” (CiAPM). The cannot be known at this stage. For this purpose, we must retrieved phrase ‘learning something about music’, which had evaluate a fully developed system on multiple turns of free a relation to the input phrase ‘playing piano’, and appears to conversation. In long conversations for second language ac- enrich the vocabulary over merely repeating the input phrase. quisition, a system that generates only repetitive utterances In RAPM-NONREL and CiAPM, the criterion “vocabulary would bore users. The wide vocabulary of RAPM, providing richness” was rated 4 by 6/6 and 2/6 evaluators, respectively. related topics to user utterances, could potentially mitigate This example indicates the potential usefulness of expanding conversational deadlocks. 
Thus, combining the two methods (one producing repetitive utterances, the other introducing related topics) might be more efficient for language tutoring tasks.

However, when the action phrases from a user input are inserted into the system output, the utterances may sound more natural, as demonstrated in the following example. The input “And they enjoyed eating delicious food and alcohol.” brought the outputs “Let’s talk about eating delicious food. What do you think about it?” (CiAPM) and “Talking about diminishing your own hunger... What is your opinion on that topic?” (RAPM-NONREL). In this case, a discussion about human needs, suggested by relating ‘diminishing your own hunger’ to ‘eating delicious food’, would be a good topic for a deeper conversation. However, the preferred conversational topic depends on the user, his or her interests, and his or her English level. For this reason, the repeating method (CiAPM) is considered to score above the other methods on average in all criteria except vocabulary richness.

We presume that the related concepts in ConceptNet are not always compatible with the dialogue context. In such cases, the responses are unsuited to the user’s needs. This could be partly attributable to the random selection of the related concepts. To avoid wandering away from the subject of the conversation, the related phrases must be carefully chosen to suit the context and the individual user, especially when applying phrases together with their relations. In future work, the random selection must be replaced by a context processing module, a user profiler, and a language level estimator. A context processing module could select proper phrases by semantic analysis. Considering the ambiguity of multi-word expressions, detecting phrases after applying a topic model such as latent Dirichlet allocation might be useful for this purpose. In addition, complete reliance on ConceptNet, which lacks knowledge of some items and includes dubious entries, is also problematic.

4 Conclusion and Future Works

We proposed methods that automatically generate utterances for an English language tutor, and compared their performance with that of classic chatbots. Specifically, we evaluated how the generated expressions were received by Japanese subjects. Although our small-scale experiment does not yet allow drawing conclusions about the stickiness level of these approaches, we found that ELIZA-like outputs offer more encouragement to users than Web- or commonsense-based approaches. These findings oppose those of [Rzepka et al., 2005], who evaluated non-learning dialogues. In enriching the vocabulary of the system utterances, the proposed methods showed their superiority, which could potentially help improve users’ command of a foreign language.

However, using external corpora or crowd-sourced knowledge sources might incur serious drawbacks. Allowing the tutor excessive freedom, especially in learning material beyond the preferences of the user, risks misuse, as has occurred with Microsoft’s Tay and other chatbots [Michael, 2016]. In our approach, the adoption of hand-crafted syntactic rules seems to be the only restriction, but through majority voting in both the British National Corpus-based and the ConceptNet-based methods we indirectly try to avoid semantic strangeness. This does not mean that corpora guarantee safe communication, and some topic restrictions might be needed from the outset. However, completely blocking slang and offensive words can be problematic, especially when considering the more sophisticated personality modeling required in longer-term conversational sessions.
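The two generation strategies compared above (CiAPM repeating the user’s own action phrase, RAPM-NONREL substituting a phrase retrieved from ConceptNet) can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the hard-coded dictionary stands in for ConceptNet retrieval, and the function names and the fallback to repetition are our assumptions.

```python
import random

# Stand-in for ConceptNet retrieval: phrases connected to the
# input action phrase by some relation (illustrative entries only).
RELATED = {
    "playing piano": ["learning something about music"],
    "eating delicious food": ["diminishing your own hunger"],
}

# Template shapes follow the example outputs quoted in the analysis.
CIAPM_TEMPLATE = "Let's talk about {phrase}. What do you think about it?"
RAPM_TEMPLATE = "Talking about {phrase}... What is your opinion on that topic?"

def ciapm(action_phrase):
    # Repetition-based response: reuse the user's own action phrase.
    return CIAPM_TEMPLATE.format(phrase=action_phrase)

def rapm_nonrel(action_phrase):
    # Topic-expanding response: substitute a retrieved related phrase,
    # currently chosen at random (the limitation discussed above).
    candidates = RELATED.get(action_phrase)
    if not candidates:
        return ciapm(action_phrase)  # assumed fallback to repetition
    return RAPM_TEMPLATE.format(phrase=random.choice(candidates))

print(ciapm("eating delicious food"))
print(rapm_nonrel("eating delicious food"))
```

In the real system the candidate list would come from ConceptNet relations, and, as argued above, the random choice should eventually give way to context-aware selection.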
As wrong inputs were not corrected, the open-source checker found no mistakes. We might require a more powerful error correction approach. For error detection and correction suggestions, a promising solution is a Grammatical Error Correction (GEC) system based on the Neural Machine Translation (NMT) approach [Yuan, 2017]. In the experiments of [Yuan, 2017], the NMT-based GEC system outperformed the SMT (Statistical Machine Translation)-based system even on the difficult subject-verb agreement problem.

From these results we can assume that repetition for confirmation plays an important part in conversation practice by Japanese learners of English.

As the next step, we plan to combine our method with language level estimation and vocabulary acquisition support algorithms [Mazur, 2016]. Error corrections could be improved with annotated data11, taking into account that Japanese students often make non-word spelling errors (producing non-existent spellings) [Nagata and Neubig, 2017]. Although our dialogue system is not yet ready for long-run conversational sessions, we should experiment on the tutor’s autonomy level in choosing topics related to the user’s input prior to larger-scale testing. We plan to analyze which outputs are potentially harmful, and to determine appropriate countermeasures against such expressions.

11 http://www.gsk.or.jp/catalog/gsk2016-b/

Acknowledgements

The authors would like to thank DENSO CORPORATION for funding this work. We are grateful to the anonymous reviewers for their detailed comments and helpful suggestions.
References

[Chen, 2014] Yi-Cheng Chen. An empirical examination of factors affecting college students’ proactive stickiness with a web-based English learning environment. Computers in Human Behavior, 31:159–171, 2014.

[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.

[Doyon, 2000] Paul Doyon. Shyness in the Japanese EFL class: Why it is a problem, what it is, what causes it, and what to do about it. The Language Teacher, 24(1):11–16, 2000.

[Ferschke et al., 2013] Oliver Ferschke, Johannes Daxenberger, and Iryna Gurevych. The People’s Web meets NLP. In Theory and Applications of Natural Language Processing, pages 121–160. Springer, 2013.

[Finkel et al., 2005] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL ’05), pages 363–370, Morristown, NJ, USA, 2005. Association for Computational Linguistics.

[Higuchi et al., 2008] Shinsuke Higuchi, Rafal Rzepka, and Kenji Araki. A casual conversation system using modality and word associations retrieved from the Web. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’08), pages 382–390, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.

[Jia, 2009] Jiyou Jia. CSIEC: A computer assisted English learning chatbot based on textual knowledge and reasoning. Knowledge-Based Systems, 22(4):249–255, 2009.

[Lee, 2016] Peter Lee. Learning from Tay’s introduction. https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/, 2016. (accessed May 8, 2017).

[Lison and Tiedemann, 2016] Pierre Lison and Jörg Tiedemann. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), 2016.

[Loewen, 2007] Shawn Loewen. Error correction in the second language classroom. Clear News, 11(12):1–7, 2007.

[Mazur, 2016] Michal Mazur. A Study on English Language Tutoring System Using Code-Switching Based Second Language Vocabulary Acquisition Method. PhD thesis, Hokkaido University, 2016. https://eprints.lib.hokudai.ac.jp/dspace/handle/2115/61833.

[Michael, 2016] Katina Michael. Science fiction is full of bots that hurt people: ... but these bots are here now. IEEE Consumer Electronics Magazine, 5(4):112–117, 2016.

[Nagata and Neubig, 2017] Ryo Nagata and Graham Neubig. Construction of a Japanese EFL learner corpus for a study of spelling mistakes (in Japanese). In Proceedings of the Twenty-third Annual Meeting of the Association for Natural Language Processing, pages 1030–1033, 2017.

[Rzepka et al., 2005] Rafal Rzepka, Yali Ge, and Kenji Araki. Naturalness of an utterance based on the automatically retrieved commonsense. In Proceedings of IJCAI 2005, the Nineteenth International Joint Conference on Artificial Intelligence, pages 996–998, Edinburgh, Scotland, August 2005.

[Speer and Havasi, 2012] Robert Speer and Catherine Havasi. Representing general relational knowledge in ConceptNet 5. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May 2012.

[Tedick, 1986] Diane J. Tedick. Research on error correction and implications for classroom. ACIE Newsletter, 1986.

[Tiedemann, 2012] Jörg Tiedemann. Parallel data, tools and interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), pages 2214–2218, 2012.

[Toutanova and Manning, 2000] Kristina Toutanova and Christopher D. Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, volume 13, pages 63–70, Morristown, NJ, USA, 2000. Association for Computational Linguistics.

[Toutanova et al., 2003] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL ’03), volume 1, pages 173–180, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[Vinyals and Le, 2015] Oriol Vinyals and Quoc Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.

[Weizenbaum, 1966] Joseph Weizenbaum. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45, 1966.

[Yuan, 2017] Zheng Yuan. Grammatical error correction in non-native English. Technical report, University of Cambridge, Computer Laboratory, 2017.