Refinement of utterance database and concatenation of utterances for enhancing system utterances in chat-oriented dialogue systems

Yuiko Tsunomori¹, Ryuichiro Higashinaka², Takeshi Yoshimura¹
¹NTT DOCOMO, ²NTT Corporation
¹{yuiko.tsunomori.fc, yoshimurat}@nttdocomo.com, ²higashinaka.ryuichiro@lab.ntt.co.jp

Abstract

We have been using an utterance database created from a massive amount of predicate-argument structures extracted from the web for generating utterances of our commercial chat-oriented dialogue system. However, since the creation of this database involves several automated processes, the database often includes non-sentences (ungrammatical or uninterpretable sentences) and utterances with inappropriate topic information (called off-focus utterances). Also, utterances tend to be monotonous and uninformative because they are created from single predicate-argument structures. To tackle these problems, we propose methods for filtering non-sentences by using neural-network-based methods and for filtering utterances inappropriate for their associated foci by using co-occurrence statistics. To reduce monotony, we also propose a method for concatenating automatically generated utterances so that the utterances can be longer and richer in content. Experimental results indicate that our non-sentence filter can successfully remove non-sentences with an accuracy of 95% and that we can filter utterances inappropriate for their foci with high recall. We also examined the effectiveness of our filtering and concatenation methods through an experiment involving human participants. The experimental results show that our methods significantly outperformed the baseline in terms of understandability and that the concatenation of two utterances leads to higher familiarity and content richness while retaining understandability.

1 Introduction

Chat-oriented dialogue systems have become increasingly popular [1; 2; 3; 4; 5]. Such systems need to generate a wide variety of utterances to cope with the many topics contained in user utterances. Although rule-based methods have typically been used to generate system utterances, the topics that appear in chats are diverse, and it is extremely expensive to create rules with adequate coverage [6].

To overcome this weakness, Higashinaka et al. [7] proposed a method that uses a large volume of text data on the web to extract predicate-argument structures (PASs) and convert them into utterances. The result of this method is a database of utterances with their associated topics, called foci (see Section 3 for details). We are using the utterance database created by this method in our commercial chat-oriented dialogue system (https://dev.smt.docomo.ne.jp/) [1].

Although this method can generate utterances corresponding to a variety of foci by exploiting the richness of the web, the system utterances have the following problems:

• Because of errors resulting from the automatic analysis of PASs and their automatic conversion into utterances, non-sentences (ungrammatical or uninterpretable sentences) and utterances inappropriate for their associated foci (called off-focus utterances) can sometimes be generated.
• The system utterances tend to be monotonous and uninformative because they are created from single PASs.

In this paper, we propose methods for improving the quality of the utterance database created by using Higashinaka et al.'s method [7] and for reducing the monotony of system utterances. In particular, our methods filter non-sentences and off-focus utterances using neural-network-based methods and co-occurrence statistics. We also propose a method of reducing monotony by concatenating pairs of automatically generated utterances about the same focus so that the utterances can be longer and richer in content. We verified the effectiveness of our methods through an experiment involving human participants. Our contributions are as follows:

• We successfully created non-sentence and off-focus filters that can greatly refine the utterance database created from PASs on the web. In terms of utterance quality, we observed significant improvements regarding familiarity, understandability, and content richness in subjective evaluations. By using our methods, the utterances of the database can be safely used by chat-oriented dialogue systems.
• We found that, by concatenating two utterances about the same focus from the utterance database, we can create utterances that are significantly better in terms of familiarity and content richness. We confirmed that this effect is brought about only when we use the utterance database refined by the non-sentence and off-focus filters.
We believe our proposed methods can especially contribute to commercial chat-oriented dialogue systems, in which the quality of utterances is critical.

The paper is structured as follows. In Section 2, we cover related work. In Section 3, we explain our PAS-based utterance database and examine the proportions of non-sentences and utterances inappropriate for their associated foci. In Section 4, we explain our proposed methods for filtering inappropriate utterances and our utterance-concatenation method. In Section 5, we explain our experiment involving human participants. Finally, we summarize the paper and discuss future work in Section 6.

2 Related work

Various methods have been proposed to generate utterances in chat-oriented dialogue systems, such as rule-, retrieval-, and generation-based methods.

Rule-based methods generate system utterances on the basis of hand-crafted rules. Representative systems that use such rules are ELIZA [8] and A.L.I.C.E. [9]. However, the topics that appear in chat are diverse, and it is extremely expensive to hand-craft rules with wide coverage [6].

Retrieval-based methods have been proposed to improve coverage. The recent increase in web data has propelled the development of methods that use data retrieved from the web for open-domain conversation [10; 11; 2]. The advantage of such retrieval-based methods is that, owing to the diversity of the web, systems can retrieve at least some responses for any user input, which can solve the coverage problem. However, this comes at the cost of utterance quality. Since the web is inherently noisy, it is, in many cases, difficult to sift out appropriate sentences from retrieval results.

Recently, generation-based methods based on neural networks have been extensively researched. However, these methods generally tend to generate utterances with little content, although there has been research on improving the diversity of generated utterances [12; 13]. We acknowledge that current neural-network-based methods are yielding promising results. However, we use an utterance database created from PASs on the web [7] because it is guaranteed to output system utterances with content related to the focus of the conversation and because system utterances can be more controllable, which is particularly important for commercial applications.

The detection of inappropriate utterances, including non-sentences, is related to the detection of grammatical errors made by second-language learners. Imaeda et al. [14] proposed a dictionary-based method for detecting case particle errors by using a lexicon, Oyama et al. [15] proposed a support vector machine (SVM)-based method for detecting case particle errors in documents created by non-native Japanese speakers, and Imamura et al. [16] proposed a method for detecting all types of particle errors. However, these methods cannot be directly applied to the utterances of dialogue systems since the error tendency of automatically generated utterances differs from that of second-language learners. The detection of inappropriate utterances has also been tackled in the dialogue breakdown detection challenges (DBDCs) [17; 18]. However, their main focus is on detecting inappropriate utterances in the context of a dialogue, whereas we focus on refining an utterance database. Inaba et al. [19] proposed a monologue-generation method for non-task-oriented dialogue systems that concatenates sentences extracted from Twitter. This is similar to our concatenation method in that it concatenates utterances to reduce monotony but different in that it targets monologues rather than dialogues.
3 PAS-based utterance database

We first describe the construction and details of the utterance database of our chat-oriented dialogue system. Then, to illustrate the problems with the database, we examine the proportions of non-sentences and off-focus utterances.

3.1 Creation of the utterance database

We use the utterance database created with the method described by Higashinaka et al. [7]. The method uses PAS analysis [16] to extract PASs with their foci from a large amount of text data. To extract high-quality PASs and their foci, the method extracts predicates with just two arguments explicitly marked with the particles 'wa' and 'ga.' 'Wa' is a topic marker and 'ga' is a nominative case marker in Japanese. In this way, a subject and a predicate can be extracted as constituents of a PAS together with a focus.

Since PASs cannot be uttered as they are, they need to be converted into utterances. Given a PAS and a dialogue-act type (we need this as input because utterances require underlying intentions; dialogue-act types are described below), an utterance is automatically created. The PASs are first converted into declarative sentences using a simple rule. Then, their sentence-end expressions (in Japanese, modalities are mostly expressed by sentence-end expressions) are swapped with those matching the target dialogue-act type. The sentence-end expressions used are those automatically mined from dialogue-act-annotated dialogue data. The details of the method of obtaining and swapping sentence-end expressions are given by Miyazaki et al. [20].

From the list of 32 dialogue-act types [21], the 21 that are mainly related to self-disclosure and questions are used for conversion. From blog data (about three years' worth of blog articles) and the combination of the extracted PASs and the dialogue-act types, the resulting utterance database contains 7,116,597 utterances associated with 204,497 foci.
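To make the conversion pipeline concrete, the following is a minimal sketch of the PAS-to-utterance step. The PAS fields, the rule for building the declarative sentence, and the sentence-end expression table are simplified illustrations; the actual rules and the mined expressions are described in [7] and [20].

```python
from dataclasses import dataclass

# Hypothetical sentence-end expressions per dialogue-act type, standing in
# for the expressions mined from dialogue-act-annotated data [20].
SENTENCE_END = {
    "self-disclosure": "ですね",  # declarative ending
    "question": "ですか?",        # interrogative ending
}

@dataclass
class PAS:
    focus: str      # topic argument marked with 'wa' (e.g., 秋冬)
    subject: str    # nominative argument marked with 'ga' (e.g., ブーツ)
    predicate: str  # predicate (e.g., 多い)

def pas_to_utterance(pas: PAS, dialogue_act: str) -> str:
    # Step 1: convert the PAS into a declarative sentence with a simple
    # rule of the form "<focus> wa <subject> ga <predicate>".
    declarative = f"{pas.focus}は{pas.subject}が{pas.predicate}"
    # Step 2: swap in a sentence-end expression for the target dialogue act.
    return declarative + SENTENCE_END[dialogue_act]

print(pas_to_utterance(PAS("秋冬", "ブーツ", "多い"), "question"))
```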
3.2 Quality of utterance database

Since the PASs are extracted and converted into utterances automatically, errors in the resulting utterances are inevitable, affecting the quality of the utterance database. From our observation, there can be two types of erroneous utterances: non-sentences and off-focus utterances.

Non-sentences: Sentences that we cannot understand due to grammatical errors or a strange combination of words. Non-sentences are generated mainly in the conversion of sentence-end expressions; some propositions cannot be uttered with certain sentence-end expressions in Japanese (see [22] for such examples).

Off-focus utterances: Utterances inappropriate for their associated foci. Although the utterances in the database are created from PASs in which the focus and subject are explicitly marked by the topic marker and case marker, respectively, the focus and content of an utterance are often not closely associated. This occurs when there is an error in the PAS analysis or when the meaning of the focus is just too broad or vague.

We investigated the current quality of the database in terms of how many non-sentences and off-focus utterances it contains. For this purpose, we performed the annotations described below.

Non-sentence annotation

We randomly sampled 200,000 utterances from the utterance database. The annotators labeled each utterance with the following instructions:

• If you think the utterance is a non-sentence, label it 0.
• If you do not think the utterance is a non-sentence (i.e., it is a valid sentence), label it 1.

A total of 24 annotators participated; two annotators were randomly assigned to each utterance. Cohen's κ value, which assesses the agreement between the two annotators, was 0.56, indicating an intermediate degree of agreement. Table 1 lists annotation examples, and Table 2 gives the annotation breakdown. Non-sentences accounted for 12% of the database. Hereafter, we call the non-sentence annotation data on which the annotators agreed "the non-sentence corpus" (containing 174,007 utterances).

Table 1: Examples of non-sentence annotation (0: non-sentence, 1: valid sentence). A1 and A2 indicate the labels given by the two different annotators. Utterances were originally in Japanese; the English translations in parentheses were done by the authors.

Focus | Utterance | A1 | A2
秋冬 (Fall & winter) | どんなんが流行りますよね (What kind of types is popular, isn't it?) (NB: this sentence sounds odd because its subject is an interrogative while the sentence is declarative.) | 0 | 0
秋冬 (Fall & winter) | レギンス男子が増えてますねぇ (Boys wearing leggings are increasing, aren't they?) | 1 | 1
秋冬 (Fall & winter) | 空気が乾燥したりとかです (Air is dry and so on.) | 1 | 0

Table 2: Statistics of non-sentence annotation (0: non-sentence, 1: valid sentence)

                                          # of utterances   Percentage
  2 annotators labeled 0                           23,052          12%
  2 annotators labeled 1                          150,955          75%
  1 annotator labeled 0, other labeled 1           25,993          13%
  Total                                           200,000         100%

Focus annotation

Using the utterances annotated as valid sentences in the non-sentence corpus (i.e., 150,955 utterances), two annotators labeled whether the utterances were appropriate to their foci. The annotators were shown pairs of a focus and an utterance and labeled each pair with the following instructions:

• If you feel the combination of utterance and focus is unnatural, label it 0 (off-focus).
• If you feel the combination of utterance and focus is natural, label it 1 (on-focus). When the focus has multiple meanings, if there is at least one reasonable interpretation, label the combination 1.

A total of 24 annotators participated; pairs of annotators were randomly selected for labeling the pairs of a focus and an utterance. Cohen's κ value was 0.32, which indicates a reasonable degree of agreement considering the subjective nature of judging naturalness. Table 3 shows examples of this annotation, and Table 4 gives the annotation breakdown. Utterances inappropriate for their associated foci accounted for 5% of the database. Hereafter, we call the focus annotation data on which the annotators agreed "the focus corpus" (containing 129,039 utterances).

Table 3: Examples of focus annotation (0: off-focus, 1: on-focus). A1 and A2 indicate the labels given by the two different annotators.

Focus | Utterance | A1 | A2
秋冬 (Fall & winter) | 単価が高いんですか? (Is the unit price high?) | 0 | 0
秋冬 (Fall & winter) | ブーツが多いのでしょうか? (Are there a lot of boots?) | 1 | 1
秋冬 (Fall & winter) | 空気が澄んでるんですかね? (Is the air clear?) | 1 | 0

Table 4: Statistics of focus annotation (0: off-focus, 1: on-focus)

                                          # of utterances   Percentage
  2 annotators labeled 0                            7,528           5%
  2 annotators labeled 1                          121,511          80%
  1 annotator labeled 0, other labeled 1           21,916          15%
  Total                                           150,955         100%
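For reference, the agreement values reported above can be computed directly from the paired labels. A minimal sketch with toy stand-in labels (the real inputs are the 0/1 annotations described above):

```python
from sklearn.metrics import cohen_kappa_score

# Toy stand-in labels from annotators A1 and A2
# (1 = valid sentence / on-focus, 0 = non-sentence / off-focus).
a1 = [1, 1, 0, 1, 0, 1, 1, 0]
a2 = [1, 0, 0, 1, 0, 1, 1, 1]

print(f"Cohen's kappa: {cohen_kappa_score(a1, a2):.2f}")
```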
4 Proposed methods

We found that 12% of the utterances in our database are non-sentences and 5% are inappropriate for their associated foci. Since this means the system utterances can often be erroneous, we need to reduce these utterances to improve the quality of our database. We also see it as a problem that the utterances in our database are monotonous and uninformative because they were generated from single PASs.

In this paper, we propose methods of filtering non-sentences and off-focus utterances to refine the database. We also propose a method of concatenating pairs of utterances about the same focus to reduce the monotony of system utterances.

4.1 Method for creating non-sentence filter

Since the detection of non-sentences can be regarded as a sentence-classification task, we created a non-sentence filter by using machine-learning methods. We used standard machine-learning methods for sentence classification, namely an SVM and the neural-network-based methods that have been extensively used in recent years. We used the following methods for training our classifiers (we used scikit-learn (http://scikit-learn.org/) for the SVM and Chainer (http://chainer.org/) for the MLP, CNN, and LSTM); a sketch of the simplest configuration follows the list:

SVM: We train an SVM classifier with a linear kernel. The features are the averaged word vectors of the words contained in an utterance. We use the pretrained 200-dimensional word vectors provided by Suzuki et al. [23]. We use the same pretrained word vectors for the MLP, CNN, and LSTM, which we describe below.

Multi-layer perceptron (MLP): We train a classifier with an MLP. It has five layers: the input layer, three non-linear layers (each with 200 units) with sigmoid activation, and the output layer. We use averaged word vectors as input. The output layer makes a binary decision with a softmax function.

Convolutional neural network (CNN): We train a classifier with a CNN. It has an input layer, a convolutional layer, a pooling layer, and an output layer. The model structure is the same as that used by Kim [24]. A filter of size 200 × 3 is used for convolution, with a stride of one and ReLU as the activation function. The max-pooling layer uses a window size of three to output a fixed-length vector. The output layer makes a binary decision with a softmax function.

Long short-term memory (LSTM): We train a classifier with an LSTM. It has an input layer, an LSTM layer with 200 units, three hidden layers, and an output layer. Each word is converted into an embedding, and the sequence of word embeddings is converted into a hidden representation corresponding to a sentence vector. This vector is fed to three non-linear layers (each with 200 units) with sigmoid activation, the output of which is input to the output layer, which makes a binary decision with a softmax function.
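As an illustration of the feature extraction shared by the SVM and MLP, here is a minimal sketch of the linear-SVM configuration. The vector-file format and helper names are placeholders, not the exact setup used with the pretrained vectors of Suzuki et al. [23].

```python
import numpy as np
from sklearn.svm import LinearSVC

EMB_DIM = 200  # dimensionality of the pretrained word vectors [23]

def load_word_vectors(path):
    """Load pretrained word vectors (placeholder: any word2vec-style text
    file mapping a word to EMB_DIM floats would do)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vals = line.rstrip().split()
            vectors[word] = np.array(vals, dtype=np.float32)
    return vectors

def utterance_vector(tokens, vectors):
    """Average the vectors of the words in an utterance (OOV words skipped)."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(EMB_DIM, dtype=np.float32)

def train_filter(train_tokens, train_labels, vectors):
    """train_tokens: tokenized utterances; train_labels: 0 = non-sentence,
    1 = valid sentence."""
    X = np.stack([utterance_vector(t, vectors) for t in train_tokens])
    clf = LinearSVC()  # linear-kernel SVM
    clf.fit(X, train_labels)
    return clf
```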
4.2 Method for creating focus filter

To filter out off-focus utterances, we use co-occurrence statistics, namely, the point-wise mutual information (PMI) between the subject of an utterance and its focus. We use PMI because it has been successfully used to filter sentences unrelated to topics [25]. We calculate the PMI with the following equation:

PMI(S, F) = log2 [ (count(S, F)/N) / ((count(S)/N) * (count(F)/N)) ],   (1)

where S is a subject; F is a focus; 'count' is a function that returns the number of documents containing S, F, or both; and N is the total number of documents in the text database. We use a sentence as the document unit.

If the PMI value is below a certain threshold, we filter out the utterance because the association between subject and focus can be considered low. The threshold can be determined experimentally; that is, we find the threshold that produces the best accuracy on training/development data. Note that the best accuracy depends on the objective. If we want the resulting database to be as clean as possible, we can set a high threshold. If we do not want to lose much data, the threshold can be set lower. In this study, we set the target recall for detecting off-focus utterances to 80% because we want most off-focus utterances removed. We determine the threshold that achieves this recall on the training/development data and use it for filtering possible off-focus utterances.

Note that an appropriate text database must be chosen for calculating the PMI. We consider using Wikipedia (containing roughly 8M sentences) and blogs (we use one year's worth of blogs containing about 2B sentences). The former is smaller but more informational. The latter is larger but noisy and a mixture of content of varying quality. We verify which one is more useful in a later experiment, although we naturally assume that blog data are more suitable because they have more variety, which is a requirement for chat-oriented dialogue systems.
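A minimal sketch of Eq. (1) and the filtering decision, assuming the per-term and per-pair document (sentence) counts over the reference corpus have been precomputed into dictionaries (the names are illustrative):

```python
import math

def pmi(count_sf: int, count_s: int, count_f: int, n_docs: int) -> float:
    """PMI between subject S and focus F from document counts (Eq. 1)."""
    if count_sf == 0 or count_s == 0 or count_f == 0:
        return float("-inf")  # no (co-)occurrence: lowest possible association
    return math.log2((count_sf / n_docs) /
                     ((count_s / n_docs) * (count_f / n_docs)))

def keep_utterance(subject, focus, term_counts, pair_counts, n_docs, threshold):
    """Keep the utterance only if the subject-focus association is high enough
    (e.g., a threshold of 2.8 when the counts come from blog data; Section 5.2)."""
    score = pmi(pair_counts.get((subject, focus), 0),
                term_counts.get(subject, 0),
                term_counts.get(focus, 0),
                n_docs)
    return score >= threshold
```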
4.3 Utterance concatenation

As one solution for reducing monotony, we propose a method of concatenating pairs of automatically generated utterances about the same focus so that the utterances can be longer and richer in content. More specifically, we propose concatenating two random utterances that have the same focus.

Although this approach may seem simplistic, it can be effective because, at the very least, it increases the utterance length of the system. Note that it is not trivial to create a reasonable utterance by concatenating two utterances. It has been shown that implicit discourse relations are still hard to detect [26], which means that utterances that will be coherent in terms of discourse are difficult to select accurately. That said, we believe our simple concatenation method may just work because the concatenated utterance will satisfy local coherence [27] through the same underlying entity (i.e., the focus).
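The concatenation step itself is deliberately simple. A minimal sketch, assuming the filtered database has been indexed as a mapping from each focus to its surviving utterances (the data structure is our illustration, not a prescribed format):

```python
import random

def concatenate_for_focus(focus_to_utterances, focus, rng=random):
    """Randomly pick two distinct utterances sharing the same focus and join
    them into one longer system utterance; fall back to a single utterance
    when fewer than two candidates survive filtering."""
    candidates = focus_to_utterances.get(focus, [])
    if len(candidates) < 2:
        return candidates[0] if candidates else None
    first, second = rng.sample(candidates, 2)
    return f"{first} {second}"
```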
5 Evaluation

We first individually evaluated the performance of our non-sentence and focus filtering methods and then conducted a subjective evaluation involving human participants on the filtered and concatenated utterances.

5.1 Evaluation of our non-sentence filtering methods

We trained a non-sentence filter by using the non-sentence corpus (see Section 3.2). We split the data into training, development, and test sets corresponding to 3,837, 500, and 500 foci, respectively. We trained the classifiers on the training data and evaluated accuracy on the test data, using the model that yielded the highest F-measure on the development data (note that for the SVM, we used the training data for training and the test data for evaluation; we did not use the development data). The classification results are listed in Table 5. Our method successfully detected non-sentences with high accuracy. The LSTM model had the highest accuracy (0.95) and F-measure (0.83), probably because the determination of non-sentences depends on the sequence of words, which is best captured by recurrent models.

Table 5: Accuracy, precision, recall, and F-measure for the detection of non-sentences (* marks the top score for each evaluation criterion).

Method   Accuracy   Precision   Recall   F-measure
SVM      0.93       0.81        0.71     0.76
MLP      0.90       0.63        0.84*    0.72
CNN      0.94       0.86        0.73     0.79
LSTM     0.95*      0.88*       0.78     0.83*

5.2 Evaluation of our focus filtering method

We split the focus corpus (see Section 3.2) into 80% training data and 20% test data. We first calculated the PMI values between the subjects and foci of all utterances by using the training data. Then, we looked for the PMI threshold that achieved 80% recall for off-focus utterances through a grid search.

When we used Wikipedia as the data for the PMI calculation, we obtained a threshold of 2.2; when we used the blog data, the threshold was 2.8. Figures 1 and 2 show the changes in precision, recall, and F-measure as the threshold is changed in intervals of 0.1. Table 6 shows the precision, recall, and F-measure for off-focus/on-focus utterances on the training and test data when the thresholds of 2.2 and 2.8 are used for Wikipedia and the blog data, respectively. As expected, the use of blog data yielded much better results, with higher precision/recall for on-focus utterances at the point of 80% recall for off-focus utterances. The results indicate that our off-focus filter can successfully filter utterances that are not associated with their foci (off-focus utterances).

Table 6: Precision, recall, and F-measure for off-focus/on-focus utterances on the training and test data when thresholds of 2.2 and 2.8 are used for Wikipedia and blog data, respectively.

                            Precision   Recall   F-measure
Wikipedia
  train      off-focus      0.09        0.82     0.16
             on-focus       0.98        0.49     0.65
  test       off-focus      0.09        0.80     0.16
             on-focus       0.97        0.42     0.59
Blog data
  train      off-focus      0.12        0.81     0.20
             on-focus       0.98        0.62     0.76
  test       off-focus      0.13        0.81     0.23
             on-focus       0.98        0.64     0.77

[Figure 1: Changes in precision, recall, and F-measure for off-focus and on-focus utterances as the PMI threshold (x-axis, 0.0-4.8) is changed in intervals of 0.1, with Wikipedia used for the PMI calculation.]

[Figure 2: Changes in precision, recall, and F-measure for off-focus and on-focus utterances as the PMI threshold (x-axis, 0.0-4.8) is changed in intervals of 0.1, with blog data used for the PMI calculation.]
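The grid search described above amounts to taking the smallest threshold whose off-focus recall reaches the target, so that as few on-focus utterances as possible are discarded. A minimal sketch under that reading (function and variable names are ours):

```python
import numpy as np

def pick_threshold(pmi_scores, is_off_focus, target_recall=0.80, step=0.1):
    """Return the smallest PMI threshold (utterances with PMI below it are
    flagged off-focus) whose recall on true off-focus utterances reaches
    the target; None if the target is never reached on the grid."""
    pmi_scores = np.asarray(pmi_scores, dtype=float)
    is_off_focus = np.asarray(is_off_focus, dtype=bool)
    for threshold in np.arange(0.0, 5.0, step):  # grid matching Figures 1 and 2
        flagged = pmi_scores < threshold
        recall = (flagged & is_off_focus).sum() / is_off_focus.sum()
        if recall >= target_recall:
            return float(threshold)
    return None
```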
5.3 Subjective evaluation

We conducted a subjective evaluation involving human participants to verify the effectiveness of our non-sentence and focus filtering methods as well as our concatenation method (see Section 4.3).

Evaluation procedure

Four participants took part in the evaluation. We made each of the eight methods for comparison (see below for details) generate utterances for 100 randomly selected foci, resulting in 800 utterances (8 methods × 100 foci) for use in the experiment. The utterances were randomly shuffled and presented to the participants. Each participant rated the 800 utterances in terms of familiarity, understandability, and content richness (we describe these criteria later).

Methods for comparison

We compared the following eight methods, (a)-(h). Note that for non-sentence filtering, we used the LSTM model, which showed the best performance in our experiment, and for focus filtering, we used the PMI threshold of 2.8 calculated with the blog data.

(a) Random (Single): Baseline. We randomly select a single utterance from the utterance database.

(b) Random (Pair): Proposed. We randomly select two utterances from the utterance database and concatenate them to create a system utterance.

(c) NS-filtered (Single): Proposed. We randomly select one utterance from the test data of the non-sentence corpus that was classified as a valid sentence by the non-sentence filter.

(d) NS-filtered (Pair): Proposed. We randomly select two utterances from the test data of the non-sentence corpus that were classified as valid sentences by the non-sentence filter. We then concatenate these utterances to create a system utterance.

(e) NS+F-filtered (Single): Proposed. We randomly select one utterance from the test data of the non-sentence corpus that was classified as a valid sentence by the non-sentence filter and as on-focus by the focus filter.

(f) NS+F-filtered (Pair): Proposed. We randomly select two utterances from the test data of the non-sentence corpus that were classified as valid sentences by the non-sentence filter and as on-focus by the focus filter. We then concatenate these utterances to create a system utterance.

(g) Gold NS (Single): We randomly select one utterance annotated as a valid sentence in the test data of the non-sentence corpus.

(h) Gold F (Single): We randomly select one utterance annotated as on-focus in the test data of the focus corpus.

Random (Single) is the baseline, which is our current method of just using a single utterance for a given focus from the utterance database. Table 7 lists example utterances generated by the eight methods.

Table 7: Example utterances generated by the eight methods used in the subjective evaluation

Method | Focus | Utterance
(a) Random (Single) | カルボナーラ (Carbonara) | カルボナーラはパスタがいいたいですか? (Does carbonara want to say pasta?) (NB: this is a non-sentence; the inanimate subject "carbonara" cannot be the subject of "say.")
(b) Random (Pair) | 視力 (Eyesight) | 視力は出ないってことがわかりますねぇ 視力は右が下がります?? (We understand that your eyesight is not good. Has the sight of your right eye decreased?)
(c) NS-filtered (Single) | マフラー (Scarf) | マフラーはバーバリーマフラーが欲しいですね (I want a Burberry scarf.)
(d) NS-filtered (Pair) | 水曜 (Wednesday) | 水曜は授業が終わってますか?水曜は授業が入ってないんです (Has Wednesday's class ended? There is no class on Wednesday.)
(e) NS+F-filtered (Single) | バナナ (Banana) | バナナはおいしいのが多いですね (Bananas are generally delicious, aren't they?)
(f) NS+F-filtered (Pair) | 夕食 (Dinner) | 夕食は和食が食べたいんですって 夕食は鍋がいいですね (Somebody wants to have Japanese food for dinner. Japanese stew should be fine.)
(g) Gold NS (Single) | ワンコ (Doggy) | ワンコは耳がいいですよね (Doggies have good ears, don't they?)
(h) Gold F (Single) | 観光客 (Tourist) | 観光客は欧米人が多いとかですか? (Are there many tourists from Europe and the US?)

Evaluation criteria

Sugiyama et al. [29] used the semantic differential (SD) method to derive dimensions for evaluating utterances in chat-oriented dialogue systems. They identified three dimensions, which we used in our evaluation. The evaluation criteria, together with the statements used in the evaluation, were as follows:

• Familiarity: You feel familiar with the system and want to talk with it more.
• Content richness: You feel that the utterance is interesting and informative.
• Understandability: You feel that the utterance is natural and easy to understand.

Each participant rated their level of agreement with the above statements on a Likert scale from 1 to 5, where 5 indicates the highest agreement.
Results

Table 8 lists the evaluation results. Comparing (a) Random (Single) with (c) NS-filtered (Single), we can see that understandability and familiarity were improved by non-sentence filtering. Comparing (c) NS-filtered (Single) with (e) NS+F-filtered (Single), understandability further improved, although the difference was not significant. Since both (c) NS-filtered (Single) and (e) NS+F-filtered (Single) significantly outperform the baseline, this verifies the effectiveness of our filters. In addition, comparing (g) Gold NS (Single) with (h) Gold F (Single), we can confirm that utterances need to be appropriate for their associated foci. These results indicate that our filters contribute greatly to the understandability of the utterances in the utterance database. Surprisingly, we also see improvements in familiarity and content richness.

Comparing (a) Random (Single) with (b) Random (Pair), although content richness improved, understandability significantly decreased. This means that just randomly concatenating utterances about the same focus in the current utterance database does not lead to good utterances. However, comparing (a) Random (Single) with (d) NS-filtered (Pair), we can see that our concatenation method improved familiarity and content richness while maintaining understandability. Comparing (d) NS-filtered (Pair) with (f) NS+F-filtered (Pair), we see further improvements in content richness and understandability. That is, while it is not a good idea to concatenate possibly low-quality utterances, it is a good idea to concatenate valid and on-focus utterances. Because content richness improved without loss of understandability, we can safely say that our concatenation method can reduce monotony and generate richer utterances.

Table 8: Subjective evaluation results (5 is high). Letters in parentheses indicate the methods over which that value was statistically better (double letters, e.g., aa: p < .01; single letters: p < .05). For the statistical test, we used the Steel-Dwass multiple comparison test [28]. * marks the top three scores for each evaluation criterion.

Method | Familiarity | Understandability | Content richness
Baseline:
(a) Random (Single) | 3.52 | 3.37 (bb) | 3.25
Proposed:
(b) Random (Pair) | 3.60 | 2.87 | 3.74* (aacceegg)
(c) NS-filtered (Single) | 3.75 (aa) | 3.73* (aabbddf) | 3.42
(d) NS-filtered (Pair) | 3.76* (aa) | 3.17 (b) | 3.89* (aacceegghh)
(e) NS+F-filtered (Single) | 3.75 (a) | 3.87* (aabbddff) | 3.53 (aa)
(f) NS+F-filtered (Pair) | 3.90* (aabbgg) | 3.49 (bbdd) | 4.12* (aabbccdeegghh)
Gold:
(g) Gold NS (Single) | 3.63 | 3.69 (abbdd) | 3.40
(h) Gold F (Single) | 3.88* (aabbgg) | 4.21* (aabbccddeeffgg) | 3.64 (aag)

6 Summary and future work

To refine our utterance database and generate non-monotonous utterances, we proposed methods of filtering non-sentences and utterances inappropriate for their associated foci using neural-network-based methods and co-occurrence statistics. To reduce monotony, we also proposed a simple but powerful method of concatenating two utterances related to the same focus so that the utterances can be longer and richer in content. Experimental results show that our non-sentence filter can successfully remove non-sentences with an accuracy of 95% and that we can filter utterances inappropriate for their foci with high recall. We also examined the effectiveness of our filtering and concatenation methods through an experiment involving human participants. The results show that our automatic methods incorporating non-sentence and focus filtering significantly outperformed the current single-utterance baseline and that the concatenation of two utterances leads to higher familiarity and content richness while maintaining understandability. We believe our proposed methods can especially contribute to commercial chat-oriented dialogue systems, in which the quality of utterances is critical.

For future work, we plan to update the utterance database of our current chat-oriented dialogue system with our filtering and concatenation methods. We also plan to consider methods of concatenating two utterances more appropriately, for example, by taking discourse relations [30; 26] into account.

References

[1] Kanako Onishi and Takeshi Yoshimura. Casual conversation technology achieving natural dialog with computers. NTT DOCOMO Technical Journal, Vol. 15, No. 4, pp. 16–21, 2014.
[2] Alan Ritter, Colin Cherry, and William B. Dolan. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 583–593, 2011.
[3] Oriol Vinyals and Quoc Le. A neural conversational model. In Proceedings of the 32nd International Conference on Machine Learning Deep Learning Workshop, 2015.
[4] Zhou Yu, Ziyu Xu, Alan W. Black, and Alexander I. Rudnicky. Strategy and policy learning for non-task-oriented conversational systems. In Proceedings of the 17th Annual SIGdial Meeting on Discourse and Dialogue, pp. 404–412, 2016.
[5] Rafael E. Banchs and Haizhou Li. IRIS: a chat-oriented dialogue system based on the vector space model. In Proceedings of the ACL 2012 System Demonstrations, pp. 37–42, 2012.
[6] Ryuichiro Higashinaka, Toyomi Meguro, Hiroaki Sugiyama, Toshiro Makino, and Yoshihiro Matsuo. On the difficulty of improving hand-crafted rules in chat-oriented dialogue systems. In Proceedings of the Signal and Information Processing Association Annual Summit and Conference, pp. 1014–1018, 2015.
[7] Ryuichiro Higashinaka, Kenji Imamura, Toyomi Meguro, Chiaki Miyazaki, Nozomi Kobayashi, Hiroaki Sugiyama, Toru Hirano, Toshiro Makino, and Yoshihiro Matsuo. Towards an open-domain conversational system fully based on natural language processing. In Proceedings of the 25th International Conference on Computational Linguistics, pp. 928–939, 2014.
[8] Joseph Weizenbaum. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM, Vol. 9, No. 1, pp. 36–45, 1966.
[9] Richard S. Wallace. The Anatomy of A.L.I.C.E. In Parsing the Turing Test, pp. 181–210. Springer, 2009.
[10] Fumihiro Bessho, Tatsuya Harada, and Yasuo Kuniyoshi. Dialog system using real-time crowdsourcing and Twitter large-scale corpus. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 227–231, 2012.
[11] Masahiro Shibata, Tomomi Nishiguchi, and Yoichi Tomiura. Dialog system for open-ended conversation using web documents. Informatica (Slovenia), Vol. 33, No. 3, pp. 277–284, 2009.
[12] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2016.
[13] Yuanlong Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray Kurzweil. Generating high-quality and informative conversation responses with sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2210–2219, 2017.
[14] Koji Imaeda, Atsuo Kawai, Yuji Ishikawa, Ryo Nagata, and Fumito Masui. Error detection and correction of case particles in Japanese learner's composition. In Proceedings of the Information Processing Society of Japan SIG, No. 13 (2002-CE-068), pp. 39–46, 2003. (In Japanese).
[15] Hiromi Oyama, Yuji Matsumoto, Masayuki Asahara, and Kosuke Sakata. Construction of an error information tagged corpus of Japanese language learners and automatic error detection. In Proceedings of the Computer Assisted Language Instruction Consortium, 2008.
[16] Kenji Imamura, Kuniko Saito, Kugatsu Sadamitsu, and Hitoshi Nishikawa. Grammar error correction using pseudo-error sentences and domain adaptation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 388–392, 2012.
[17] Ryuichiro Higashinaka, Kotaro Funakoshi, Yuka Kobayashi, and Michimasa Inaba. The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics. In Proceedings of the 2016 Language Resources and Evaluation Conference, pp. 3146–3150, 2016.
[18] Ryuichiro Higashinaka, Kotaro Funakoshi, Michimasa Inaba, Yuiko Tsunomori, Tetsuro Takahashi, and Nobuhiro Kaji. Overview of dialogue breakdown detection challenge 3. In Proceedings of the Dialog System Technology Challenges Workshop (DSTC6), 2017.
[19] Michimasa Inaba, Yuka Yoshino, and Kenichi Takahashi. Open domain monologue generation for speaking-oriented dialogue systems. Journal of Japanese Society for Artificial Intelligence, Vol. 31, No. 1, pp. DSF-F 1, 2016. (In Japanese).
[20] Chiaki Miyazaki, Toru Hirano, Ryuichiro Higashinaka, Toshiro Makino, and Yoshihiro Matsuo. Automatic conversion of sentence-end expressions for utterance characterization of dialogue systems. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, pp. 307–314, 2015.
[21] Toyomi Meguro, Ryuichiro Higashinaka, Yasuhiro Minami, and Kohji Dohsaka. Controlling listening-oriented dialogue using partially observable Markov decision processes. In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 761–769, 2010.
[22] Ryuichiro Higashinaka, Kotaro Funakoshi, Masahiro Araki, Hiroshi Tsukahara, Yuka Kobayashi, and Masahiro Mizukami. Towards taxonomy of errors in chat-oriented dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 87–95, 2015.
[23] Masatoshi Suzuki, Koji Matsuda, Satoshi Sekine, Naoaki Okazaki, and Kentaro Inui. Neural joint learning for classifying Wikipedia articles into fine-grained named entity types. In Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation, pp. 535–544, 2016.
[24] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1746–1751, 2014.
[25] David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. Automatic evaluation of topic coherence. In Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 100–108, 2010.
[26] Ziheng Lin, Min-Yen Kan, and Hwee Tou Ng. Recognizing implicit discourse relations in the Penn Discourse Treebank. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 343–351, 2009.
[27] Regina Barzilay and Mirella Lapata. Modeling local coherence: An entity-based approach. Computational Linguistics, Vol. 34, No. 1, pp. 1–34, 2008.
[28] Meyer Dwass. Some k-sample rank-order tests. Contributions to Probability and Statistics, pp. 198–202, 1960.
[29] Hiroaki Sugiyama, Toyomi Meguro, and Ryuichiro Higashinaka. Multi-aspect evaluation for utterances in chat dialogues. In Proceedings of the Special Interest Group on Spoken Language Understanding and Dialogue Processing, Vol. 4, pp. 31–36, 2014. (In Japanese).
[30] Atsushi Otsuka, Toru Hirano, Chiaki Miyazaki, Ryuichiro Higashinaka, Toshiro Makino, and Yoshihiro Matsuo. Utterance selection using discourse relation filter for chat-oriented dialogue systems. In Dialogues with Social Robots, pp. 355–365. Springer, 2017.