Refinement of utterance database and concatenation of utterances for enhancing system utterances in chat-oriented dialogue systems

Yuiko Tsunomori¹, Ryuichiro Higashinaka², Takeshi Yoshimura¹
¹NTT DOCOMO, ²NTT Corporation
¹{yuiko.tsunomori.fc, yoshimurat}@nttdocomo.com, ²higashinaka.ryuichiro@lab.ntt.co.jp

Abstract

We have been using an utterance database created from a massive amount of predicate-argument structures extracted from the web for generating utterances of our commercial chat-oriented dialogue system. However, since the creation of this database involves several automated processes, the database often includes non-sentences (ungrammatical or uninterpretable sentences) and utterances with inappropriate topic information (called off-focus utterances). Also, utterances tend to be monotonous and uninformative because they are created from single predicate-argument structures. To tackle these problems, we propose methods for filtering non-sentences by using neural-network-based methods and for filtering utterances inappropriate for their associated foci by using co-occurrence statistics. To reduce monotony, we also propose a method for concatenating automatically generated utterances so that the utterances can be longer and richer in content. Experimental results indicate that our non-sentence filter can successfully remove non-sentences with an accuracy of 95% and that we can filter utterances inappropriate for their foci with high recall. We also examined the effectiveness of our filtering and concatenation methods through an experiment involving human participants. The experimental results show that our methods significantly outperformed the baseline in terms of understandability and that the concatenation of two utterances leads to higher familiarity and content richness while retaining understandability.

1 Introduction

Chat-oriented dialogue systems have become increasingly popular [1; 2; 3; 4; 5]. Such systems need to generate a wide variety of utterances to cope with the many topics contained in user utterances. Although rule-based methods have typically been used to generate system utterances, the topics that appear in chats are diverse, and it is extremely expensive to create rules with adequate coverage [6].

To overcome this weakness, Higashinaka et al. [7] proposed a method that uses a large volume of text data on the web to extract predicate-argument structures (PASs) and convert them into utterances. The result of this method is a database of utterances with their associated topics, called foci (see Section 3 for details). We are using the utterance database created by this method in our commercial chat-oriented dialogue system (https://dev.smt.docomo.ne.jp/) [1].

Although this method can generate utterances corresponding to a variety of foci by exploiting the richness of the web, the system utterances have the following problems:

• Because of errors resulting from the automatic analysis of PASs and their automatic conversion into utterances, non-sentences (ungrammatical or uninterpretable sentences) and utterances inappropriate for their associated foci (called off-focus utterances) can sometimes be generated.
• The system utterances tend to be monotonous and uninformative because they are created from single PASs.

In this paper, we propose methods for improving the quality of the utterance database created by using Higashinaka et al.'s method [7] and for reducing the monotony of system utterances. In particular, our methods filter non-sentences and off-focus utterances using neural-network-based methods and co-occurrence statistics. We also propose a method of reducing monotony by concatenating pairs of automatically generated utterances about the same focus so that the utterances can be longer and richer in content. We verified the effectiveness of our methods through an experiment involving human participants. Our contributions are as follows:

• We successfully created non-sentence and off-focus filters that can greatly refine the utterance database created from PASs on the web. In terms of utterance quality, we observed significant improvements regarding familiarity, understandability, and content richness in subjective evaluations. By using our methods, the utterances of the database can be safely used by chat-oriented dialogue systems.
• We found that, by concatenating two utterances about the same focus from the utterance database, we can create utterances that are significantly better in terms of familiarity and content richness. We confirmed that this effect is brought about only when we use the utterance database refined by the non-sentence and off-focus filters.
We believe our proposed methods can especially contribute to commercial chat-oriented dialogue systems, in which the quality of utterances is critical.

The paper is structured as follows. In Section 2, we cover related work. In Section 3, we explain our PAS-based utterance database and examine the proportions of non-sentences and utterances inappropriate for their associated foci. In Section 4, we explain our proposed methods for filtering inappropriate utterances and our utterance-concatenation method. In Section 5, we explain our experiment involving human participants. Finally, we summarize the paper and discuss future work in Section 6.

2 Related work

Various methods have been proposed to generate utterances in chat-oriented dialogue systems, such as rule-, retrieval-, and generation-based methods.

Rule-based methods generate system utterances on the basis of hand-crafted rules. Representative systems that use such rules are ELIZA [8] and A.L.I.C.E. [9]. However, the topics that appear in chat are diverse, and it is extremely expensive to hand-craft rules with wide coverage [6].

Retrieval-based methods have been proposed to improve coverage. The recent increase in web data has propelled the development of methods that use data retrieved from the web for open-domain conversation [10; 11; 2]. The advantage of such retrieval-based methods is that, owing to the diversity of the web, systems can retrieve at least some responses for any user input, which can solve the coverage problem. However, this comes at the cost of utterance quality. Since the web is inherently noisy, it is, in many cases, difficult to sift out appropriate sentences from retrieval results.

Recently, generation-based methods based on neural networks have been extensively researched. However, these methods generally tend to generate utterances with little content, although there has been research on improving the diversity of generated utterances [12; 13]. We acknowledge that current neural-network-based methods are yielding promising results. However, we use an utterance database created from PASs on the web [7] because it is guaranteed to output system utterances with content related to the focus of the conversation and because system utterances can be more controllable, which is particularly important for commercial applications.

The detection of inappropriate utterances, including non-sentences, is related to the detection of grammatical errors made by second-language learners. Imaeda et al. [14] proposed a dictionary-based method for detecting case particle errors by using a lexicon, Oyama et al. [15] proposed a support vector machine (SVM)-based method for detecting case particle errors in documents created by non-native Japanese speakers, and Imamura et al. [16] proposed a method for detecting all types of particle errors. However, these methods cannot be directly applied to the utterances of dialogue systems since the error tendency of automatically generated utterances differs from that of second-language learners. The detection of inappropriate utterances has also been tackled in the dialogue breakdown detection challenges (DBDCs) [17; 18]. However, their main focus is on detecting inappropriate utterances in the context of a dialogue, whereas we focus on refining an utterance database. Inaba et al. [19] proposed a monologue-generation method for non-task-oriented dialogue systems that concatenates sentences extracted from Twitter. This is similar to our concatenation method in that it concatenates utterances to reduce monotony but different in that it targets monologues rather than dialogues.
3 PAS-based utterance database

We first describe the construction and details of the utterance database of our chat-oriented dialogue system. Then, to illustrate the problems with the database, we examine the proportions of non-sentences and off-focus utterances.

3.1 Creation of the utterance database

We use the utterance database created with the method described by Higashinaka et al. [7]. The method uses PAS analysis [16] to extract PASs with their foci from a large amount of text data. To extract high-quality PASs and their foci, the method extracts predicates with just two arguments explicitly marked with the particles 'wa' and 'ga.' 'Wa' is a topic marker and 'ga' is a nominative case marker in Japanese. In this way, a subject and a predicate can be extracted as constituents of a PAS together with a focus.

Since PASs cannot be uttered as they are, they need to be converted into utterances. Given a PAS and a dialogue-act type (we need this as input because utterances require underlying intentions; dialogue-act types are described below), an utterance is automatically created. The PASs are first converted into declarative sentences using a simple rule. Then, their sentence-end expressions (in Japanese, modalities are mostly expressed by sentence-end expressions) are swapped with those matching the target dialogue-act type. The sentence-end expressions used are those automatically mined from dialogue-act-annotated dialogue data. The details of the method of obtaining and swapping sentence-end expressions are given by Miyazaki et al. [20].

From the list of 32 dialogue-act types [21], the 21 that are mainly related to self-disclosure and questions are used for conversion. From blog data (about three years' worth of blog articles) and the combination of the extracted PASs and the dialogue-act types, the resulting utterance database contains 7,116,597 utterances associated with 204,497 foci.
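To make the conversion pipeline concrete, the following is a minimal sketch of the PAS-to-utterance step. The PAS fields, the rule for building the declarative sentence, and the sentence-end expression table are simplified illustrations; the actual rules and the mined expressions are described in [7] and [20].

```python
from dataclasses import dataclass

# Hypothetical sentence-end expressions per dialogue-act type, standing in
# for the expressions mined from dialogue-act-annotated data [20].
SENTENCE_END = {
    "self-disclosure": "ですね",  # declarative ending
    "question": "ですか?",        # interrogative ending
}

@dataclass
class PAS:
    focus: str      # topic argument marked with 'wa' (e.g., 秋冬)
    subject: str    # nominative argument marked with 'ga' (e.g., ブーツ)
    predicate: str  # predicate (e.g., 多い)

def pas_to_utterance(pas: PAS, dialogue_act: str) -> str:
    # Step 1: convert the PAS into a declarative sentence with a simple
    # rule of the form "<focus> wa <subject> ga <predicate>".
    declarative = f"{pas.focus}は{pas.subject}が{pas.predicate}"
    # Step 2: swap in a sentence-end expression for the target dialogue act.
    return declarative + SENTENCE_END[dialogue_act]

print(pas_to_utterance(PAS("秋冬", "ブーツ", "多い"), "question"))
```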
3.2 Quality of utterance database

Since the PASs are extracted and converted into utterances automatically, errors in the resulting utterances are inevitable, affecting the quality of the utterance database. From our observation, there can be two types of erroneous utterances: non-sentences and off-focus utterances.

Non-sentences: Sentences that we cannot understand due to grammatical errors or a strange combination of words. Non-sentences are generated mainly in the conversion of sentence-end expressions; some propositions cannot be uttered with certain sentence-end expressions in Japanese (see [22] for such examples).

Off-focus utterances: Utterances inappropriate for their associated foci. Although the utterances in the database are created from PASs in which the focus and subject are explicitly marked by the topic marker and case marker, respectively, the focus and content of an utterance are often not closely associated. This occurs when there is an error in the PAS analysis or when the meaning of the focus is just too broad or vague.

We investigated the current quality of the database in terms of how many non-sentences and off-focus utterances it contains. For this purpose, we performed the annotations described below.

Non-sentence annotation

We randomly sampled 200,000 utterances from the utterance database. The annotators labeled each utterance with the following instructions:

• If you think the utterance is a non-sentence, label it 0.
• If you do not think the utterance is a non-sentence (i.e., it is a valid sentence), label it 1.

A total of 24 annotators participated; two annotators were randomly assigned to each utterance. Cohen's κ value, which assesses the agreement between the two annotators, was 0.56, indicating an intermediate degree of agreement. Table 1 lists annotation examples, and Table 2 gives the annotation breakdown. Non-sentences accounted for 12% of the database. Hereafter, we call the non-sentence annotation data on which the annotators agreed "the non-sentence corpus" (containing 174,007 utterances).

Table 1: Examples of non-sentence annotation (0: non-sentence, 1: valid sentence). A1 and A2 indicate the labels given by the two different annotators. Utterances were originally in Japanese; the English translations in parentheses were done by the authors.

Focus | Utterance | A1 | A2
秋冬 (Fall & winter) | どんなんが流行りますよね (What kind of types is popular, isn't it?) (NB: this sentence sounds odd because its subject is an interrogative while the sentence is declarative.) | 0 | 0
秋冬 (Fall & winter) | レギンス男子が増えてますねぇ (Boys wearing leggings are increasing, aren't they?) | 1 | 1
秋冬 (Fall & winter) | 空気が乾燥したりとかです (Air is dry and so on.) | 1 | 0

Table 2: Statistics of non-sentence annotation (0: non-sentence, 1: valid sentence)

                                          # of utterances   Percentage
  2 annotators labeled 0                           23,052          12%
  2 annotators labeled 1                          150,955          75%
  1 annotator labeled 0, other labeled 1           25,993          13%
  Total                                           200,000         100%

Focus annotation

Using the utterances annotated as valid sentences in the non-sentence corpus (i.e., 150,955 utterances), two annotators labeled whether the utterances were appropriate to their foci. The annotators were shown pairs of a focus and an utterance and labeled each pair with the following instructions:

• If you feel the combination of utterance and focus is unnatural, label it 0 (off-focus).
• If you feel the combination of utterance and focus is natural, label it 1 (on-focus). When the focus has multiple meanings, if there is at least one reasonable interpretation, label the combination 1.

A total of 24 annotators participated; pairs of annotators were randomly selected for labeling the pairs of a focus and an utterance. Cohen's κ value was 0.32, which indicates a reasonable degree of agreement considering the subjective nature of judging naturalness. Table 3 shows examples of this annotation, and Table 4 gives the annotation breakdown. Utterances inappropriate for their associated foci accounted for 5% of the database. Hereafter, we call the focus annotation data on which the annotators agreed "the focus corpus" (containing 129,039 utterances).

Table 3: Examples of focus annotation (0: off-focus, 1: on-focus). A1 and A2 indicate the labels given by the two different annotators.

Focus | Utterance | A1 | A2
秋冬 (Fall & winter) | 単価が高いんですか? (Is the unit price high?) | 0 | 0
秋冬 (Fall & winter) | ブーツが多いのでしょうか? (Are there a lot of boots?) | 1 | 1
秋冬 (Fall & winter) | 空気が澄んでるんですかね? (Is the air clear?) | 1 | 0

Table 4: Statistics of focus annotation (0: off-focus, 1: on-focus)

                                          # of utterances   Percentage
  2 annotators labeled 0                            7,528           5%
  2 annotators labeled 1                          121,511          80%
  1 annotator labeled 0, other labeled 1           21,916          15%
  Total                                           150,955         100%
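For reference, the agreement values reported above can be computed directly from the paired labels. A minimal sketch with toy stand-in labels (the real inputs are the 0/1 annotations described above):

```python
from sklearn.metrics import cohen_kappa_score

# Toy stand-in labels from annotators A1 and A2
# (1 = valid sentence / on-focus, 0 = non-sentence / off-focus).
a1 = [1, 1, 0, 1, 0, 1, 1, 0]
a2 = [1, 0, 0, 1, 0, 1, 1, 1]

print(f"Cohen's kappa: {cohen_kappa_score(a1, a2):.2f}")
```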
4 Proposed methods

We found that 12% of the utterances in our database are non-sentences and 5% are inappropriate for their associated foci. Since this means the system utterances can often be erroneous, we need to reduce these utterances to improve the quality of our database. We also see it as a problem that the utterances in our database are monotonous and uninformative because they were generated from single PASs.

In this paper, we propose methods of filtering non-sentences and off-focus utterances to refine the database. We also propose a method of concatenating pairs of utterances about the same focus to reduce the monotony of system utterances.

4.1 Method for creating non-sentence filter

Since the detection of non-sentences can be regarded as a sentence-classification task, we created a non-sentence filter by using machine-learning methods. We used standard machine-learning methods for sentence classification, namely an SVM and the neural-network-based methods that have been extensively used in recent years. We used the following methods for training our classifiers (we used scikit-learn (http://scikit-learn.org/) for the SVM and Chainer (http://chainer.org/) for the MLP, CNN, and LSTM); a sketch of the simplest configuration follows the list:

SVM: We train an SVM classifier with a linear kernel. The features are the averaged word vectors of the words contained in an utterance. We use the pretrained 200-dimensional word vectors provided by Suzuki et al. [23]. We use the same pretrained word vectors for the MLP, CNN, and LSTM, which we describe below.

Multi-layer perceptron (MLP): We train a classifier with an MLP. It has five layers: the input layer, three non-linear layers (each with 200 units) with sigmoid activation, and the output layer. We use averaged word vectors as input. The output layer makes a binary decision with a softmax function.

Convolutional neural network (CNN): We train a classifier with a CNN. It has an input layer, a convolutional layer, a pooling layer, and an output layer. The model structure is the same as that used by Kim [24]. A filter of size 200 × 3 is used for convolution, with a stride of one and ReLU as the activation function. The max-pooling layer uses a window size of three to output a fixed-length vector. The output layer makes a binary decision with a softmax function.

Long short-term memory (LSTM): We train a classifier with an LSTM. It has an input layer, an LSTM layer with 200 units, three hidden layers, and an output layer. Each word is converted into an embedding, and the sequence of word embeddings is converted into a hidden representation corresponding to a sentence vector. This vector is fed to three non-linear layers (each with 200 units) with sigmoid activation, the output of which is input to the output layer, which makes a binary decision with a softmax function.
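As an illustration of the feature extraction shared by the SVM and MLP, here is a minimal sketch of the linear-SVM configuration. The vector-file format and helper names are placeholders, not the exact setup used with the pretrained vectors of Suzuki et al. [23].

```python
import numpy as np
from sklearn.svm import LinearSVC

EMB_DIM = 200  # dimensionality of the pretrained word vectors [23]

def load_word_vectors(path):
    """Load pretrained word vectors (placeholder: any word2vec-style text
    file mapping a word to EMB_DIM floats would do)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vals = line.rstrip().split()
            vectors[word] = np.array(vals, dtype=np.float32)
    return vectors

def utterance_vector(tokens, vectors):
    """Average the vectors of the words in an utterance (OOV words skipped)."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(EMB_DIM, dtype=np.float32)

def train_filter(train_tokens, train_labels, vectors):
    """train_tokens: tokenized utterances; train_labels: 0 = non-sentence,
    1 = valid sentence."""
    X = np.stack([utterance_vector(t, vectors) for t in train_tokens])
    clf = LinearSVC()  # linear-kernel SVM
    clf.fit(X, train_labels)
    return clf
```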
4.2 Method for creating focus filter

To filter out off-focus utterances, we use co-occurrence statistics, namely, the point-wise mutual information (PMI) between the subject of an utterance and its focus. We use PMI because it has been successfully used to filter sentences unrelated to topics [25]. We calculate the PMI with the following equation:

PMI(S, F) = log2 [ (count(S, F)/N) / ((count(S)/N) * (count(F)/N)) ],   (1)

where S is a subject; F is a focus; 'count' is a function that returns the number of documents containing S, F, or both; and N is the total number of documents in the text database. We use a sentence as the document unit.

If the PMI value is below a certain threshold, we filter out the utterance because the association between subject and focus can be considered low. The threshold can be determined experimentally; that is, we find the threshold that produces the best accuracy on training/development data. Note that the best accuracy depends on the objective. If we want the resulting database to be as clean as possible, we can set a high threshold. If we do not want to lose much data, the threshold can be set lower. In this study, we set the target recall for detecting off-focus utterances to 80% because we want most off-focus utterances removed. We determine the threshold that achieves this recall on the training/development data and use it for filtering possible off-focus utterances.

Note that an appropriate text database must be chosen for calculating the PMI. We consider using Wikipedia (containing roughly 8M sentences) and blogs (we use one year's worth of blogs containing about 2B sentences). The former is smaller but more informational. The latter is larger but noisy and a mixture of content of varying quality. We verify which one is more useful in a later experiment, although we naturally assume that blog data are more suitable because they have more variety, which is a requirement for chat-oriented dialogue systems.
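A minimal sketch of Eq. (1) and the filtering decision, assuming the per-term and per-pair document (sentence) counts over the reference corpus have been precomputed into dictionaries (the names are illustrative):

```python
import math

def pmi(count_sf: int, count_s: int, count_f: int, n_docs: int) -> float:
    """PMI between subject S and focus F from document counts (Eq. 1)."""
    if count_sf == 0 or count_s == 0 or count_f == 0:
        return float("-inf")  # no (co-)occurrence: lowest possible association
    return math.log2((count_sf / n_docs) /
                     ((count_s / n_docs) * (count_f / n_docs)))

def keep_utterance(subject, focus, term_counts, pair_counts, n_docs, threshold):
    """Keep the utterance only if the subject-focus association is high enough
    (e.g., a threshold of 2.8 when the counts come from blog data; Section 5.2)."""
    score = pmi(pair_counts.get((subject, focus), 0),
                term_counts.get(subject, 0),
                term_counts.get(focus, 0),
                n_docs)
    return score >= threshold
```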
4.3 Utterance concatenation

As one solution for reducing monotony, we propose a method of concatenating pairs of automatically generated utterances about the same focus so that the utterances can be longer and richer in content. More specifically, we propose concatenating two random utterances that have the same focus.

Although this approach may seem simplistic, it can be effective because, at the very least, it increases the utterance length of the system. Note that it is not trivial to create a reasonable utterance by concatenating two utterances. It has been shown that implicit discourse relations are still hard to detect [26], which means that utterances that will be coherent in terms of discourse are difficult to select accurately. That said, we believe our simple concatenation method may just work because the concatenated utterance will satisfy local coherence [27] through the same underlying entity (i.e., the focus).
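The concatenation step itself is deliberately simple. A minimal sketch, assuming the filtered database has been indexed as a mapping from each focus to its surviving utterances (the data structure is our illustration, not a prescribed format):

```python
import random

def concatenate_for_focus(focus_to_utterances, focus, rng=random):
    """Randomly pick two distinct utterances sharing the same focus and join
    them into one longer system utterance; fall back to a single utterance
    when fewer than two candidates survive filtering."""
    candidates = focus_to_utterances.get(focus, [])
    if len(candidates) < 2:
        return candidates[0] if candidates else None
    first, second = rng.sample(candidates, 2)
    return f"{first} {second}"
```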
5 Evaluation

We first individually evaluated the performance of our non-sentence and focus filtering methods and then conducted a subjective evaluation involving human participants on the filtered and concatenated utterances.

5.1 Evaluation of our non-sentence filtering methods

We trained a non-sentence filter by using the non-sentence corpus (see Section 3.2). We split the data into training, development, and test sets corresponding to 3,837, 500, and 500 foci, respectively. We trained the classifiers on the training data and evaluated accuracy on the test data, using the model that yielded the highest F-measure on the development data (note that for the SVM, we used the training data for training and the test data for evaluation; we did not use the development data). The classification results are listed in Table 5. Our method successfully detected non-sentences with high accuracy. The LSTM model had the highest accuracy (0.95) and F-measure (0.83), probably because the determination of non-sentences depends on the sequence of words, which is best captured by recurrent models.

Table 5: Accuracy, precision, recall, and F-measure for the detection of non-sentences (* marks the top score for each evaluation criterion).

Method   Accuracy   Precision   Recall   F-measure
SVM      0.93       0.81        0.71     0.76
MLP      0.90       0.63        0.84*    0.72
CNN      0.94       0.86        0.73     0.79
LSTM     0.95*      0.88*       0.78     0.83*

5.2 Evaluation of our focus filtering method

We split the focus corpus (see Section 3.2) into 80% training data and 20% test data. We first calculated the PMI values between the subjects and foci of all utterances by using the training data. Then, we looked for the PMI threshold that achieved 80% recall for off-focus utterances through a grid search.

When we used Wikipedia as the data for the PMI calculation, we obtained a threshold of 2.2; when we used the blog data, the threshold was 2.8. Figures 1 and 2 show the changes in precision, recall, and F-measure as the threshold is changed in intervals of 0.1. Table 6 shows the precision, recall, and F-measure for off-focus/on-focus utterances on the training and test data when the thresholds of 2.2 and 2.8 are used for Wikipedia and the blog data, respectively. As expected, the use of blog data yielded much better results, with higher precision/recall for on-focus utterances at the point of 80% recall for off-focus utterances. The results indicate that our off-focus filter can successfully filter utterances that are not associated with their foci (off-focus utterances).

Table 6: Precision, recall, and F-measure for off-focus/on-focus utterances on the training and test data when thresholds of 2.2 and 2.8 are used for Wikipedia and blog data, respectively.

                            Precision   Recall   F-measure
Wikipedia
  train      off-focus      0.09        0.82     0.16
             on-focus       0.98        0.49     0.65
  test       off-focus      0.09        0.80     0.16
             on-focus       0.97        0.42     0.59
Blog data
  train      off-focus      0.12        0.81     0.20
             on-focus       0.98        0.62     0.76
  test       off-focus      0.13        0.81     0.23
             on-focus       0.98        0.64     0.77

[Figure 1: Changes in precision, recall, and F-measure for off-focus and on-focus utterances as the PMI threshold (x-axis, 0.0-4.8) is changed in intervals of 0.1, with Wikipedia used for the PMI calculation.]

[Figure 2: Changes in precision, recall, and F-measure for off-focus and on-focus utterances as the PMI threshold (x-axis, 0.0-4.8) is changed in intervals of 0.1, with blog data used for the PMI calculation.]
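The grid search described above amounts to taking the smallest threshold whose off-focus recall reaches the target, so that as few on-focus utterances as possible are discarded. A minimal sketch under that reading (function and variable names are ours):

```python
import numpy as np

def pick_threshold(pmi_scores, is_off_focus, target_recall=0.80, step=0.1):
    """Return the smallest PMI threshold (utterances with PMI below it are
    flagged off-focus) whose recall on true off-focus utterances reaches
    the target; None if the target is never reached on the grid."""
    pmi_scores = np.asarray(pmi_scores, dtype=float)
    is_off_focus = np.asarray(is_off_focus, dtype=bool)
    for threshold in np.arange(0.0, 5.0, step):  # grid matching Figures 1 and 2
        flagged = pmi_scores < threshold
        recall = (flagged & is_off_focus).sum() / is_off_focus.sum()
        if recall >= target_recall:
            return float(threshold)
    return None
```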
5.3 Subjective evaluation

We conducted a subjective evaluation involving human participants to verify the effectiveness of our non-sentence and focus filtering methods as well as our concatenation method (see Section 4.3).

Evaluation procedure

Four participants took part in the evaluation. We made each of the eight methods for comparison (see below for details) generate utterances for 100 randomly selected foci, resulting in 800 utterances (8 methods × 100 foci) for use in the experiment. The utterances were randomly shuffled and presented to the participants. Each participant rated the 800 utterances in terms of familiarity, understandability, and content richness (we describe these criteria later).

Methods for comparison

We compared the following eight methods, (a)-(h). Note that for non-sentence filtering, we used the LSTM model, which showed the best performance in our experiment, and for focus filtering, we used the PMI threshold of 2.8 calculated with the blog data.

(a) Random (Single): Baseline. We randomly select a single utterance from the utterance database.

(b) Random (Pair): Proposed. We randomly select two utterances from the utterance database and concatenate them to create a system utterance.

(c) NS-filtered (Single): Proposed. We randomly select one utterance from the test data of the non-sentence corpus that was classified as a valid sentence by the non-sentence filter.

(d) NS-filtered (Pair): Proposed. We randomly select two utterances from the test data of the non-sentence corpus that were classified as valid sentences by the non-sentence filter. We then concatenate these utterances to create a system utterance.

(e) NS+F-filtered (Single): Proposed. We randomly select one utterance from the test data of the non-sentence corpus that was classified as a valid sentence by the non-sentence filter and as on-focus by the focus filter.

(f) NS+F-filtered (Pair): Proposed. We randomly select two utterances from the test data of the non-sentence corpus that were classified as valid sentences by the non-sentence filter and as on-focus by the focus filter. We then concatenate these utterances to create a system utterance.

(g) Gold NS (Single): We randomly select one utterance annotated as a valid sentence in the test data of the non-sentence corpus.

(h) Gold F (Single): We randomly select one utterance annotated as on-focus in the test data of the focus corpus.

Random (Single) is the baseline, which is our current method of just using a single utterance for a given focus from the utterance database. Table 7 lists example utterances generated by the eight methods.

Table 7: Example utterances generated by the eight methods used in the subjective evaluation

Method | Focus | Utterance
(a) Random (Single) | カルボナーラ (Carbonara) | カルボナーラはパスタがいいたいですか? (Does carbonara want to say pasta?) (NB: this is a non-sentence; the inanimate subject "carbonara" cannot be the subject of "say.")
(b) Random (Pair) | 視力 (Eyesight) | 視力は出ないってことがわかりますねぇ 視力は右が下がります?? (We understand that your eyesight is not good. Has the sight of your right eye decreased?)
(c) NS-filtered (Single) | マフラー (Scarf) | マフラーはバーバリーマフラーが欲しいですね (I want a Burberry scarf.)
(d) NS-filtered (Pair) | 水曜 (Wednesday) | 水曜は授業が終わってますか?水曜は授業が入ってないんです (Has Wednesday's class ended? There is no class on Wednesday.)
(e) NS+F-filtered (Single) | バナナ (Banana) | バナナはおいしいのが多いですね (Bananas are generally delicious, aren't they?)
(f) NS+F-filtered (Pair) | 夕食 (Dinner) | 夕食は和食が食べたいんですって 夕食は鍋がいいですね (Somebody wants to have Japanese food for dinner. Japanese stew should be fine.)
(g) Gold NS (Single) | ワンコ (Doggy) | ワンコは耳がいいですよね (Doggies have good ears, don't they?)
(h) Gold F (Single) | 観光客 (Tourist) | 観光客は欧米人が多いとかですか? (Are there many tourists from Europe and the US?)

Evaluation criteria

Sugiyama et al. [29] used the semantic differential (SD) method to derive dimensions for evaluating utterances in chat-oriented dialogue systems. They identified three dimensions, which we used in our evaluation. The evaluation criteria, together with the statements used in the evaluation, were as follows:

• Familiarity: You feel familiar with the system and want to talk with it more.
• Content richness: You feel that the utterance is interesting and informative.
• Understandability: You feel that the utterance is natural and easy to understand.

Each participant rated their level of agreement with the above statements on a Likert scale from 1 to 5, where 5 indicates the highest agreement.
Results

Table 8 lists the evaluation results. Comparing (a) Random (Single) with (c) NS-filtered (Single), we can see that understandability and familiarity were improved by non-sentence filtering. Comparing (c) NS-filtered (Single) with (e) NS+F-filtered (Single), understandability further improved, although the difference was not significant. Since both (c) NS-filtered (Single) and (e) NS+F-filtered (Single) significantly outperform the baseline, this verifies the effectiveness of our filters. In addition, comparing (g) Gold NS (Single) with (h) Gold F (Single), we can confirm that utterances need to be appropriate for their associated foci. These results indicate that our filters contribute greatly to the understandability of the utterances in the utterance database. Surprisingly, we also see improvements in familiarity and content richness.

Comparing (a) Random (Single) with (b) Random (Pair), although content richness improved, understandability significantly decreased. This means that just randomly concatenating utterances about the same focus in the current utterance database does not lead to good utterances. However, comparing (a) Random (Single) with (d) NS-filtered (Pair), we can see that our concatenation method improved familiarity and content richness while maintaining understandability. Comparing (d) NS-filtered (Pair) with (f) NS+F-filtered (Pair), we see further improvements in content richness and understandability. That is, while it is not a good idea to concatenate possibly low-quality utterances, it is a good idea to concatenate valid and on-focus utterances. Because content richness improved without loss of understandability, we can safely say that our concatenation method can reduce monotony and generate richer utterances.

Table 8: Subjective evaluation results (5 is high). Letters in parentheses indicate the methods over which that value was statistically better (double letters, e.g., aa: p < .01; single letters: p < .05). For the statistical test, we used the Steel-Dwass multiple comparison test [28]. * marks the top three scores for each evaluation criterion.

Method | Familiarity | Understandability | Content richness
Baseline:
(a) Random (Single) | 3.52 | 3.37 (bb) | 3.25
Proposed:
(b) Random (Pair) | 3.60 | 2.87 | 3.74* (aacceegg)
(c) NS-filtered (Single) | 3.75 (aa) | 3.73* (aabbddf) | 3.42
(d) NS-filtered (Pair) | 3.76* (aa) | 3.17 (b) | 3.89* (aacceegghh)
(e) NS+F-filtered (Single) | 3.75 (a) | 3.87* (aabbddff) | 3.53 (aa)
(f) NS+F-filtered (Pair) | 3.90* (aabbgg) | 3.49 (bbdd) | 4.12* (aabbccdeegghh)
Gold:
(g) Gold NS (Single) | 3.63 | 3.69 (abbdd) | 3.40
(h) Gold F (Single) | 3.88* (aabbgg) | 4.21* (aabbccddeeffgg) | 3.64 (aag)

6 Summary and future work

To refine our utterance database and generate non-monotonous utterances, we proposed methods of filtering non-sentences and utterances inappropriate for their associated foci using neural-network-based methods and co-occurrence statistics. To reduce monotony, we also proposed a simple but powerful method of concatenating two utterances related to the same focus so that the utterances can be longer and richer in content. Experimental results show that our non-sentence filter can successfully remove non-sentences with an accuracy of 95% and that we can filter utterances inappropriate for their foci with high recall. We also examined the effectiveness of our filtering and concatenation methods through an experiment involving human participants. The results show that our automatic methods incorporating non-sentence and focus filtering significantly outperformed the current single-utterance baseline and that the concatenation of two utterances leads to higher familiarity and content richness while maintaining understandability. We believe our proposed methods can especially contribute to commercial chat-oriented dialogue systems, in which the quality of utterances is critical.

For future work, we plan to update the utterance database of our current chat-oriented dialogue system with our filtering and concatenation methods. We also plan to consider methods of concatenating two utterances more appropriately, for example, by taking discourse relations [30; 26] into account.

References

[1] Kanako Onishi and Takeshi Yoshimura. Casual conversation technology achieving natural dialog with computers. NTT DOCOMO Technical Journal, Vol. 15, No. 4, pp. 16–21, 2014.
[2] Alan Ritter, Colin Cherry, and William B. Dolan. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 583–593, 2011.
[3] Oriol Vinyals and Quoc Le. A neural conversational model. In Proceedings of the 32nd International Conference on Machine Learning Deep Learning Workshop, 2015.
[4] Zhou Yu, Ziyu Xu, Alan W. Black, and Alexander I. Rudnicky. Strategy and policy learning for non-task-oriented conversational systems. In Proceedings of the 17th Annual SIGdial Meeting on Discourse and Dialogue, pp. 404–412, 2016.
[5] Rafael E. Banchs and Haizhou Li. IRIS: a chat-oriented dialogue system based on the vector space model. In Proceedings of the ACL 2012 System Demonstrations, pp. 37–42, 2012.
[6] Ryuichiro Higashinaka, Toyomi Meguro, Hiroaki Sugiyama, Toshiro Makino, and Yoshihiro Matsuo. On the difficulty of improving hand-crafted rules in chat-oriented dialogue systems. In Proceedings of the Signal and Information Processing Association Annual Summit and Conference, pp. 1014–1018, 2015.
[7] Ryuichiro Higashinaka, Kenji Imamura, Toyomi Meguro, Chiaki Miyazaki, Nozomi Kobayashi, Hiroaki Sugiyama, Toru Hirano, Toshiro Makino, and Yoshihiro Matsuo. Towards an open-domain conversational system fully based on natural language processing. In Proceedings of the 25th International Conference on Computational Linguistics, pp. 928–939, 2014.
[8] Joseph Weizenbaum. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM, Vol. 9, No. 1, pp. 36–45, 1966.
[9] Richard S. Wallace. The Anatomy of A.L.I.C.E. In Parsing the Turing Test, pp. 181–210. Springer, 2009.
[10] Fumihiro Bessho, Tatsuya Harada, and Yasuo Kuniyoshi. Dialog system using real-time crowdsourcing and Twitter large-scale corpus. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 227–231, 2012.
[11] Masahiro Shibata, Tomomi Nishiguchi, and Yoichi Tomiura. Dialog system for open-ended conversation using web documents. Informatica (Slovenia), Vol. 33, No. 3, pp. 277–284, 2009.
[12] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2016.
[13] Yuanlong Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray Kurzweil. Generating high-quality and informative conversation responses with sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2210–2219, 2017.
[14] Koji Imaeda, Atsuo Kawai, Yuji Ishikawa, Ryo Nagata, and Fumito Masui. Error detection and correction of case particles in Japanese learner's composition. In Proceedings of the Information Processing Society of Japan SIG, No. 13 (2002-CE-068), pp. 39–46, 2003. (In Japanese).
[15] Hiromi Oyama, Yuji Matsumoto, Masayuki Asahara, and Kosuke Sakata. Construction of an error information tagged corpus of Japanese language learners and automatic error detection. In Proceedings of the Computer Assisted Language Instruction Consortium, 2008.
[16] Kenji Imamura, Kuniko Saito, Kugatsu Sadamitsu, and Hitoshi Nishikawa. Grammar error correction using pseudo-error sentences and domain adaptation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 388–392, 2012.
[17] Ryuichiro Higashinaka, Kotaro Funakoshi, Yuka Kobayashi, and Michimasa Inaba. The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics. In Proceedings of the 2016 Language Resources and Evaluation Conference, pp. 3146–3150, 2016.
[18] Ryuichiro Higashinaka, Kotaro Funakoshi, Michimasa Inaba, Yuiko Tsunomori, Tetsuro Takahashi, and Nobuhiro Kaji. Overview of dialogue breakdown detection challenge 3. In Proceedings of the Dialog System Technology Challenges Workshop (DSTC6), 2017.
[19] Michimasa Inaba, Yuka Yoshino, and Kenichi Takahashi. Open domain monologue generation for speaking-oriented dialogue systems. Journal of Japanese Society for Artificial Intelligence, Vol. 31, No. 1, pp. DSF-F 1, 2016. (In Japanese).
[20] Chiaki Miyazaki, Toru Hirano, Ryuichiro Higashinaka, Toshiro Makino, and Yoshihiro Matsuo. Automatic conversion of sentence-end expressions for utterance characterization of dialogue systems. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, pp. 307–314, 2015.
[21] Toyomi Meguro, Ryuichiro Higashinaka, Yasuhiro Minami, and Kohji Dohsaka. Controlling listening-oriented dialogue using partially observable Markov decision processes. In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 761–769, 2010.
[22] Ryuichiro Higashinaka, Kotaro Funakoshi, Masahiro Araki, Hiroshi Tsukahara, Yuka Kobayashi, and Masahiro Mizukami. Towards taxonomy of errors in chat-oriented dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 87–95, 2015.
[23] Masatoshi Suzuki, Koji Matsuda, Satoshi Sekine, Naoaki Okazaki, and Kentaro Inui. Neural joint learning for classifying Wikipedia articles into fine-grained named entity types. In Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation, pp. 535–544, 2016.
[24] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1746–1751, 2014.
[25] David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. Automatic evaluation of topic coherence. In Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 100–108, 2010.
[26] Ziheng Lin, Min-Yen Kan, and Hwee Tou Ng. Recognizing implicit discourse relations in the Penn Discourse Treebank. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 343–351, 2009.
[27] Regina Barzilay and Mirella Lapata. Modeling local coherence: An entity-based approach. Computational Linguistics, Vol. 34, No. 1, pp. 1–34, 2008.
[28] Meyer Dwass. Some k-sample rank-order tests. Contributions to Probability and Statistics, pp. 198–202, 1960.
[29] Hiroaki Sugiyama, Toyomi Meguro, and Ryuichiro Higashinaka. Multi-aspect evaluation for utterances in chat dialogues. In Proceedings of the Special Interest Group on Spoken Language Understanding and Dialogue Processing, Vol. 4, pp. 31–36, 2014. (In Japanese).
[30] Atsushi Otsuka, Toru Hirano, Chiaki Miyazaki, Ryuichiro Higashinaka, Toshiro Makino, and Yoshihiro Matsuo. Utterance selection using discourse relation filter for chat-oriented dialogue systems. In Dialogues with Social Robots, pp. 355–365. Springer, 2017.