Towards a Response Selection System for Spoken Requests in a Physical Domain

Andisheh Partovi, Ingrid Zukerman, Quan Tran
Faculty of Information Technology, Monash University
Clayton, VICTORIA 3800, AUSTRALIA
{andi.partovi,ingrid.zukerman,quan.tran}@monash.edu

Abstract

In this paper, we introduce a corpus comprising requests for objects in physical spaces, and responses given by people to these requests. We generated two datasets based on this corpus: a manually-tagged dataset, and a dataset which includes features that are automatically extracted from the output of a Spoken Language Understanding module. These datasets are used in a classification-based approach for generating responses to spoken requests. Our results show that, surprisingly, classifiers trained on the second dataset outperform those trained on the first, and produce acceptable levels of performance.

1 Introduction

In recent times, there have been significant improvements in Automatic Speech Recognition (ASR) [Chorowski et al., 2015; Bahdanau et al., 2016]. For example, a research prototype of a spoken slot-filling dialogue system reported a Word Error Rate (WER) of 13.8% when using "a generic dictation ASR system" [Mesnil et al., 2015], and Google reported an 8% WER for its ASR API.1 However, this API had a WER of 54.6% when applied to the Let's Go corpus [Lange and Suendermann-Oeft, 2014].

1 venturebeat.com/2015/05/28/google-says-its-speech-recognition-technology-now-has-only-an-8-word-error-rate.

ASR errors not only produce wrongly recognized entities or actions, but may also yield ungrammatical utterances that cannot be processed by a Spoken Language Understanding (SLU) system (e.g., "the plate inside the microwave" being mis-heard as "of plating sight the microwave"), or yield incorrect results when processed by an SLU system (e.g., due to fillers such as "hmm" being mis-heard as "and" or "on").

The problems caused by ASR errors are exacerbated by the fact that people often express themselves ambiguously or inaccurately [Trafton et al., 2005; Moratz and Tenbrink, 2006; Funakoshi et al., 2012; Zukerman et al., 2015]. An ambiguous reference to an object matches several objects well, while an inaccurate reference matches one or more objects partially. For instance, in a household domain, a reference to a "big blue mug" is ambiguous if there is more than one big blue mug in the room, and inaccurate if there are two mugs in the room, one big and red, and one small and blue. Further, ambiguous or inaccurate references may occur as a result of differences in parse trees (e.g., due to variants in prepositional attachments).

In addition to improving ASR and SLU modules, Spoken Dialogue Systems (SDSs) must be able to cope with these problems by generating appropriate responses to users' spoken utterances. Recently, deep-learning algorithms have been used for response generation [Serban et al., 2016; Yang et al., 2016]. However, these algorithms rely solely on requests and responses, without taking into account the (extra-linguistic) context, and typically require large amounts of data, which may not be available in some applications. In this paper, we offer a supervised-learning approach to response generation that is suitable for smaller datasets. Our approach harnesses the properties of utterances, dialogue history and context to choose response types for users' requests.

To obtain an upper bound for classifier performance, we trained a classifier using human-observable features of spoken requests and response types selected by participants for these requests. We then trained a second classifier using features that were automatically extracted from the output produced by our SLU system (Section 5). Surprisingly, the second classifier produced significantly better results than the first one.

The rest of this paper is organized as follows. In the next section, we discuss related work. Our corpus is described in Section 3. In Section 4, we detail the human-observable features and the response-classification results obtained with them.
We then offer a brief account of our SLU system, followed by a description of the features that are automatically extracted from its output and the resultant classification performance. Concluding remarks appear in Section 7.

2 Related Work

Decision-theoretic approaches have been the accepted standard for response generation in dialogue systems for some time [Carlson, 1983]. These approaches were initially implemented in SDSs in the form of Influence Diagrams that make myopic (one-shot) decisions regarding dialogue acts [Paek and Horvitz, 2000], procedures that optimize responses [Inouye and Biermann, 2005; Sugiura et al., 2009], and Dynamic Decision Networks that make decisions about dialogue acts over time [Horvitz et al., 2003; Liao et al., 2006]. Later on, reinforcement learning was employed to learn optimal policies over time [Lemon, 2010], with particular attention being paid to Partially Observable Markov Decision Processes [Williams and Young, 2007; Young et al., 2013; Gašić and Young, 2014], and their extension Hidden Information State [Young et al., 2007; Young et al., 2013]. Owing to the complexity of these formalisms, they have been used mainly in slot-filling applications, e.g., making airline and restaurant reservations [Young et al., 2013].

Recently, deep learning has been applied to various aspects of SDSs [Wen et al., 2015; Li et al., 2016; Mrkšic et al., 2016; Prakash et al., 2016; Serban et al., 2016; Yang et al., 2016]. Wen et al. [2015] focused on the generation of linguistically varied responses, and Mrkšic et al. [2016] proposed a dialogue-state tracking framework.
The generation of dialogue contributions was studied in [Li et al., 2016; Prakash et al., 2016] for chatbots; in [Serban et al., 2016] for help-desk responses and Twitter follow-up statements; and in [Yang et al., 2016] for slot tagging, and user-intent and system-action prediction in slot-filling applications. A combination of deep learning and reinforcement learning has been used in end-to-end dialogue systems that query a knowledge base, where user utterances are mapped to a clarification question or a knowledge-base query [Williams and Zweig, 2016; Zhao and Eskenazi, 2016; Dhingra et al., 2017]. All these systems learn to generate complete responses from large corpora comprising request-response pairs.

Our work follows this supervised-learning trend in a setting where the appropriateness of a response depends both on the request and on the physical context. Further, our dataset is significantly smaller than those used by neural mechanisms.

3 The Corpus

Our corpus, which was gathered in two experiments, comprises requests to fetch or move household objects, and responses to these requests.

Experiment 1 – This experiment replicates the experiment described in [Zukerman et al., 2015] using the Google ASR API, instead of Microsoft Speech SDK 6.1 — the WER of the Google API was 13% for our corpus. 35 participants were asked to describe 12 designated objects (labelled A to L) in the scenarios depicted in Figure 1. Each scenario contains between 8 and 16 household objects varying in size, colour and position. The participants were allowed to repeat a description up to two times. In total, they recorded 478 descriptions, such as the following: "the computer under the table", "the picture on the wall", "the green plate next to the screwdriver at the top of the table", "the plate in the corner of the table", and "the large pink ball in the middle of the room".

Experiment 2 – This experiment took the form of an online survey where participants had to indicate how they would respond to a (potentially mis-heard) request. Each participant was shown the top four ASR outputs for the request versions of 12 descriptions generated by one participant in the first experiment, along with the images in Figure 1. For instance, "the pot plant on the table", uttered in the context of Figure 1(a), was converted to "get the pot plant on the table"; and "the green bookcase", uttered in the context of Figure 1(d), was presented as "move the green bookcase".

Each participant was then asked to choose a response for each request from the following four response types (participants were given a description of each response type):

• DO: suitable when the addressee is sure about which object the request refers to.
• CONFIRM: suitable when the addressee feels the need to confirm the requested object before taking action.
• CHOOSE: suitable when the addressee hesitates between several objects.
• REPHRASE: suitable when part or all of a request is so unintelligible that the addressee cannot understand it.

These choices were made under two cost settings: low-cost – where participants were told that the requested object must be delivered to someone in the same room; and high-cost – where they were told that the object must be delivered to a far-away location. These settings were designed to discriminate between situations where mistakes are fairly inconsequential and situations where mistakes are costly.

40 people took part in this experiment (six of them also participated in the first experiment); half of the participants were native English speakers, and half were male. Thirteen people participated in an initial version of the experiment where they first chose response types for all the requests under the low-cost setting, and then chose response types for the same requests under the high-cost setting. We modified the experiment on the basis of the participants' feedback, so that the remaining 27 participants considered each request under the low-cost setting, and were immediately asked how their response would differ under the high-cost setting. This experimental variation had no effect on response-classification performance (Sections 4.1 and 6.1).

To determine the effect of personal variations on classification performance, one of the authors, who is familiar with the system, selected response types for all the requests.

3.1 Analysis and Post Processing

In total, we collected 960 request-response pairs (= 12 requests × 2 cost factors × 40 participants). 24.2% of these requests had an unintelligible semantic role in at least one ASR output, with the vast majority occurring in the OBJECT of the descriptions; 17.9% were ambiguous (i.e., they had more than one reasonable referent); and only 3.8% were inaccurate (i.e., they did not match perfectly any referent).
In order to train both classifiers on the same corpus, we removed requests that don't fit the requirements of the automatic feature-extraction process (Section 6). Specifically, we excluded 62 descriptions (13%) that had more than one prepositional phrase, and 43 descriptions (9%) that could not be processed by our SLU module [Zukerman et al., 2015] (Section 5). As a result, our corpus contains 375 descriptions, which yield a total of 750 requests for both cost settings. The responses to these requests were distributed as follows: 51.9% DO (majority class), 21.6% CHOOSE, 14.1% REPHRASE, and 12.4% CONFIRM.

It is worth noting that the response types chosen for the excluded requests were included in the dataset as features in order to enable us to determine the effect of dialogue history on performance (Sections 4 and 6). Clearly, removing requests disrupts the actual sequence of events, which has an adverse effect on the performance of sequence classifiers (Section 4.2). In the near future, we will address this problem by including a feature set for all the requests in a sequence.

[Figure 1: Household scenes used in our experiments — (a) Positional relations in a room; (b) Colour, size and positional relations on a table; (c) Projective and positional relations on a table; (d) Colour, size and positional relations in a room]

4 Classification with Manually-Tagged Features

Two team members annotated each description obtained from the first experiment with the following features, which are indicative of inaccuracy and ambiguity, and were deemed relevant to a person's decision regarding how to respond to a request (the first annotator labelled the features, and the second annotator verified the annotations; disagreements were resolved by consensus).

1. Unintelligible role – This is the semantic role of a garbled portion of a description, where the possible values are {NONE, ALL, OBJECT, LANDMARK, OTHER}. For example, "the hottest under the table" has an unintelligible OBJECT, and "the green plate on the left of the Blues play" has an unintelligible LANDMARK.

2. # of reasonable interpretations – How many objects are reasonable referents for a description? For instance, the first of the above requests has two reasonable referents in the context of Figure 1(a), as there are only two objects under the table. Similarly, the two green plates on the table in Figure 1(b) are reasonable interpretations for the second request.
   People often compensate for mis-heard utterances by postulating reasonable words that sound similar to what was heard. We take this behaviour into account by splitting this feature into two sub-features: (2a) With phonetic similarity and (2b) Without phonetic similarity. For example, when considering phonetic similarity in the context of Figure 1(b), "blue plate" is a sensible replacement for "Blues play", yielding one reasonable interpretation for the second request (the green plate labeled E).

3. Do the reasonable interpretations include fewer than all the objects in the context? (YES, NO) – This feature indicates how much information can be extracted from a description, e.g., the value of this feature is NO for "the blue plate on the table" in the context of Figure 1(c), since all the objects on the table are reasonable referents for this description. As above, this feature is split into two sub-features: (3a) With phonetic similarity and (3b) Without phonetic similarity.

4. # of perfect interpretations – How many objects match perfectly a description? For example, the two balls in Figure 1(d) match perfectly the description "the ball". Note that the difference between # of reasonable interpretations and # of perfect interpretations indicates the accuracy of a description.

5. Do the perfect interpretations include fewer than all the objects in the context? (YES, NO) – This feature is similar to Feature #3. However, since we are considering only interpretations that match a request perfectly, there is no need to take into account phonetic similarity.
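To make the shape of this dataset concrete, the sketch below shows one possible encoding of a manually-tagged request as a flat record. It is our own illustration rather than the authors' actual data format: all field names, the one-hot treatment of the unintelligible role, and the inclusion of the optional features from Section 4.1 are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

RESPONSE_TYPES = ["DO", "CONFIRM", "CHOOSE", "REPHRASE"]   # order used for three_back counts
ROLES = ["NONE", "ALL", "OBJECT", "LANDMARK", "OTHER"]

@dataclass
class ManualFeatures:
    """One manually-tagged request (hypothetical encoding, not the paper's)."""
    unintelligible_role: str          # feature 1
    n_reasonable_phon: int            # feature 2a (with phonetic similarity)
    n_reasonable_nophon: int          # feature 2b (without phonetic similarity)
    reasonable_lt_all_phon: bool      # feature 3a
    reasonable_lt_all_nophon: bool    # feature 3b
    n_perfect: int                    # feature 4
    perfect_lt_all: bool              # feature 5
    # additional features explored in Section 4.1
    gender: str = "F"
    native_english: bool = True
    three_back: List[int] = field(default_factory=lambda: [0, 0, 0, 0])

    def to_vector(self) -> List[float]:
        """Flatten into a numeric vector; the semantic role is one-hot encoded."""
        vec = [float(self.unintelligible_role == r) for r in ROLES]
        vec += [float(self.n_reasonable_phon), float(self.n_reasonable_nophon),
                float(self.reasonable_lt_all_phon), float(self.reasonable_lt_all_nophon),
                float(self.n_perfect), float(self.perfect_lt_all),
                float(self.gender == "F"), float(self.native_english)]
        vec += [float(c) for c in self.three_back]
        return vec
```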
4.1 Response Classification

We experimented with several classification algorithms, including Naïve Bayes, Support Vector Machines, Decision Trees (DT) and Random Forests (RF), to learn response types from the data collected in our experiments. Here we report on the results obtained with DT and RF, which had the best performance.2 We used the above features to determine baseline performance, and experimented with four additional features: Gender; English nativeness – whether the participant is a native English speaker; 3-Back responses (vector of length 4) – the counts of the response types provided by an Experiment 2 participant for the three preceding requests;3 and Cost – high or low. This feature worsened performance in all cases, and was removed.

2 We used over- and under-sampling to try to deal with the large majority class, but neither affected the classifiers' performance.
3 We experimented with several sequence lengths, of which 3-Back yielded the best results. We also investigated a setting where the counts of the response types chosen for all the other 23 requests were used as features. This setting, which is clearly unfeasible, gave the best results, achieving 0.70 precision and 0.68 recall.

We performed 10-fold cross-validation to evaluate classifier performance; statistical significance was computed using the Wilcoxon signed-rank test. Rows 2-4 in Table 1 display the best results obtained by our classifiers for each feature configuration.

Table 1: Performance with manually-tagged features

  Classifier   Manually-tagged Features                     Precision   Recall
  RF                                                        0.58        0.65
  RF           + Gender & English nat.                      0.62        0.67
  DT           + Gender & English nat. + 3-Back responses   0.63        0.68
  RNN          + entire previous sequence                   0.55        0.62
  RF1P         + 3-Back responses                           0.81        0.82
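For concreteness, the following scikit-learn sketch mirrors the evaluation protocol just described (10-fold cross-validation over a Random Forest and a Decision Tree). The feature matrix, labels and hyper-parameters shown are placeholders, not the paper's actual data or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

def evaluate(clf, X, y, folds=10, seed=0):
    """Return per-fold weighted precision and recall for one classifier."""
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    precision = cross_val_score(clf, X, y, cv=cv, scoring="precision_weighted")
    recall = cross_val_score(clf, X, y, cv=cv, scoring="recall_weighted")
    return precision, recall

# Stand-in data: X is (n_requests, n_features), y holds the response types.
X = np.random.rand(750, 15)
y = np.random.choice(["DO", "CONFIRM", "CHOOSE", "REPHRASE"], size=750)

for name, clf in [("RF", RandomForestClassifier(n_estimators=100, random_state=0)),
                  ("DT", DecisionTreeClassifier(random_state=0))]:
    p, r = evaluate(clf, X, y)
    print(f"{name}: precision={p.mean():.2f} recall={r.mean():.2f}")
```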
RF yielded the best results for the manually-tagged features alone, and for these features plus Gender and English nativeness; while DT produced the best results overall when 3-Back responses were added (statistically significant with p-value=0.05). The most influential features in the decision tree were # of perfect interpretations, # of reasonable interpretations with phonetic similarity, and # of rephrases in 3-Back responses. The per-class performance of DT appears in the second and third columns of Table 5. Note the poor precision and recall obtained for CONFIRM, which was often confused with DO. DT's deficient performance for REPHRASE may be attributed to the fact that requests that had the same features, in particular those with partially or completely unintelligible ASR outputs, elicited different responses from the participants.

As mentioned in Section 3, we also trained and tested the classifiers using response types selected by only one person – the first author. The best performance was achieved with an RF classifier that includes 3-Back responses, denoted RF1P. This performance was much better than that of the classifiers trained with the response types of 40 participants, which indicates that personal attributes affect people's responses.4

4 We tried to address this issue by clustering users based on the number of times they chose each response type, but didn't get good clusters for k < 10.

4.2 Sequence Classification

In order to investigate the influence of a sequence of request-response pairs on future responses, we trained and tested a Recurrent Neural Network (RNN) as a sequence classifier. Our RNN model is based on the Long Short-Term Memory (LSTM) architecture [Hochreiter and Schmidhuber, 1997], which can capture long-range dependencies. If we denote the features of the t-th utterance as f_t, the hidden state of the RNN at time step t+1 is calculated as a function of the input at time step t+1, f_{t+1}, and the previous hidden state, h_t: h_{t+1} = LSTM(h_t, f_{t+1}). With this mechanism, the model maps the sequence of features to a sequence of hidden vectors, which are decoded into a sequence of labels by a linear neural net layer: y_t ∼ softmax(W h_t + b).

A natural extension of this model is to stack the LSTM layers, i.e., the outputs of the first LSTM layer are given as input to the second layer, and so on; our model stacks 15 layers of LSTMs. This model was implemented with Keras [Chollet, 2017] and Theano [Theano Development Team, 2016], and was trained to minimize categorical cross-entropy loss using the Adam SGD learner [Kingma and Ba, 2014].

[Figure 2: RNN for response-type selection]

Owing to time limitations, we performed only 5-fold cross-validation. The results of the RNN appear in the penultimate row of Table 1. The RNN's disappointing performance may be attributed to the relatively small dataset combined with the disruption of several sequences due to the removal of request-response pairs (in order to reduce sequence disruption, we retained the 43 pairs corresponding to descriptions that could not be processed by our SLU system).
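Beyond the 15 stacked LSTM layers, the categorical cross-entropy loss and the Adam optimizer, the paper does not specify the architecture of this sequence classifier, so the Keras sketch below is only an approximate reconstruction; the hidden width, sequence length and feature dimension are assumed values, and the toy data is random.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES, N_CLASSES = 15, 4            # per-request features, response types
SEQ_LEN, N_LAYERS, HIDDEN = 24, 15, 64   # assumed values, not given in the paper

model = keras.Sequential()
model.add(keras.Input(shape=(SEQ_LEN, N_FEATURES)))
# Stack LSTM layers; every layer returns the full sequence so that a label
# can be predicted for each request in the sequence.
for _ in range(N_LAYERS):
    model.add(layers.LSTM(HIDDEN, return_sequences=True))
# Linear layer + softmax applied at every time step: y_t ~ softmax(W h_t + b)
model.add(layers.TimeDistributed(layers.Dense(N_CLASSES, activation="softmax")))
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Toy data: one sequence of requests per participant, one-hot response labels.
X = np.random.rand(40, SEQ_LEN, N_FEATURES)
Y = keras.utils.to_categorical(np.random.randint(N_CLASSES, size=(40, SEQ_LEN)))
model.fit(X, Y, epochs=2, batch_size=8, verbose=0)
```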
5 The SLU System Scusi?

Scusi? [Zukerman et al., 2015] is a system that implements an anytime, numerical mechanism for the interpretation of spoken descriptions, focusing on a household context. It has four processing stages, where (intermediate) interpretations in each stage can have multiple parents in the previous stage, and can produce multiple children in the next stage; early processing stages may be probabilistically revisited; and only the most promising options in each stage are explored further.

Scusi?'s workflow – The system takes as input a speech signal, and uses an ASR to produce candidate texts. Each text is assigned a score given the speech wave, and passed to an error-detection module that postulates which words were correctly or wrongly recognized by the ASR [Zukerman and Partovi, 2017] — this component is required, as in real life we don't have access to transcriptions. Next, Scusi? applies Charniak's probabilistic parser (bllip.cs.brown.edu/resources.shtml#software) to syntactically analyze the texts, yielding at most 50 parse trees per text. The third stage applies mapping rules to the parse trees to generate Uninstantiated Concept Graphs (UCGs) [Sowa, 1984] that represent the semantics of the descriptions. The final stage instantiates the UCGs with objects and relations from the current context, and returns candidate Instantiated Concept Graphs (ICGs) ranked in descending order of merit (score).

[Figure 3: Scusi?'s workflow and UCG-to-ICG relations — speech wave; ASR output 1: "the brown stool near the table"; ASR output 2: "the blown store near the table"; UCG-1 and UCG-2 (OBJECT stool/store, relation near, LANDMARK table); ICG-1 and ICG-2 (OBJECT stool-L/stool-2, Location_near, LANDMARK table-1)]

Figure 3 illustrates this process for the description "the brown stool near the table" in the context of Figure 1(d). All stages produce several outputs, but we show only two outputs for each of three stages (ASR, UCG and ICG). In addition, in this example, both UCGs are parents of the two ICGs, but only the match with ICG-1 is shown in Figure 3. The first ASR output is correct, and the second has "blown store" instead of "brown stool". Each of these outputs yields one UCG (via a parse tree), where the object in the second UCG has an unknown attribute, as Scusi? doesn't recognize the modifier "blown" (unknown attributes occur when a user employs out-of-vocabulary noun modifiers or the ASR mis-recognizes noun modifiers).

The score of each ICG depends on two factors: (1) how well the concepts and relations in it match the corresponding concepts and relations in its parent UCGs, and (2) how well the relations in the ICG match the context. For example, ICG-1 matches UCG-1 well, as stool-L can be called "stool" and it is brown, and table-1 can be called "table"; but its match-score with UCG-2 is lower, as stool-L cannot be called "store" and doesn't match the unknown attribute specified in UCG-2. ICG-1 matches the context well, as stool-L is near table-1. The details of the calculation of the scores are described in [Zukerman et al., 2015]. The aspects that are most relevant to this paper are that scores are represented on a logarithmic scale in order to avoid underflow, and scores of value 0 are smoothed to a low value ε in order not to invalidate any interpretation.
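A minimal sketch of these scoring conventions (log-scale scores, with zero values smoothed to a small ε) might look as follows; the ε value and the way the two factors are combined are our assumptions, since the actual calculation is given in [Zukerman et al., 2015].

```python
import math

EPSILON = 1e-6  # placeholder for the paper's low smoothing value

def log_score(p: float) -> float:
    """Represent a probability-like score on a log scale, smoothing zeros
    to a small epsilon so that no interpretation is invalidated."""
    return math.log(max(p, EPSILON))

def icg_score(ucg_match: float, context_match: float) -> float:
    """Combine the two factors of an ICG's score in log space (a sketch only;
    the real combination is described in [Zukerman et al., 2015])."""
    return log_score(ucg_match) + log_score(context_match)
```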
6 Classification with Automatically-Extracted Features

We automatically extracted features from the top-10 ICGs generated by Scusi? for each description (the correct interpretation is in the top-10 ICGs in about 90% of the cases) — these features appear in Tables 2 and 3. The features in Table 2, extracted from the output of Scusi?'s word-error detector, pertain to the intelligibility of the descriptions. The second and third feature in Table 2 are among the most influential ones.5 The last feature is noteworthy because, even though only one ASR output is correct, the error-detection component may decide that several ASR outputs are correct, e.g., "the flower on the table" and "the flour on the table".

5 The frequency of features in the top-two levels of 100 trees generated by RF was used as a proxy for their importance.

Table 2: Features obtained from the word-error detector

  Is there an ASR output with all correct words?
  % of wrong words in the top ASR output
  % of wrong words in all ASR outputs
  % of ASR outputs with all correct words

Table 3: Features extracted from the top-10 ICGs

  Number of top-ranked ICGs with similar scores (×1)
  Location match score between an ICG and the context (×10)
  Per-node features for an ICG in relation to its parent UCGs:
    Best colour-match score for a content node (×20)
    Best size-match score for a content node (×20)
    Maximum # of unknowns for a content node (×20)
    For a content node, % of UCG parents with corresponding node
      • with a colour match for this node (×20)
      • with a size match for this node (×20)
      • that have unknowns (×20)
    For a node, % of UCG parents with corresponding node
      • that lexically match this node (×30)

The first feature in Table 3 represents the ambiguity of a description through the similarity between the scores of successive top-ranked ICGs, which is encoded as the ratio between the (logarithmic) score of the i+1-th ICG and the score of the i-th ICG. When this ratio between neighbouring ICGs is below an empirically-derived threshold, they are deemed similar. This feature is among the most influential ones.
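One plausible way to compute this first feature is sketched below: it counts how long the leading run of top-ranked ICGs stays "similar", using the ratio of consecutive logarithmic scores. The threshold value is a placeholder for the paper's empirically derived one, and the leading-run reading of the feature is our assumption.

```python
def n_top_icgs_with_similar_scores(log_scores, threshold=1.05):
    """Number of top-ranked ICGs with similar scores (sketch).

    log_scores: ICG scores on a logarithmic scale, in descending order of
    merit (e.g., log-probabilities, so the values are negative). Consecutive
    ICGs are deemed similar when the ratio of the (i+1)-th score to the
    i-th score falls below the threshold.
    """
    count = 1  # the top-ranked ICG always counts
    for prev, curr in zip(log_scores, log_scores[1:]):
        if curr / prev < threshold:
            count += 1
        else:
            break
    return count

# e.g. n_top_icgs_with_similar_scores([-2.1, -2.2, -4.0]) returns 2
```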
The remaining features in Table 3 pertain to the accuracy of a description, which is represented by the goodness of the match between an ICG and its parent UCGs, and between an ICG and the context. The second feature, which represents the accuracy of the location specified in a description, is among the most influential ones (for ICGs ranked 4th, 6th and 9th).

As seen in Figure 3, content nodes (objects and landmarks) in UCGs may have colour and size descriptors, as well as unknown attributes. The first six per-node features in Table 3 represent the goodness of attribute matches between the content nodes (object and landmark) of an ICG and the corresponding nodes in its parent UCGs. Two size-match features, one colour-match feature and one unknown feature for objects of ICGs at various ranks are among the most influential features.

The last row in Table 3 represents the goodness of lexical matches between the nodes in an ICG and the corresponding nodes in its parent UCGs. This feature is among the most influential for the objects of most of the top-10 ICGs.

To illustrate these features, let's return to the UCG-ICG matches in Figure 3 for the request "move the brown stool near the table" in the context of Figure 1(d). The score of the top-ranked ICG, viz. ICG-1, is significantly higher than that of ICG-2. Hence, the value of the first feature in Table 3 is 1. As mentioned above, stool-L is near table-1, yielding a high location match score for ICG-1. 50% of the UCG parents have a lexical match with the object in ICG-1, as "store" doesn't match any designation of stool-L; but 100% of the UCG parents have a lexical match with the landmark in ICG-1 (table-1). Due to the unknown attribute in the object of UCG-2, the maximum number of unknowns for the ICG-1 object is 1, and the percentage of UCG parents that have unknowns for the ICG-1 object is 50%; while 0% of UCG parents have unknowns for the ICG-1 landmark. Since the colour specified in UCG-1 matches the colour of stool-L, the maximum colour match for the object of ICG-1 is 1, but the percentage of UCG parents with a colour match for the ICG-1 object is 50%, as UCG-2 doesn't have a colour attribute.

6.1 Response Classification

We experimented with the classifiers considered in Section 4.1, except the RNN, using the 165 features described in Tables 2 and 3, instead of the manually-obtained ones.6 The RNN was omitted due to the above-described removal of requests, which disrupts the sequence. As before, we performed 10-fold cross-validation.

6 Applying Principal Components Analysis to reduce the number of features had no effect on the classifiers' performance.

Table 4 displays our results. The classifier with the best performance for a particular configuration of manually-tagged features also had the best performance for the corresponding configuration of automatically-extracted features. Surprisingly, overall performance with these features was significantly better (with p-value=0.01) than the performance obtained with the manually-tagged features, both for the responses given by 40 participants and for the responses provided by one person. In the former case, 3-Back responses had an adverse effect on performance, and in the latter case, it had no effect. The best performance for the 40-participant dataset was obtained with RF plus Gender and English nativeness, but the differences between the classifiers were not statistically significant. The per-class performance of this classifier appears in the fourth and fifth columns of Table 5. As for the manually-tagged features, the worst precision and recall were obtained for CONFIRM, but the performance for REPHRASE was only slightly worse than for the other classes.

Table 4: Performance with automatically-extracted features

  Classifier   Automatically-extracted Features             Precision   Recall
  RF                                                        0.73        0.74
  RF           + Gender & English nat.                      0.74        0.74
  DT           + Gender & English nat. + 3-Back responses   0.72        0.72
  RF1P                                                      0.93        0.92

Table 5: Per-class performance of the best classifier for manually- and automatically-extracted features

               Manually-tagged Features    Automatically-extracted Features
  Class        Precision    Recall         Precision    Recall
  DO           0.72         0.93           0.82         0.83
  CONFIRM      0.28         0.10           0.42         0.38
  CHOOSE       0.70         0.64           0.74         0.76
  REPHRASE     0.54         0.31           0.70         0.70
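The significance comparisons reported in this section can be reproduced with SciPy's Wilcoxon signed-rank test over paired per-fold results, roughly as follows; the fold scores below are placeholder numbers, not the paper's.

```python
from scipy.stats import wilcoxon

# Paired per-fold recall of the two feature sets (placeholder values):
# one entry per cross-validation fold, with the same folds for both.
manual_folds    = [0.66, 0.64, 0.69, 0.63, 0.67, 0.65, 0.66, 0.64, 0.68, 0.66]
automatic_folds = [0.75, 0.72, 0.76, 0.71, 0.74, 0.73, 0.75, 0.72, 0.77, 0.74]

stat, p_value = wilcoxon(manual_folds, automatic_folds)
print(f"Wilcoxon statistic={stat:.2f}, p-value={p_value:.3f}")
```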
7 Conclusion and Future Work

We have offered a corpus comprising requests for objects in physical spaces, and the responses given by people for these requests. We generated two datasets based on this corpus: a manually-tagged dataset, and a dataset which includes features that are automatically extracted from the output of an SLU module. These datasets were used in a classification-based approach for generating responses to spoken requests.

Our results show that, surprisingly, classifiers trained on the second dataset outperformed those trained on the first. As mentioned in Section 4, analysis of the data reveals that different users often provide different responses for requests that have identical manually-tagged features. For instance, three participants who were shown the following ASR outputs responded with DO, CONFIRM and REPHRASE (the option chosen by our classifier): (1) "get a blade in the rights of the disabled", (2) "get I played in the rights of the disabled", (3) "get I played in the right of the devil", and (4) "get a blade in the right of the devil". This discrepancy may be partially due to a mixture of individual ability to compensate for mis-heard utterances combined with risk-taking attitude — traits that may be related to the English nativeness and Gender features respectively, which improve performance. In light of this, we posit that additional features that reflect personal disposition could yield further improvements. This notion is reinforced by the significantly better classification performance for the responses obtained from a single user (albeit one familiar with the system) compared with the performance for the responses of 40 participants.

A complementary explanation for the worse classification performance obtained for the manually-tagged dataset is that this dataset encodes intelligibility, ambiguity and accuracy of descriptions in a general way, while the specific information encoded in the automatically-extracted dataset (i.e., lexical, colour, size and location match for each of the top-10 ICGs) is important for classification. The only aspect where the manual encoding is more informative than the automatic encoding pertains to phonetic similarity, which is one of the most influential features for this dataset. In the future, we will incorporate specific features about lexical, colour, size and location match and out-of-vocabulary words into the manually-generated tags, and phonetic similarity into the automatically-extracted features.

In terms of dialogue history, our results are inconclusive. Our hypothesis that dialogue history affects users' choices was confirmed (for three preceding requests) for the manually-tagged requests, but not for the automatically-tagged ones.

Finally, as noted in [Inouye and Biermann, 2005; Singh et al., 2002], users may be satisfied with responses that differ from those provided by human consultants. To test this idea, we propose to conduct a follow-up experiment, where participants will be asked to rate the suitability of responses generated by our best classifier.

Acknowledgements

This research was supported in part by grant DP120100103 from the Australian Research Council.
References

[Bahdanau et al., 2016] D. Bahdanau, J. Chorowski, D. Serdyuk, and Y. Bengio. End-to-end attention-based large vocabulary speech recognition. In ICASSP'2016 – Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4945–4949, Shanghai, China, 2016.

[Carlson, 1983] L. Carlson. Dialogue Games: An Approach to Discourse Analysis. D. Reidel Publishing Company, Dordrecht, Holland, Boston, 1983.

[Chollet, 2017] F. Chollet. Keras. https://github.com/fchollet/keras, 2017.

[Chorowski et al., 2015] J.K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. In C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 577–585. Curran Associates, Inc., 2015.

[Dhingra et al., 2017] B. Dhingra, L. Li, X. Li, J. Gao, Y.N. Chen, F. Ahmed, and L. Deng. Towards end-to-end reinforcement learning of dialogue agents for information access. In ACL'17 – Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, 2017.

[Funakoshi et al., 2012] K. Funakoshi, M. Nakano, T. Tokunaga, and R. Iida. A unified probabilistic approach to referring expressions. In SIGDIAL'2012 – Proceedings of the 13th SIGdial Meeting on Discourse and Dialogue, pages 237–246, Seoul, South Korea, 2012.

[Gašić and Young, 2014] M. Gašić and S.J. Young. Gaussian processes for POMDP-based dialogue manager optimization. IEEE/ACM Transactions on Audio, Speech & Language Processing, 22(1):28–40, 2014.

[Hochreiter and Schmidhuber, 1997] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Horvitz et al., 2003] E. Horvitz, C. Kadie, T. Paek, and D. Hovel. Models of attention in computing and communication: From principles to applications. Communications of the ACM, 46(3):52–57, 2003.

[Inouye and Biermann, 2005] B. Inouye and A. Biermann. An algorithm that continuously seeks minimum length dialogs. In Proceedings of the 4th IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, pages 62–67, Edinburgh, Scotland, 2005.

[Kingma and Ba, 2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Lange and Suendermann-Oeft, 2014] P. Lange and D. Suendermann-Oeft. Tuning Sphinx to outperform Google's speech recognition API. In ESSV2014 – Proceedings of the Conference on Electronic Speech Signal Processing, Dresden, Germany, 2014.

[Lemon, 2010] O. Lemon. Learning what to say and how to say it: Joint optimisation of spoken dialogue management and natural language generation. Computer Speech and Language, 25(2):210–221, 2010.

[Li et al., 2016] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. In EMNLP2016 – Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Austin, Texas, 2016.

[Liao et al., 2006] W. Liao, W. Zhang, Z. Zhu, Q. Ji, and W.D. Gray. Toward a decision-theoretic framework for affect recognition and user assistance. International Journal of Human-Computer Studies, 64:847–873, 2006.

[Mesnil et al., 2015] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D.Z. Hakkani-Tür, X. He, L.P. Heck, G. Tur, D. Yu, and G. Zweig. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech & Language Processing, 23(3):530–539, 2015.

[Moratz and Tenbrink, 2006] R. Moratz and T. Tenbrink. Spatial reference in linguistic human-robot interaction: Iterative, empirically supported development of a model of projective relations. Spatial Cognition & Computation: An Interdisciplinary Journal, 6(1):63–107, 2006.

[Mrkšic et al., 2016] N. Mrkšic, Ó.S. Diarmuid, T.H. Wen, B. Thomson, and S.J. Young. Neural belief tracker: Data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777v1, 2016.

[Paek and Horvitz, 2000] T. Paek and E. Horvitz. Conversation as action under uncertainty. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 455–464, Stanford, California, 2000.

[Prakash et al., 2016] A. Prakash, C. Brockett, and P. Agrawal. Emulating human conversations using convolutional neural network-based IR. In Proceedings of the Neu-IR16 SIGIR Workshop on Neural Information Retrieval, Pisa, Italy, 2016.
[Serban et al., 2016] I.V. Serban, T. Klinger, G. Tesauro, K. Talamadupula, B. Zhou, Y. Bengio, and A. Courville. Multiresolution recurrent neural networks: An application to dialogue response generation. arXiv preprint arXiv:1606.00776v1, 2016.

[Singh et al., 2002] S. Singh, D. Litman, M. Kearns, and M. Walker. Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research, 16:105–133, 2002.

[Sowa, 1984] J.F. Sowa. Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Reading, MA, 1984.

[Sugiura et al., 2009] K. Sugiura, N. Iwahashi, H. Kashioka, and S. Nakamura. Bayesian learning of confidence measure function for generation of utterances and motions in object manipulation dialogue task. In Proceedings of Interspeech 2009, pages 2483–2486, Brighton, United Kingdom, 2009.

[Theano Development Team, 2016] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, 2016.

[Trafton et al., 2005] J.G. Trafton, N.L. Cassimatis, M.D. Bugajska, D.P. Brock, F.E. Mintz, and A.C. Schultz. Enabling effective human-robot interaction using perspective-taking in robots. IEEE Transactions on Systems, Man and Cybernetics – Part A: Systems and Humans, 35(4):460–470, 2005.

[Wen et al., 2015] T.H. Wen, M. Gašić, N. Mrkšic, P. Hao Su, D. Vandyke, and S.J. Young. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In EMNLP2015 – Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1711–1721, Lisbon, Portugal, 2015.

[Williams and Young, 2007] J.D. Williams and S. Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2):393–422, 2007.

[Williams and Zweig, 2016] J.D. Williams and G. Zweig. End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning. arXiv preprint arXiv:1606.01269, 2016.

[Yang et al., 2016] X. Yang, Y.N. Chen, D. Hakkani-Tür, P. Gao, and L. Deng. End-to-end joint learning of natural language understanding and dialogue manager. arXiv preprint arXiv:1612.00913v1, 2016.

[Young et al., 2007] S. Young, J. Schatzmann, B. Thomson, K. Weilhammer, and H. Ye. The hidden information state dialogue manager: A real-world POMDP-based system. In NAACL-HLT 2007 – Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics, Demonstration Program, pages 27–28, Rochester, New York, 2007.

[Young et al., 2013] S.J. Young, M. Gašić, B. Thomson, and J. Williams. POMDP-based statistical spoken dialogue systems: a review. Proceedings of the IEEE, 101(5):1160–1179, 2013.

[Zhao and Eskenazi, 2016] T. Zhao and M. Eskenazi. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In SIGDIAL'2016 – Proceedings of the 17th SIGdial Meeting on Discourse and Dialogue, pages 1–10, Los Angeles, California, 2016.

[Zukerman and Partovi, 2017] I. Zukerman and A. Partovi. Improving the understanding of spoken referring expressions through syntactic-semantic and contextual-phonetic error correction. Computer Speech and Language, 2017.

[Zukerman et al., 2015] I. Zukerman, S.N. Kim, Th. Kleinbauer, and M. Moshtaghi. Employing distance-based semantics to interpret spoken referring expressions. Computer Speech and Language, pages 154–185, 2015.