 Towards a Response Selection System for Spoken Requests in a Physical Domain

                         Andisheh Partovi, Ingrid Zukerman, Quan Tran
                        Faculty of Information Technology, Monash University
                              Clayton, VICTORIA 3800, AUSTRALIA
                 {andi.partovi,ingrid.zukerman,quan.tran}@monash.edu


                         Abstract

    In this paper, we introduce a corpus comprising requests for objects in physical spaces, and responses given by people to these requests. We generated two datasets based on this corpus: a manually-tagged dataset, and a dataset which includes features that are automatically extracted from the output of a Spoken Language Understanding module. These datasets are used in a classification-based approach for generating responses to spoken requests. Our results show that, surprisingly, classifiers trained on the second dataset outperform those trained on the first, and produce acceptable levels of performance.

1   Introduction

In recent times, there have been significant improvements in Automatic Speech Recognition (ASR) [Chorowski et al., 2015; Bahdanau et al., 2016]. For example, a research prototype of a spoken slot-filling dialogue system reported a Word Error Rate (WER) of 13.8% when using "a generic dictation ASR system" [Mesnil et al., 2015], and Google reported an 8% WER for its ASR API.1 However, this API had a WER of 54.6% when applied to the Let's Go corpus [Lange and Suendermann-Oeft, 2014].

1 venturebeat.com/2015/05/28/google-says-its-speech-recognition-technology-now-has-only-an-8-word-error-rate.

   ASR errors not only produce wrongly recognized entities or actions, but may also yield ungrammatical utterances that cannot be processed by a Spoken Language Understanding (SLU) system (e.g., "the plate inside the microwave" being mis-heard as "of plating sight the microwave"), or yield incorrect results when processed by an SLU system (e.g., due to fillers such as "hmm" being mis-heard as "and" or "on").

   The problems caused by ASR errors are exacerbated by the fact that people often express themselves ambiguously or inaccurately [Trafton et al., 2005; Moratz and Tenbrink, 2006; Funakoshi et al., 2012; Zukerman et al., 2015]. An ambiguous reference to an object matches several objects well, while an inaccurate reference matches one or more objects partially. For instance, in a household domain, a reference to a "big blue mug" is ambiguous if there is more than one big blue mug in the room, and inaccurate if there are two mugs in the room, one big and red, and one small and blue. Further, ambiguous or inaccurate references may occur as a result of differences in parse trees (e.g., due to variants in prepositional attachments).

   In addition to improving ASR and SLU modules, Spoken Dialogue Systems (SDSs) must be able to cope with these problems by generating appropriate responses to users' spoken utterances. Recently, deep-learning algorithms have been used for response generation [Serban et al., 2016; Yang et al., 2016]. However, these algorithms rely solely on requests and responses, without taking into account the (extra-linguistic) context, and typically require large amounts of data, which may not be available in some applications. In this paper, we offer a supervised-learning approach to response generation that is suitable for smaller datasets. Our approach harnesses the properties of utterances, dialogue history and context to choose response types for users' requests.

   To obtain an upper bound for classifier performance, we trained a classifier using human-observable features of spoken requests and response types selected by participants for these requests. We then trained a second classifier using features that were automatically extracted from the output produced by our SLU system (Section 5). Surprisingly, the second classifier produced significantly better results than the first one.

   The rest of this paper is organized as follows. In the next section, we discuss related work. Our corpus is described in Section 3. In Section 4, we detail the human-observable features and the response-classification results obtained with them. We then offer a brief account of our SLU system, followed by a description of the features that are automatically extracted from its output and the resultant classification performance. Concluding remarks appear in Section 7.

2   Related Work

Decision-theoretic approaches have been the accepted standard for response generation in dialogue systems for some time [Carlson, 1983]. These approaches were initially implemented in SDSs in the form of Influence Diagrams that make myopic (one-shot) decisions regarding dialogue acts [Paek and Horvitz, 2000], procedures that optimize responses [Inouye and Biermann, 2005; Sugiura et al., 2009], and Dynamic Decision Networks that make decisions about dialogue acts over time [Horvitz et al., 2003; Liao et al., 2006].




Later on, reinforcement learning was employed to learn optimal policies over time [Lemon, 2010], with particular attention being paid to Partially Observable Markov Decision Processes [Williams and Young, 2007; Young et al., 2013; Gašić and Young, 2014], and their extension Hidden Information State [Young et al., 2007; Young et al., 2013]. Owing to the complexity of these formalisms, they have been used mainly in slot-filling applications, e.g., making airline and restaurant reservations [Young et al., 2013].

   Recently, deep learning has been applied to various aspects of SDSs [Wen et al., 2015; Li et al., 2016; Mrkšic et al., 2016; Prakash et al., 2016; Serban et al., 2016; Yang et al., 2016]. Wen et al. [2015] focused on the generation of linguistically varied responses, and Mrkšic et al. [2016] proposed a dialogue-state tracking framework. The generation of dialogue contributions was studied in [Li et al., 2016; Prakash et al., 2016] for chatbots; in [Serban et al., 2016] for help-desk responses and Twitter follow-up statements; and in [Yang et al., 2016] for slot tagging, and user-intent and system-action prediction in slot-filling applications. A combination of deep learning and reinforcement learning has been used in end-to-end dialogue systems that query a knowledge base, where user utterances are mapped to a clarification question or a knowledge-base query [Williams and Zweig, 2016; Zhao and Eskenazi, 2016; Dhingra et al., 2017]. All these systems learn to generate complete responses from large corpora comprising request-response pairs.

   Our work follows this supervised-learning trend in a setting where the appropriateness of a response depends both on the request and on the physical context. Further, our dataset is significantly smaller than those used by neural mechanisms.

3   The Corpus

Our corpus, which was gathered in two experiments, comprises requests to fetch or move household objects, and responses to these requests.

Experiment 1 – This experiment replicates the experiment described in [Zukerman et al., 2015] using the Google ASR API instead of Microsoft Speech SDK 6.1 — the WER of the Google API was 13% for our corpus. 35 participants were asked to describe 12 designated objects (labelled A to L) in the scenarios depicted in Figure 1. Each scenario contains between 8 and 16 household objects varying in size, colour and position. The participants were allowed to repeat a description up to two times. In total, they recorded 478 descriptions such as the following: "the computer under the table", "the picture on the wall", "the green plate next to the screwdriver at the top of the table", "the plate in the corner of the table", and "the large pink ball in the middle of the room".

Experiment 2 – This experiment took the form of an online survey where participants had to indicate how they would respond to a (potentially mis-heard) request. Each participant was shown the top four ASR outputs for the request versions of 12 descriptions generated by one participant in the first experiment, along with the images in Figure 1. For instance, "the pot plant on the table", uttered in the context of Figure 1(a), was converted to "get the pot plant on the table"; and "the green bookcase", uttered in the context of Figure 1(d), was presented as "move the green bookcase".

   Each participant was then asked to choose a response for each request from the following four response types (participants were given a description of each response type):
   • DO: suitable when the addressee is sure about which object the request refers to.
   • CONFIRM: suitable when the addressee feels the need to confirm the requested object before taking action.
   • CHOOSE: suitable when the addressee hesitates between several objects.
   • REPHRASE: suitable when part or all of a request is so unintelligible that the addressee cannot understand it.

   These choices were made under two cost settings: low-cost – where participants were told that the requested object must be delivered to someone in the same room; and high-cost – where they were told that the object must be delivered to a far-away location. These settings were designed to discriminate between situations where mistakes are fairly inconsequential and situations where mistakes are costly.

   40 people took part in this experiment (six of them also participated in the first experiment); half of the participants were native English speakers, and half were male. Thirteen people participated in an initial version of the experiment where they first chose response types for all the requests under the low-cost setting, and then chose response types for the same requests under the high-cost setting. We modified the experiment on the basis of the participants' feedback, so that the remaining 27 participants considered each request under the low-cost setting, and were immediately asked how their response would differ under the high-cost setting. This experimental variation had no effect on response-classification performance (Sections 4.1 and 6.1).

   To determine the effect of personal variations on classification performance, one of the authors, who is familiar with the system, selected response types for all the requests.

3.1   Analysis and Post Processing

In total, we collected 960 request-response pairs (= 12 requests × 2 cost factors × 40 participants). 24.2% of these requests had an unintelligible semantic role in at least one ASR output, with the vast majority occurring in the OBJECT of the descriptions; 17.9% were ambiguous (i.e., they had more than one reasonable referent); and only 3.8% were inaccurate (i.e., they did not perfectly match any referent).

   In order to train both classifiers on the same corpus, we removed requests that don't fit the requirements of the automatic feature-extraction process (Section 6). Specifically, we excluded 62 descriptions (13%) that had more than one prepositional phrase, and 43 descriptions (9%) that could not be processed by our SLU module [Zukerman et al., 2015] (Section 5). As a result, our corpus contains 375 descriptions, which yield a total of 750 requests over both cost settings. The responses to these requests were distributed as follows: 51.9% DO (majority class), 21.6% CHOOSE, 14.1% REPHRASE, and 12.4% CONFIRM.

   It is worth noting that the response types chosen for the excluded requests were included in the dataset as features in order to enable us to determine the effect of dialogue history on performance (Sections 4 and 6). Clearly, removing requests disrupts the actual sequence of events, which has an adverse effect on the performance of sequence classifiers (Section 4.2). In the near future, we will address this problem by including a feature set for all the requests in a sequence.




[Figure 1: Household scenes used in our experiments. (a) Positional relations in a room; (b) colour, size and positional relations on a table; (c) projective and positional relations on a table; (d) colour, size and positional relations in a room.]

4   Classification with Manually-Tagged Features

Two team members annotated each description obtained from the first experiment with the following features, which are indicative of inaccuracy and ambiguity, and were deemed relevant to a person's decision regarding how to respond to a request (the first annotator labelled the features, and the second annotator verified the annotations; disagreements were resolved by consensus). An illustrative encoding of this feature set appears after the list.
1. Unintelligible role – This is the semantic role of a garbled portion of a description, where the possible values are {NONE, ALL, OBJECT, LANDMARK, OTHER}. For example, "the hottest under the table" has an unintelligible OBJECT, and "the green plate on the left of the Blues play" has an unintelligible LANDMARK.
2. # of reasonable interpretations – How many objects are reasonable referents for a description? For instance, the first of the above requests has two reasonable referents in the context of Figure 1(a), as there are only two objects under the table. Similarly, the two green plates on the table in Figure 1(b) are reasonable interpretations for the second request.
   People often compensate for mis-heard utterances by postulating reasonable words that sound similar to what was heard. We take this behaviour into account by splitting this feature into two sub-features: (2a) With phonetic similarity and (2b) Without phonetic similarity. For example, when considering phonetic similarity in the context of Figure 1(b), "blue plate" is a sensible replacement for "Blues play", yielding one reasonable interpretation for the second request (the green plate labeled E).
3. Do the reasonable interpretations include fewer than all the objects in the context? (YES, NO) – This feature indicates how much information can be extracted from a description; e.g., the value of this feature is NO for "the blue plate on the table" in the context of Figure 1(c), since all the objects on the table are reasonable referents for this description. As above, this feature is split into two sub-features: (3a) With phonetic similarity and (3b) Without phonetic similarity.
4. # of perfect interpretations – How many objects perfectly match a description? For example, the two balls in Figure 1(d) perfectly match the description "the ball". Note that the difference between # of reasonable interpretations and # of perfect interpretations indicates the accuracy of a description.
5. Do the perfect interpretations include fewer than all the objects in the context? (YES, NO) – This feature is similar to Feature #3. However, since we are considering only interpretations that match a request perfectly, there is no need to take into account phonetic similarity.
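To make the annotation scheme concrete, one annotated request may be encoded as the record sketched below. This is our illustration only: the field names, types and the example values are not taken from the annotation tool.

    # Illustrative encoding of the manually-tagged features of one request
    # (field names are our own, not those of the annotation tool).
    from dataclasses import dataclass
    from enum import Enum

    class Role(Enum):
        NONE = "none"
        ALL = "all"
        OBJECT = "object"
        LANDMARK = "landmark"
        OTHER = "other"

    @dataclass
    class ManualFeatures:
        unintelligible_role: Role           # Feature 1
        reasonable_with_phon: int           # Feature 2a
        reasonable_without_phon: int        # Feature 2b
        fewer_than_all_with_phon: bool      # Feature 3a
        fewer_than_all_without_phon: bool   # Feature 3b
        perfect: int                        # Feature 4
        fewer_than_all_perfect: bool        # Feature 5

    # "the hottest under the table" in the context of Figure 1(a) might be
    # encoded as follows (hypothetical values for illustration):
    example = ManualFeatures(Role.OBJECT, 2, 2, True, True, 0, True)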




4.1   Response Classification

We experimented with several classification algorithms, including Naïve Bayes, Support Vector Machines, Decision Trees (DT) and Random Forests (RF), to learn response types from the data collected in our experiments. Here we report on the results obtained with DT and RF, which had the best performance.2 We used the above features to determine baseline performance, and experimented with four additional features: Gender; English nativeness – whether the participant is a native English speaker; 3-Back responses (a vector of length 4) – the counts of the response types provided by an Experiment 2 participant for the three preceding requests;3 and Cost – high or low. The Cost feature worsened performance in all cases, and was removed.

2 We used over- and under-sampling to try to deal with the large majority class, but neither affected the classifiers' performance.
3 We experimented with several sequence lengths, of which 3-Back yielded the best results. We also investigated a setting where the counts of the response types chosen for all the other 23 requests were used as features. This setting, which is clearly unfeasible, gave the best results, achieving 0.70 precision and 0.68 recall.

   We performed 10-fold cross-validation to evaluate classifier performance; statistical significance was computed using the Wilcoxon signed-rank test. Rows 2-4 in Table 1 display the best results obtained by our classifiers for each feature configuration.

Table 1: Performance with manually-tagged features
   Classifier   Features                                            Precision   Recall
   RF           manually-tagged                                     0.58        0.65
   RF           + Gender & English nativeness                       0.62        0.67
   DT           + Gender & English nativeness + 3-Back responses    0.63        0.68
   RNN          + entire previous sequence                          0.55        0.62
   RF1P         + 3-Back responses                                  0.81        0.82

   RF yielded the best results for the manually-tagged features alone, and for these features plus Gender and English nativeness; DT produced the best results overall when 3-Back responses were added (statistically significant with p-value=0.05). The most influential features in the decision tree were # of perfect interpretations, # of reasonable interpretations with phonetic similarity, and # of rephrases in 3-Back responses. The per-class performance of DT appears in the second and third columns of Table 5. Note the poor precision and recall obtained for CONFIRM, which was often confused with DO. DT's deficient performance for REPHRASE may be attributed to the fact that requests that had the same features, in particular those with partially or completely unintelligible ASR outputs, elicited different responses from the participants.

   As mentioned in Section 3, we also trained and tested the classifiers using response types selected by only one person – the first author. The best performance was achieved with an RF classifier that includes 3-Back responses, denoted RF1P. This performance was much better than that of the classifiers trained with the response types of 40 participants, which indicates that personal attributes affect people's responses.4

4 We tried to address this issue by clustering users based on the number of times they chose each response type, but didn't get good clusters for k < 10.
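For concreteness, the evaluation setup may be sketched in scikit-learn as follows. This is an illustration rather than our experimental code; the file name and column layout are assumptions.

    # Minimal sketch of the 10-fold cross-validation setup (illustrative).
    # Assumes one row per request, with numerically encoded features and a
    # "response" column holding DO / CONFIRM / CHOOSE / REPHRASE.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_validate

    df = pd.read_csv("requests.csv")  # hypothetical file name
    X, y = df.drop(columns=["response"]), df["response"]

    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_validate(rf, X, y, cv=10,
                            scoring=["precision_weighted", "recall_weighted"])
    print("precision:", scores["test_precision_weighted"].mean())
    print("recall:   ", scores["test_recall_weighted"].mean())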
4.2   Sequence Classification

In order to investigate the influence of a sequence of request-response pairs on future responses, we trained and tested a Recurrent Neural Network (RNN) as a sequence classifier.

   Our RNN model is based on the Long Short-Term Memory (LSTM) architecture [Hochreiter and Schmidhuber, 1997], which can capture long-range dependencies. If we denote the features of the t-th utterance as f_t, the hidden state of the RNN at time step t+1 is calculated as a function of the input at time step t+1, f_{t+1}, and the previous hidden state, h_t: h_{t+1} = LSTM(h_t, f_{t+1}). With this mechanism, the model maps the sequence of features to a sequence of hidden vectors, which are decoded into a sequence of labels by a linear neural-net layer: y_t ~ softmax(W h_t + b).

   A natural extension of this model is to stack the LSTM layers, i.e., the outputs of the first LSTM layer are given as input to the second layer, and so on; our model stacks 15 layers of LSTMs. This model was implemented with Keras [Chollet, 2017] and Theano [Theano Development Team, 2016], and was trained to minimize categorical cross-entropy loss using the Adam SGD learner [Kingma and Ba, 2014].

[Figure 2: RNN for response-type selection]

   Owing to time limitations, we performed only 5-fold cross-validation. The results of the RNN appear in the penultimate row of Table 1. The RNN's disappointing performance may be attributed to the relatively small dataset combined with the disruption of several sequences due to the removal of request-response pairs (in order to reduce sequence disruption, we retained the 43 pairs corresponding to descriptions that could not be processed by our SLU system).
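A hedged sketch of such a stacked-LSTM sequence labeller follows, written against the current Keras API rather than the Keras/Theano versions used in our experiments; the layer width, feature-vector size and sequence length are assumptions.

    # Stacked-LSTM sequence labeller in the spirit of Figure 2 (illustrative).
    from tensorflow import keras
    from tensorflow.keras import layers

    NUM_FEATURES = 13  # assumed size of the per-request feature vector
    NUM_CLASSES = 4    # DO, CONFIRM, CHOOSE, REPHRASE
    SEQ_LEN = 24       # 12 requests under 2 cost settings per participant

    model = keras.Sequential()
    model.add(layers.LSTM(64, return_sequences=True,
                          input_shape=(SEQ_LEN, NUM_FEATURES)))
    for _ in range(14):  # 15 LSTM layers in total, as in the text
        model.add(layers.LSTM(64, return_sequences=True))
    # Linear layer + softmax at every time step: y_t ~ softmax(W h_t + b)
    model.add(layers.TimeDistributed(
        layers.Dense(NUM_CLASSES, activation="softmax")))
    model.compile(optimizer="adam", loss="categorical_crossentropy")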
5   The SLU System Scusi?

Scusi? [Zukerman et al., 2015] is a system that implements an anytime, numerical mechanism for the interpretation of spoken descriptions, focusing on a household context. It has four processing stages, where (intermediate) interpretations in each stage can have multiple parents in the previous stage, and can produce multiple children in the next stage; early processing stages may be probabilistically revisited; and only the most promising options in each stage are explored further.
[Figure 3: Scusi?'s workflow and UCG-to-ICG relations. The speech wave for "the brown stool near the table" yields ASR output 1, "the brown stool near the table", and ASR output 2, "the blown store near the table". Each output yields a UCG: UCG-1 has OBJECT stool (DEFINITE: YES, COLOUR: BROWN, with coordinates in colour space) near LANDMARK table (DEFINITE: YES); UCG-2 has OBJECT store (DEFINITE: YES, ATTRIBUTE: UNKNOWN) near LANDMARK table (DEFINITE: YES). The UCGs are instantiated as ICG-1 (OBJECT stool-L, Location_near, LANDMARK table-1) and ICG-2 (OBJECT stool-2, Location_near, LANDMARK table-1).]
Scusi?'s workflow – The system takes as input a speech signal, and uses an ASR to produce candidate texts. Each text is assigned a score given the speech wave, and passed to an error-detection module that postulates which words were correctly or wrongly recognized by the ASR [Zukerman and Partovi, 2017] — this component is required, as in real life we don't have access to transcriptions. Next, Scusi? applies Charniak's probabilistic parser (bllip.cs.brown.edu/resources.shtml#software) to syntactically analyze the texts, yielding at most 50 parse trees per text. The third stage applies mapping rules to the parse trees to generate Uninstantiated Concept Graphs (UCGs) [Sowa, 1984] that represent the semantics of the descriptions. The final stage instantiates the UCGs with objects and relations from the current context, and returns candidate Instantiated Concept Graphs (ICGs) ranked in descending order of merit (score).
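The following schematic summarizes these stages; it is a paraphrase of the text above, and every function name in it is an illustrative stand-in rather than Scusi?'s actual API.

    # Schematic of Scusi?'s four processing stages (illustrative only).
    # The stage implementations are injected as parameters, since they are
    # hypothetical stand-ins, not the system's real interfaces.
    def interpret(speech_signal, context, asr, detect_errors, parse,
                  to_ucg, instantiate, score, max_parses=50):
        texts = asr(speech_signal)                     # scored candidate texts
        texts = [detect_errors(t) for t in texts]      # flag mis-recognized words
        parses = [p for t in texts for p in parse(t)[:max_parses]]
        ucgs = [to_ucg(p) for p in parses]             # semantics of each parse
        icgs = [i for u in ucgs for i in instantiate(u, context)]
        return sorted(icgs, key=score, reverse=True)   # ICGs by descending score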
   Figure 3 illustrates this process for the description "the brown stool near the table" in the context of Figure 1(d). All stages produce several outputs, but we show only two outputs for each of three stages (ASR, UCG and ICG). In addition, in this example, both UCGs are parents of the two ICGs, but only the match with ICG-1 is shown in Figure 3. The first ASR output is correct, and the second has "blown store" instead of "brown stool". Each of these outputs yields one UCG (via a parse tree), where the object in the second UCG has an unknown attribute, as Scusi? doesn't recognize the modifier "blown" (unknown attributes occur when a user employs out-of-vocabulary noun modifiers or the ASR mis-recognizes noun modifiers).
   The score of each ICG depends on two factors: (1) how well the concepts and relations in it match the corresponding concepts and relations in its parent UCGs, and (2) how well the relations in the ICG match the context. For example, ICG-1 matches UCG-1 well, as stool-L can be called "stool" and it is brown, and table-1 can be called "table"; but its match-score with UCG-2 is lower, as stool-L cannot be called "store" and doesn't match the unknown attribute specified in UCG-2. ICG-1 matches the context well, as stool-L is near table-1. The details of the calculation of the scores are described in [Zukerman et al., 2015]. The aspects that are most relevant to this paper are that scores are represented on a logarithmic scale in order to avoid underflow, and scores of value 0 are smoothed to a low value ε in order not to invalidate any interpretation.

6   Classification with Automatically-Extracted Features

We automatically extracted features from the top-10 ICGs generated by Scusi? for each description (the correct interpretation is in the top-10 ICGs in about 90% of the cases) — these features appear in Tables 2 and 3.

Table 2: Features obtained from the word-error detector
   Is there an ASR output with all correct words?
   % of wrong words in the top ASR output
   % of wrong words in all ASR outputs
   % of ASR outputs with all correct words

Table 3: Features extracted from the top-10 ICGs
   Number of top-ranked ICGs with similar scores                      (×1)
   Location match score between an ICG and the context                (×10)
   Per-node features for an ICG in relation to its parent UCGs:
      Best colour-match score for a content node                      (×20)
      Best size-match score for a content node                        (×20)
      Maximum # of unknowns for a content node                        (×20)
      For a content node, % of UCG parents with corresponding node
         • with a colour match for this node                          (×20)
         • with a size match for this node                            (×20)
         • that have unknowns                                         (×20)
      For a node, % of UCG parents with corresponding node
         • that lexically match this node                             (×30)

   The features in Table 2, extracted from the output of Scusi?'s word-error detector, pertain to the intelligibility of the descriptions. The second and third features in Table 2 are among the most influential ones.5 The last feature is noteworthy because, even though only one ASR output is correct, the error-detection component may decide that several ASR outputs are correct, e.g., "the flower on the table" and "the flour on the table".

5 The frequency of features in the top two levels of 100 trees generated by RF was used as a proxy for their importance.

   The first feature in Table 3 represents the ambiguity of a description through the similarity between the scores of successive top-ranked ICGs, which is encoded as the ratio between the (logarithmic) score of the (i+1)-th ICG and the score of the i-th ICG. When this ratio between neighbouring ICGs is below an empirically-derived threshold, the ICGs are deemed similar. This feature is among the most influential ones.

   The remaining features in Table 3 pertain to the accuracy of a description, which is represented by the goodness of the match between an ICG and its parent UCGs, and between an ICG and the context. The second feature, which represents the accuracy of the location specified in a description, is among the most influential ones (for ICGs ranked 4th, 6th and 9th).
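The first feature in Table 3 may be sketched as follows; the threshold value shown is illustrative, since the text only says that the actual one was derived empirically.

    # Number of top-ranked ICGs with similar scores (first feature in Table 3).
    import math

    EPSILON = 1e-6  # smoothing for zero scores, as described in Section 5

    def num_similar_top_icgs(scores, threshold=1.5):
        """scores: ICG scores in descending rank order, assumed in (0, 1)."""
        logs = [math.log(min(max(s, EPSILON), 1 - EPSILON)) for s in scores]
        count = 1  # the top-ranked ICG always counts
        for i in range(len(logs) - 1):
            # Log scores are negative, so the ratio of successive log scores
            # is >= 1; a ratio below the threshold marks the ICGs as similar.
            if logs[i + 1] / logs[i] < threshold:
                count += 1
            else:
                break
        return count

    print(num_similar_top_icgs([0.80, 0.78, 0.30]))  # -> 2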




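The importance proxy of footnote 5 could be reconstructed in scikit-learn as sketched below; this is our illustration, not the original analysis code.

    # Count how often each feature occupies the top two levels of the trees
    # of a fitted random forest (a reconstruction of the footnote 5 proxy).
    from collections import Counter
    from sklearn.ensemble import RandomForestClassifier

    def top_level_feature_counts(rf: RandomForestClassifier) -> Counter:
        counts = Counter()
        for est in rf.estimators_:  # e.g., 100 trees
            tree = est.tree_
            nodes = [0, tree.children_left[0], tree.children_right[0]]
            for n in nodes:
                # Leaves carry feature index -2; skip them.
                if n >= 0 and tree.feature[n] >= 0:
                    counts[tree.feature[n]] += 1
        return counts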
   As seen in Figure 3, content nodes (objects and landmarks) in UCGs may have colour and size descriptors, as well as unknown attributes. The first six per-node features in Table 3 represent the goodness of attribute matches between the content nodes (object and landmark) of an ICG and the corresponding nodes in its parent UCGs. Two size-match features, one colour-match feature and one unknown feature for objects of ICGs at various ranks are among the most influential features.

   The last row in Table 3 represents the goodness of lexical matches between the nodes in an ICG and the corresponding nodes in its parent UCGs. This feature is among the most influential for the objects of most of the top-10 ICGs.

   To illustrate these features, let's return to the UCG-ICG matches in Figure 3 for the request "move the brown stool near the table" in the context of Figure 1(d). The score of the top-ranked ICG, viz. ICG-1, is significantly higher than that of ICG-2. Hence, the value of the first feature in Table 3 is 1. As mentioned above, stool-L is near table-1, yielding a high location match score for ICG-1. 50% of the UCG parents have a lexical match with the object in ICG-1, as "store" doesn't match any designation of stool-L; but 100% of the UCG parents have a lexical match with the landmark in ICG-1 (table-1). Due to the unknown attribute in the object of UCG-2, the maximum number of unknowns for the ICG-1 object is 1, and the percentage of UCG parents that have unknowns for the ICG-1 object is 50%; while 0% of UCG parents have unknowns for the ICG-1 landmark. Since the colour specified in UCG-1 matches the colour of stool-L, the maximum colour match for the object of ICG-1 is 1, but the percentage of UCG parents with a colour match for the ICG-1 object is 50%, as UCG-2 doesn't have a colour attribute.
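This worked example can be reproduced with the small sketch below; it is our reconstruction, and the dictionary layout is illustrative rather than Scusi?'s internal representation.

    # Per-node "% of UCG parents" features for ICG-1's object (stool-L).
    def parent_percentages(icg_names, icg_colour, ucg_parents):
        n = len(ucg_parents)
        lexical = sum(u["head"] in icg_names for u in ucg_parents) / n
        colour = sum(u.get("colour") == icg_colour for u in ucg_parents) / n
        unknown = sum(u.get("unknowns", 0) > 0 for u in ucg_parents) / n
        return lexical, colour, unknown

    # UCG-1 has OBJECT "stool" (COLOUR: BROWN); UCG-2 has OBJECT "store"
    # with one unknown attribute.
    ucgs = [{"head": "stool", "colour": "brown"},
            {"head": "store", "unknowns": 1}]
    print(parent_percentages({"stool"}, "brown", ucgs))  # (0.5, 0.5, 0.5)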
6.1   Response Classification

We experimented with the classifiers considered in Section 4.1, except the RNN, using the 165 features described in Tables 2 and 3 instead of the manually-obtained ones.6 The RNN was omitted due to the above-described removal of requests, which disrupts the sequences. As before, we performed 10-fold cross-validation.

6 Applying Principal Components Analysis to reduce the number of features had no effect on the classifiers' performance.

Table 4: Performance with automatically-extracted features
   Classifier   Features                                            Precision   Recall
   RF           automatically-extracted                             0.73        0.74
   RF           + Gender & English nativeness                       0.74        0.74
   DT           + Gender & English nativeness + 3-Back responses    0.72        0.72
   RF1P         automatically-extracted                             0.93        0.92

   Table 4 displays our results. The classifier with the best performance for a particular configuration of manually-tagged features also had the best performance for the corresponding configuration of automatically-extracted features. Surprisingly, overall performance with these features was significantly better (with p-value=0.01) than the performance obtained with the manually-tagged features, both for the responses given by 40 participants and for the responses provided by one person. In the former case, 3-Back responses had an adverse effect on performance, and in the latter case, they had no effect. The best performance for the 40-participant dataset was obtained with RF plus Gender and English nativeness, but the differences between the classifiers were not statistically significant. The per-class performance of this classifier appears in the fourth and fifth columns of Table 5. As for the manually-tagged features, the worst precision and recall were obtained for CONFIRM, but the performance for REPHRASE was only slightly worse than for the other classes.

Table 5: Per-class performance of the best classifier for manually- and automatically-extracted features
   Class       Manually-tagged Features     Automatically-extracted Features
               Precision     Recall         Precision     Recall
   DO          0.72          0.93           0.82          0.83
   CONFIRM     0.28          0.10           0.42          0.38
   CHOOSE      0.70          0.64           0.74          0.76
   REPHRASE    0.54          0.31           0.70          0.70

7   Conclusion and Future Work

We have offered a corpus comprising requests for objects in physical spaces, and the responses given by people for these requests. We generated two datasets based on this corpus: a manually-tagged dataset, and a dataset which includes features that are automatically extracted from the output of an SLU module. These datasets were used in a classification-based approach for generating responses to spoken requests.

   Our results show that, surprisingly, classifiers trained on the second dataset outperformed those trained on the first. As mentioned in Section 4, analysis of the data reveals that different users often provide different responses for requests that have identical manually-tagged features. For instance, three participants who were shown the following ASR outputs responded with DO, CONFIRM and REPHRASE (the option chosen by our classifier): (1) "get a blade in the rights of the disabled", (2) "get I played in the rights of the disabled", (3) "get I played in the right of the devil", and (4) "get a blade in the right of the devil". This discrepancy may be partially due to a mixture of individual ability to compensate for mis-heard utterances combined with risk-taking attitude — traits that may be related to the English nativeness and Gender features respectively, which improve performance. In light of this, we posit that additional features that reflect personal disposition could yield further improvements. This notion is reinforced by the significantly better classification performance for the responses obtained from a single user (albeit one familiar with the system) compared with the performance for the responses of 40 participants.

   A complementary explanation for the worse classification performance obtained for the manually-tagged dataset is that this dataset encodes intelligibility, ambiguity and accuracy of descriptions in a general way, while the specific information encoded in the automatically-extracted dataset (i.e.,
lexical, colour, size and location match for each of the top-10 ICGs) is important for classification. The only aspect where the manual encoding is more informative than the automatic encoding pertains to phonetic similarity, which is one of the most influential features for this dataset. In the future, we will incorporate specific features about lexical, colour, size and location match and out-of-vocabulary words into the manually-generated tags, and phonetic similarity into the automatically-extracted features.

   In terms of dialogue history, our results are inconclusive. Our hypothesis that dialogue history affects users' choices was confirmed (for three preceding requests) for the manually-tagged requests, but not for the automatically-tagged ones.

   Finally, as noted in [Inouye and Biermann, 2005; Singh et al., 2002], users may be satisfied with responses that differ from those provided by human consultants. To test this idea, we propose to conduct a follow-up experiment, where participants will be asked to rate the suitability of responses generated by our best classifier.

Acknowledgements

This research was supported in part by grant DP120100103 from the Australian Research Council.

References

[Bahdanau et al., 2016] D. Bahdanau, J. Chorowski, D. Serdyuk, and Y. Bengio. End-to-end attention-based large vocabulary speech recognition. In ICASSP'2016 – Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4945–4949, Shanghai, China, 2016.

[Carlson, 1983] L. Carlson. Dialogue Games: An Approach to Discourse Analysis. D. Reidel Publishing Company, Dordrecht, Holland, Boston, 1983.

[Chollet, 2017] F. Chollet. Keras. https://github.com/fchollet/keras, 2017.

[Chorowski et al., 2015] J.K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. In C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 577–585. Curran Associates, Inc., 2015.

[Dhingra et al., 2017] B. Dhingra, L. Li, X. Li, J. Gao, Y.N. Chen, F. Ahmed, and L. Deng. Towards end-to-end reinforcement learning of dialogue agents for information access. In ACL'17 – Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, 2017.

[Funakoshi et al., 2012] K. Funakoshi, M. Nakano, T. Tokunaga, and R. Iida. A unified probabilistic approach to referring expressions. In SIGDIAL'2012 – Proceedings of the 13th SIGdial Meeting on Discourse and Dialogue, pages 237–246, Seoul, South Korea, 2012.

[Gašić and Young, 2014] M. Gašić and S.J. Young. Gaussian processes for POMDP-based dialogue manager optimization. IEEE/ACM Transactions on Audio, Speech & Language Processing, 22(1):28–40, 2014.

[Hochreiter and Schmidhuber, 1997] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Horvitz et al., 2003] E. Horvitz, C. Kadie, T. Paek, and D. Hovel. Models of attention in computing and communication: From principles to applications. Communications of the ACM, 46(3):52–57, 2003.

[Inouye and Biermann, 2005] B. Inouye and A. Biermann. An algorithm that continuously seeks minimum length dialogs. In Proceedings of the 4th IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, pages 62–67, Edinburgh, Scotland, 2005.

[Kingma and Ba, 2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Lange and Suendermann-Oeft, 2014] P. Lange and D. Suendermann-Oeft. Tuning Sphinx to outperform Google's speech recognition API. In ESSV2014 – Proceedings of the Conference on Electronic Speech Signal Processing, Dresden, Germany, 2014.

[Lemon, 2010] O. Lemon. Learning what to say and how to say it: Joint optimisation of spoken dialogue management and natural language generation. Computer Speech and Language, 25(2):210–221, 2010.

[Li et al., 2016] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. In EMNLP2016 – Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Austin, Texas, 2016.

[Liao et al., 2006] W. Liao, W. Zhang, Z. Zhu, Q. Ji, and W.D. Gray. Toward a decision-theoretic framework for affect recognition and user assistance. International Journal of Human-Computer Studies, 64:847–873, 2006.

[Mesnil et al., 2015] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D.Z. Hakkani-Tür, X. He, L.P. Heck, G. Tur, D. Yu, and G. Zweig. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech & Language Processing, 23(3):530–539, 2015.

[Moratz and Tenbrink, 2006] R. Moratz and T. Tenbrink. Spatial reference in linguistic human-robot interaction: Iterative, empirically supported development of a model of projective relations. Spatial Cognition & Computation: An Interdisciplinary Journal, 6(1):63–107, 2006.

[Mrkšic et al., 2016] N. Mrkšic, D. Ó Séaghdha, T.H. Wen, B. Thomson, and S.J. Young. Neural belief tracker: Data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777v1, 2016.

[Paek and Horvitz, 2000] T. Paek and E. Horvitz. Conversation as action under uncertainty. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 455–464, Stanford, California, 2000.

[Prakash et al., 2016] A. Prakash, C. Brockett, and P. Agrawal. Emulating human conversations using convolutional neural network-based IR. In Proceedings of the Neu-IR16 SIGIR Workshop on Neural Information Retrieval, Pisa, Italy, 2016.

[Serban et al., 2016] I.V. Serban, T. Klinger, G. Tesauro, K. Talamadupula, B. Zhou, Y. Bengio, and A. Courville. Multiresolution recurrent neural networks: An application to dialogue response generation. arXiv preprint arXiv:1606.00776v1, 2016.

[Singh et al., 2002] S. Singh, D. Litman, M. Kearns, and M. Walker. Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research, 16:105–133, 2002.

[Sowa, 1984] J.F. Sowa. Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Reading, MA, 1984.




[Sugiura et al., 2009] K. Sugiura, N. Iwahashi, H. Kashioka, and
   S. Nakamura. Bayesian learning of confidence measure function
   for generation of utterances and motions in object manipulation
   dialogue task. In Proceedings of Interspeech 2009, pages 2483–
   2486, Brighton, United Kingdom, 2009.
[Theano Development Team, 2016] Theano Development Team.
   Theano: A Python framework for fast computation of mathemat-
   ical expressions. arXiv e-prints, abs/1605.02688, 2016.
[Trafton et al., 2005] J.G. Trafton, N.L. Cassimatis, M.D. Buga-
   jska, D.P. Brock, F.E. Mintz, and A.C. Schultz. Enabling effec-
   tive human-robot interaction using perspective-taking in robots.
   IEEE Transactions on Systems, Man and Cybernetics – Part A:
   Systems and Humans, 35(4):460–470, 2005.
[Wen et al., 2015] T.H. Wen, M. Gašić, N. Mrkšic, P.H. Su,
   D. Vandyke, and S.J. Young. Semantically conditioned LSTM-
   based natural language generation for spoken dialogue systems.
   In EMNLP2015 – Proceedings of the Conference on Empiri-
   cal Methods in Natural Language Processing, pages 1711–1721,
   Lisbon, Portugal, 2015.
[Williams and Young, 2007] J.D. Williams and S. Young. Partially
   observable Markov decision processes for spoken dialog sys-
   tems. Computer Speech and Language, 21(2):393–422, 2007.
[Williams and Zweig, 2016] J.D. Williams and G. Zweig. End-to-
   end LSTM-based dialog control optimized with supervised and
   reinforcement learning. arXiv preprint arXiv:1606.01269, 2016.
[Yang et al., 2016] X. Yang, Y.N. Chen, D. Hakkani-Tür, P. Gao,
   and L. Deng.        End-to-end joint learning of natural lan-
   guage understanding and dialogue manager. arXiv preprint
   arXiv:1612.00913v1, 2016.
[Young et al., 2007] S. Young, J. Schatzmann, B. Thomson,
   K. Weilhammer, and H. Ye. The hidden information state dia-
   logue manager: A real-world POMDP-based system. In NAACL-
   HLT 2007 – Proceedings of Human Language Technologies: The
   Annual Conference of the North American Chapter of the Asso-
   ciation for Computational Linguistics, Demonstration Program,
   pages 27–28, Rochester, New York, 2007.
[Young et al., 2013] S.J. Young, M. Gašić, B. Thomson, and
   J. Williams. POMDP-based statistical spoken dialogue systems:
   a review. Proceedings of the IEEE, 101(5):1160–1179, 2013.
[Zhao and Eskenazi, 2016] T. Zhao and M. Eskenazi. Towards end-
   to-end learning for dialog state tracking and management using
   deep reinforcement learning. In SIGDIAL’2016 – Proceedings
   of the 17th SIGdial Meeting on Discourse and Dialogue, pages
   1–10, Los Angeles, California, 2016.
[Zukerman and Partovi, 2017] I. Zukerman and A. Partovi. Im-
   proving the understanding of spoken referring expressions
   through syntactic-semantic and contextual-phonetic error correc-
   tion. Computer Speech and Language, 2017.
[Zukerman et al., 2015] I. Zukerman, S.N. Kim, Th. Kleinbauer,
   and M. Moshtaghi. Employing distance-based semantics to in-
   terpret spoken referring expressions. Computer Speech and Lan-
   guage, pages 154–185, 2015.



