Towards a Response Selection System for Spoken Requests in a Physical Domain

Andisheh Partovi, Ingrid Zukerman, Quan Tran
Faculty of Information Technology, Monash University
Clayton, VICTORIA 3800, AUSTRALIA
{andi.partovi,ingrid.zukerman,quan.tran}@monash.edu

Abstract

In this paper, we introduce a corpus comprising requests for objects in physical spaces, and responses given by people to these requests. We generated two datasets based on this corpus: a manually-tagged dataset, and a dataset which includes features that are automatically extracted from the output of a Spoken Language Understanding module. These datasets are used in a classification-based approach for generating responses to spoken requests. Our results show that, surprisingly, classifiers trained on the second dataset outperform those trained on the first, and produce acceptable levels of performance.

1 Introduction

In recent times, there have been significant improvements in Automatic Speech Recognition (ASR) [Chorowski et al., 2015; Bahdanau et al., 2016]. For example, a research prototype of a spoken slot-filling dialogue system reported a Word Error Rate (WER) of 13.8% when using "a generic dictation ASR system" [Mesnil et al., 2015], and Google reported an 8% WER for its ASR API.1 However, this API had a WER of 54.6% when applied to the Let's Go corpus [Lange and Suendermann-Oeft, 2014].

1 venturebeat.com/2015/05/28/google-says-its-speech-recognition-technology-now-has-only-an-8-word-error-rate.

ASR errors not only produce wrongly recognized entities or actions, but may also yield ungrammatical utterances that cannot be processed by a Spoken Language Understanding (SLU) system (e.g., "the plate inside the microwave" being mis-heard as "of plating sight the microwave"), or yield incorrect results when processed by an SLU system (e.g., due to fillers such as "hmm" being mis-heard as "and" or "on").

The problems caused by ASR errors are exacerbated by the fact that people often express themselves ambiguously or inaccurately [Trafton et al., 2005; Moratz and Tenbrink, 2006; Funakoshi et al., 2012; Zukerman et al., 2015]. An ambiguous reference to an object matches several objects well, while an inaccurate reference matches one or more objects partially. For instance, in a household domain, a reference to a "big blue mug" is ambiguous if there is more than one big blue mug in the room, and inaccurate if there are two mugs in the room, one big and red, and one small and blue. Further, ambiguous or inaccurate references may occur as a result of differences in parse trees (e.g., due to variants in prepositional attachments).

In addition to improving ASR and SLU modules, Spoken Dialogue Systems (SDSs) must be able to cope with these problems by generating appropriate responses to users' spoken utterances. Recently, deep-learning algorithms have been used for response generation [Serban et al., 2016; Yang et al., 2016]. However, these algorithms rely solely on requests and responses, without taking into account the (extra-linguistic) context, and typically require large amounts of data, which may not be available in some applications. In this paper, we offer a supervised-learning approach to response generation that is suitable for smaller datasets. Our approach harnesses the properties of utterances, dialogue history and context to choose response types for users' requests.

To obtain an upper bound for classifier performance, we trained a classifier using human-observable features of spoken requests and response types selected by participants for these requests. We then trained a second classifier using features that were automatically extracted from the output produced by our SLU system (Section 5). Surprisingly, the second classifier produced significantly better results than the first one.

The rest of this paper is organized as follows. In the next section, we discuss related work. Our corpus is described in Section 3. In Section 4, we detail the human-observable features and the response-classification results obtained with them.
We then offer a brief account of our SLU system, followed by a description of the features that are automatically extracted from its output and the resultant classification performance. Concluding remarks appear in Section 7.

2 Related Work

Decision-theoretic approaches have been the accepted standard for response generation in dialogue systems for some time [Carlson, 1983]. These approaches were initially implemented in SDSs in the form of Influence Diagrams that make myopic (one-shot) decisions regarding dialogue acts [Paek and Horvitz, 2000], procedures that optimize responses [Inouye and Biermann, 2005; Sugiura et al., 2009], and Dynamic Decision Networks that make decisions about dialogue acts over time [Horvitz et al., 2003; Liao et al., 2006]. Later on, reinforcement learning was employed to learn optimal policies over time [Lemon, 2010], with particular attention being paid to Partially Observable Markov Decision Processes [Williams and Young, 2007; Young et al., 2013; Gašić and Young, 2014], and their extension Hidden Information State [Young et al., 2007; Young et al., 2013]. Owing to the complexity of these formalisms, they have been used mainly in slot-filling applications, e.g., making airline and restaurant reservations [Young et al., 2013].

Recently, deep learning has been applied to various aspects of SDSs [Wen et al., 2015; Li et al., 2016; Mrkšic et al., 2016; Prakash et al., 2016; Serban et al., 2016; Yang et al., 2016]. Wen et al. [2015] focused on the generation of linguistically varied responses, and Mrkšic et al. [2016] proposed a dialogue-state tracking framework.
The generation of dialogue contributions was studied in [Li et al., 2016; Prakash et al., 2016] for chatbots; in [Serban et al., 2016] for help-desk responses and Twitter follow-up statements; and in [Yang et al., 2016] for slot tagging, and user-intent and system-action prediction in slot-filling applications. A combination of deep learning and reinforcement learning has been used in end-to-end dialogue systems that query a knowledge base, where user utterances are mapped to a clarification question or a knowledge-base query [Williams and Zweig, 2016; Zhao and Eskenazi, 2016; Dhingra et al., 2017]. All these systems learn to generate complete responses from large corpora comprising request-response pairs.

Our work follows this supervised-learning trend in a setting where the appropriateness of a response depends both on the request and on the physical context. Further, our dataset is significantly smaller than those used by neural mechanisms.

3 The Corpus

Our corpus, which was gathered in two experiments, comprises requests to fetch or move household objects, and responses to these requests.

Experiment 1 – This experiment replicates the experiment described in [Zukerman et al., 2015] using the Google ASR API, instead of Microsoft Speech SDK 6.1 — the WER of the Google API was 13% for our corpus. 35 participants were asked to describe 12 designated objects (labelled A to L) in the scenarios depicted in Figure 1. Each scenario contains between 8 and 16 household objects varying in size, colour and position. The participants were allowed to repeat a description up to two times. In total, they recorded 478 descriptions, such as the following: "the computer under the table", "the picture on the wall", "the green plate next to the screwdriver at the top of the table", "the plate in the corner of the table", and "the large pink ball in the middle of the room".

Experiment 2 – This experiment took the form of an online survey where participants had to indicate how they would respond to a (potentially mis-heard) request. Each participant was shown the top four ASR outputs for the request versions of 12 descriptions generated by one participant in the first experiment, along with the images in Figure 1. For instance, "the pot plant on the table", uttered in the context of Figure 1(a), was converted to "get the pot plant on the table"; and "the green bookcase", uttered in the context of Figure 1(d), was presented as "move the green bookcase".

Each participant was then asked to choose a response for each request from the following four response types (participants were given a description of each response type):

• DO: suitable when the addressee is sure about which object the request refers to.
• CONFIRM: suitable when the addressee feels the need to confirm the requested object before taking action.
• CHOOSE: suitable when the addressee hesitates between several objects.
• REPHRASE: suitable when part or all of a request is so unintelligible that the addressee cannot understand it.

These choices were made under two cost settings: low-cost – where participants were told that the requested object must be delivered to someone in the same room; and high-cost – where they were told that the object must be delivered to a far-away location. These settings were designed to discriminate between situations where mistakes are fairly inconsequential and situations where mistakes are costly.

40 people took part in this experiment (six of them also participated in the first experiment); half of the participants were native English speakers, and half were male. Thirteen people participated in an initial version of the experiment where they first chose response types for all the requests under the low-cost setting, and then chose response types for the same requests under the high-cost setting. We modified the experiment on the basis of the participants' feedback, so that the remaining 27 participants considered each request under the low-cost setting, and were immediately asked how their response would differ under the high-cost setting. This experimental variation had no effect on response-classification performance (Sections 4.1 and 6.1).

To determine the effect of personal variations on classification performance, one of the authors, who is familiar with the system, selected response types for all the requests.

3.1 Analysis and Post Processing

In total, we collected 960 request-response pairs (= 12 requests × 2 cost factors × 40 participants). 24.2% of these requests had an unintelligible semantic role in at least one ASR output, with the vast majority occurring in the OBJECT of the descriptions; 17.9% were ambiguous (i.e., they had more than one reasonable referent); and only 3.8% were inaccurate (i.e., they did not match perfectly any referent).
In order to train both classifiers on the same corpus, we removed requests that don't fit the requirements of the automatic feature-extraction process (Section 6). Specifically, we excluded 62 descriptions (13%) that had more than one prepositional phrase, and 43 descriptions (9%) that could not be processed by our SLU module [Zukerman et al., 2015] (Section 5). As a result, our corpus contains 375 descriptions, which yield a total of 750 requests for both cost settings. The responses to these requests were distributed as follows: 51.9% DO (majority class), 21.6% CHOOSE, 14.1% REPHRASE, and 12.4% CONFIRM.

It is worth noting that the response types chosen for the excluded requests were included in the dataset as features in order to enable us to determine the effect of dialogue history on performance (Sections 4 and 6). Clearly, removing requests disrupts the actual sequence of events, which has an adverse effect on the performance of sequence classifiers (Section 4.2). In the near future, we will address this problem by including a feature set for all the requests in a sequence.

[Figure 1: Household scenes used in our experiments — (a) Positional relations in a room; (b) Colour, size and positional relations on a table; (c) Projective and positional relations on a table; (d) Colour, size and positional relations in a room]

4 Classification with Manually-Tagged Features

Two team members annotated each description obtained from the first experiment with the following features, which are indicative of inaccuracy and ambiguity, and were deemed relevant to a person's decision regarding how to respond to a request (the first annotator labelled the features, and the second annotator verified the annotations; disagreements were resolved by consensus).

1. Unintelligible role – This is the semantic role of a garbled portion of a description, where the possible values are {NONE, ALL, OBJECT, LANDMARK, OTHER}. For example, "the hottest under the table" has an unintelligible OBJECT, and "the green plate on the left of the Blues play" has an unintelligible LANDMARK.

2. # of reasonable interpretations – How many objects are reasonable referents for a description? For instance, the first of the above requests has two reasonable referents in the context of Figure 1(a), as there are only two objects under the table. Similarly, the two green plates on the table in Figure 1(b) are reasonable interpretations for the second request.
   People often compensate for mis-heard utterances by postulating reasonable words that sound similar to what was heard. We take this behaviour into account by splitting this feature into two sub-features: (2a) With phonetic similarity and (2b) Without phonetic similarity. For example, when considering phonetic similarity in the context of Figure 1(b), "blue plate" is a sensible replacement for "Blues play", yielding one reasonable interpretation for the second request (the green plate labeled E).

3. Do the reasonable interpretations include fewer than all the objects in the context? (YES, NO) – This feature indicates how much information can be extracted from a description, e.g., the value of this feature is NO for "the blue plate on the table" in the context of Figure 1(c), since all the objects on the table are reasonable referents for this description. As above, this feature is split into two sub-features: (3a) With phonetic similarity and (3b) Without phonetic similarity.

4. # of perfect interpretations – How many objects match perfectly a description? For example, the two balls in Figure 1(d) match perfectly the description "the ball". Note that the difference between # of reasonable interpretations and # of perfect interpretations indicates the accuracy of a description.

5. Do the perfect interpretations include fewer than all the objects in the context? (YES, NO) – This feature is similar to Feature #3. However, since we are considering only interpretations that match a request perfectly, there is no need to take into account phonetic similarity.
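To make the shape of this dataset concrete, the sketch below shows one possible encoding of a manually-tagged request as a flat record. It is our own illustration rather than the authors' actual data format: all field names, the one-hot treatment of the unintelligible role, and the inclusion of the optional features from Section 4.1 are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

RESPONSE_TYPES = ["DO", "CONFIRM", "CHOOSE", "REPHRASE"]   # order used for three_back counts
ROLES = ["NONE", "ALL", "OBJECT", "LANDMARK", "OTHER"]

@dataclass
class ManualFeatures:
    """One manually-tagged request (hypothetical encoding, not the paper's)."""
    unintelligible_role: str          # feature 1
    n_reasonable_phon: int            # feature 2a (with phonetic similarity)
    n_reasonable_nophon: int          # feature 2b (without phonetic similarity)
    reasonable_lt_all_phon: bool      # feature 3a
    reasonable_lt_all_nophon: bool    # feature 3b
    n_perfect: int                    # feature 4
    perfect_lt_all: bool              # feature 5
    # additional features explored in Section 4.1
    gender: str = "F"
    native_english: bool = True
    three_back: List[int] = field(default_factory=lambda: [0, 0, 0, 0])

    def to_vector(self) -> List[float]:
        """Flatten into a numeric vector; the semantic role is one-hot encoded."""
        vec = [float(self.unintelligible_role == r) for r in ROLES]
        vec += [float(self.n_reasonable_phon), float(self.n_reasonable_nophon),
                float(self.reasonable_lt_all_phon), float(self.reasonable_lt_all_nophon),
                float(self.n_perfect), float(self.perfect_lt_all),
                float(self.gender == "F"), float(self.native_english)]
        vec += [float(c) for c in self.three_back]
        return vec
```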
4.1 Response Classification

We experimented with several classification algorithms, including Naïve Bayes, Support Vector Machines, Decision Trees (DT) and Random Forests (RF), to learn response types from the data collected in our experiments. Here we report on the results obtained with DT and RF, which had the best performance.2 We used the above features to determine baseline performance, and experimented with four additional features: Gender; English nativeness – whether the participant is a native English speaker; 3-Back responses (vector of length 4) – the counts of the response types provided by an Experiment 2 participant for the three preceding requests;3 and Cost – high or low. This feature worsened performance in all cases, and was removed.

2 We used over- and under-sampling to try to deal with the large majority class, but neither affected the classifiers' performance.
3 We experimented with several sequence lengths, of which 3-Back yielded the best results. We also investigated a setting where the counts of the response types chosen for all the other 23 requests were used as features. This setting, which is clearly unfeasible, gave the best results, achieving 0.70 precision and 0.68 recall.

We performed 10-fold cross-validation to evaluate classifier performance; statistical significance was computed using the Wilcoxon signed-rank test. Rows 2-4 in Table 1 display the best results obtained by our classifiers for each feature configuration.

Table 1: Performance with manually-tagged features

  Classifier   Manually-tagged Features                     Precision   Recall
  RF                                                        0.58        0.65
  RF           + Gender & English nat.                      0.62        0.67
  DT           + Gender & English nat. + 3-Back responses   0.63        0.68
  RNN          + entire previous sequence                   0.55        0.62
  RF1P         + 3-Back responses                           0.81        0.82
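For concreteness, the following scikit-learn sketch mirrors the evaluation protocol just described (10-fold cross-validation over a Random Forest and a Decision Tree). The feature matrix, labels and hyper-parameters shown are placeholders, not the paper's actual data or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

def evaluate(clf, X, y, folds=10, seed=0):
    """Return per-fold weighted precision and recall for one classifier."""
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    precision = cross_val_score(clf, X, y, cv=cv, scoring="precision_weighted")
    recall = cross_val_score(clf, X, y, cv=cv, scoring="recall_weighted")
    return precision, recall

# Stand-in data: X is (n_requests, n_features), y holds the response types.
X = np.random.rand(750, 15)
y = np.random.choice(["DO", "CONFIRM", "CHOOSE", "REPHRASE"], size=750)

for name, clf in [("RF", RandomForestClassifier(n_estimators=100, random_state=0)),
                  ("DT", DecisionTreeClassifier(random_state=0))]:
    p, r = evaluate(clf, X, y)
    print(f"{name}: precision={p.mean():.2f} recall={r.mean():.2f}")
```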
RF yielded the best results for the manually-tagged features alone, and for these features plus Gender and English nativeness; while DT produced the best results overall when 3-Back responses were added (statistically significant with p-value=0.05). The most influential features in the decision tree were # of perfect interpretations, # of reasonable interpretations with phonetic similarity, and # of rephrases in 3-Back responses. The per-class performance of DT appears in the second and third columns of Table 5. Note the poor precision and recall obtained for CONFIRM, which was often confused with DO. DT's deficient performance for REPHRASE may be attributed to the fact that requests that had the same features, in particular those with partially or completely unintelligible ASR outputs, elicited different responses from the participants.

As mentioned in Section 3, we also trained and tested the classifiers using response types selected by only one person – the first author. The best performance was achieved with an RF classifier that includes 3-Back responses, denoted RF1P. This performance was much better than that of the classifiers trained with the response types of 40 participants, which indicates that personal attributes affect people's responses.4

4 We tried to address this issue by clustering users based on the number of times they chose each response type, but didn't get good clusters for k < 10.

4.2 Sequence Classification

In order to investigate the influence of a sequence of request-response pairs on future responses, we trained and tested a Recurrent Neural Network (RNN) as a sequence classifier. Our RNN model is based on the Long Short-Term Memory (LSTM) architecture [Hochreiter and Schmidhuber, 1997], which can capture long-range dependencies. If we denote the features of the t-th utterance as f_t, the hidden state of the RNN at time step t+1 is calculated as a function of the input at time step t+1, f_{t+1}, and the previous hidden state, h_t: h_{t+1} = LSTM(h_t, f_{t+1}). With this mechanism, the model maps the sequence of features to a sequence of hidden vectors, which are decoded into a sequence of labels by a linear neural net layer: y_t ∼ softmax(W h_t + b).

A natural extension of this model is to stack the LSTM layers, i.e., the outputs of the first LSTM layer are given as input to the second layer, and so on; our model stacks 15 layers of LSTMs. This model was implemented with Keras [Chollet, 2017] and Theano [Theano Development Team, 2016], and was trained to minimize categorical cross-entropy loss using the Adam SGD learner [Kingma and Ba, 2014].

[Figure 2: RNN for response-type selection]

Owing to time limitations, we performed only 5-fold cross-validation. The results of the RNN appear in the penultimate row of Table 1. The RNN's disappointing performance may be attributed to the relatively small dataset combined with the disruption of several sequences due to the removal of request-response pairs (in order to reduce sequence disruption, we retained the 43 pairs corresponding to descriptions that could not be processed by our SLU system).
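Beyond the 15 stacked LSTM layers, the categorical cross-entropy loss and the Adam optimizer, the paper does not specify the architecture of this sequence classifier, so the Keras sketch below is only an approximate reconstruction; the hidden width, sequence length and feature dimension are assumed values, and the toy data is random.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES, N_CLASSES = 15, 4            # per-request features, response types
SEQ_LEN, N_LAYERS, HIDDEN = 24, 15, 64   # assumed values, not given in the paper

model = keras.Sequential()
model.add(keras.Input(shape=(SEQ_LEN, N_FEATURES)))
# Stack LSTM layers; every layer returns the full sequence so that a label
# can be predicted for each request in the sequence.
for _ in range(N_LAYERS):
    model.add(layers.LSTM(HIDDEN, return_sequences=True))
# Linear layer + softmax applied at every time step: y_t ~ softmax(W h_t + b)
model.add(layers.TimeDistributed(layers.Dense(N_CLASSES, activation="softmax")))
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Toy data: one sequence of requests per participant, one-hot response labels.
X = np.random.rand(40, SEQ_LEN, N_FEATURES)
Y = keras.utils.to_categorical(np.random.randint(N_CLASSES, size=(40, SEQ_LEN)))
model.fit(X, Y, epochs=2, batch_size=8, verbose=0)
```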
5 The SLU System Scusi?

Scusi? [Zukerman et al., 2015] is a system that implements an anytime, numerical mechanism for the interpretation of spoken descriptions, focusing on a household context. It has four processing stages, where (intermediate) interpretations in each stage can have multiple parents in the previous stage, and can produce multiple children in the next stage; early processing stages may be probabilistically revisited; and only the most promising options in each stage are explored further.

Scusi?'s workflow – The system takes as input a speech signal, and uses an ASR to produce candidate texts. Each text is assigned a score given the speech wave, and passed to an error-detection module that postulates which words were correctly or wrongly recognized by the ASR [Zukerman and Partovi, 2017] — this component is required, as in real life we don't have access to transcriptions. Next, Scusi? applies Charniak's probabilistic parser (bllip.cs.brown.edu/resources.shtml#software) to syntactically analyze the texts, yielding at most 50 parse trees per text. The third stage applies mapping rules to the parse trees to generate Uninstantiated Concept Graphs (UCGs) [Sowa, 1984] that represent the semantics of the descriptions. The final stage instantiates the UCGs with objects and relations from the current context, and returns candidate Instantiated Concept Graphs (ICGs) ranked in descending order of merit (score).

[Figure 3: Scusi?'s workflow and UCG-to-ICG relations — speech wave; ASR output 1: "the brown stool near the table"; ASR output 2: "the blown store near the table"; UCG-1 and UCG-2 (OBJECT stool/store, relation near, LANDMARK table); ICG-1 and ICG-2 (OBJECT stool-L/stool-2, Location_near, LANDMARK table-1)]

Figure 3 illustrates this process for the description "the brown stool near the table" in the context of Figure 1(d). All stages produce several outputs, but we show only two outputs for each of three stages (ASR, UCG and ICG). In addition, in this example, both UCGs are parents of the two ICGs, but only the match with ICG-1 is shown in Figure 3. The first ASR output is correct, and the second has "blown store" instead of "brown stool". Each of these outputs yields one UCG (via a parse tree), where the object in the second UCG has an unknown attribute, as Scusi? doesn't recognize the modifier "blown" (unknown attributes occur when a user employs out-of-vocabulary noun modifiers or the ASR mis-recognizes noun modifiers).

The score of each ICG depends on two factors: (1) how well the concepts and relations in it match the corresponding concepts and relations in its parent UCGs, and (2) how well the relations in the ICG match the context. For example, ICG-1 matches UCG-1 well, as stool-L can be called "stool" and it is brown, and table-1 can be called "table"; but its match-score with UCG-2 is lower, as stool-L cannot be called "store" and doesn't match the unknown attribute specified in UCG-2. ICG-1 matches the context well, as stool-L is near table-1. The details of the calculation of the scores are described in [Zukerman et al., 2015]. The aspects that are most relevant to this paper are that scores are represented on a logarithmic scale in order to avoid underflow, and scores of value 0 are smoothed to a low value ε in order not to invalidate any interpretation.
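A minimal sketch of these scoring conventions (log-scale scores, with zero values smoothed to a small ε) might look as follows; the ε value and the way the two factors are combined are our assumptions, since the actual calculation is given in [Zukerman et al., 2015].

```python
import math

EPSILON = 1e-6  # placeholder for the paper's low smoothing value

def log_score(p: float) -> float:
    """Represent a probability-like score on a log scale, smoothing zeros
    to a small epsilon so that no interpretation is invalidated."""
    return math.log(max(p, EPSILON))

def icg_score(ucg_match: float, context_match: float) -> float:
    """Combine the two factors of an ICG's score in log space (a sketch only;
    the real combination is described in [Zukerman et al., 2015])."""
    return log_score(ucg_match) + log_score(context_match)
```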
6 Classification with Automatically-Extracted Features

We automatically extracted features from the top-10 ICGs generated by Scusi? for each description (the correct interpretation is in the top-10 ICGs in about 90% of the cases) — these features appear in Tables 2 and 3. The features in Table 2, extracted from the output of Scusi?'s word-error detector, pertain to the intelligibility of the descriptions. The second and third feature in Table 2 are among the most influential ones.5 The last feature is noteworthy because, even though only one ASR output is correct, the error-detection component may decide that several ASR outputs are correct, e.g., "the flower on the table" and "the flour on the table".

5 The frequency of features in the top-two levels of 100 trees generated by RF was used as a proxy for their importance.

Table 2: Features obtained from the word-error detector

  Is there an ASR output with all correct words?
  % of wrong words in the top ASR output
  % of wrong words in all ASR outputs
  % of ASR outputs with all correct words

Table 3: Features extracted from the top-10 ICGs

  Number of top-ranked ICGs with similar scores (×1)
  Location match score between an ICG and the context (×10)
  Per-node features for an ICG in relation to its parent UCGs:
    Best colour-match score for a content node (×20)
    Best size-match score for a content node (×20)
    Maximum # of unknowns for a content node (×20)
    For a content node, % of UCG parents with corresponding node
      • with a colour match for this node (×20)
      • with a size match for this node (×20)
      • that have unknowns (×20)
    For a node, % of UCG parents with corresponding node
      • that lexically match this node (×30)

The first feature in Table 3 represents the ambiguity of a description through the similarity between the scores of successive top-ranked ICGs, which is encoded as the ratio between the (logarithmic) score of the i+1-th ICG and the score of the i-th ICG. When this ratio between neighbouring ICGs is below an empirically-derived threshold, they are deemed similar. This feature is among the most influential ones.
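One plausible way to compute this first feature is sketched below: it counts how long the leading run of top-ranked ICGs stays "similar", using the ratio of consecutive logarithmic scores. The threshold value is a placeholder for the paper's empirically derived one, and the leading-run reading of the feature is our assumption.

```python
def n_top_icgs_with_similar_scores(log_scores, threshold=1.05):
    """Number of top-ranked ICGs with similar scores (sketch).

    log_scores: ICG scores on a logarithmic scale, in descending order of
    merit (e.g., log-probabilities, so the values are negative). Consecutive
    ICGs are deemed similar when the ratio of the (i+1)-th score to the
    i-th score falls below the threshold.
    """
    count = 1  # the top-ranked ICG always counts
    for prev, curr in zip(log_scores, log_scores[1:]):
        if curr / prev < threshold:
            count += 1
        else:
            break
    return count

# e.g. n_top_icgs_with_similar_scores([-2.1, -2.2, -4.0]) returns 2
```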
The remaining features in Table 3 pertain to the accuracy of a description, which is represented by the goodness of the match between an ICG and its parent UCGs, and between an ICG and the context. The second feature, which represents the accuracy of the location specified in a description, is among the most influential ones (for ICGs ranked 4th, 6th and 9th).

As seen in Figure 3, content nodes (objects and landmarks) in UCGs may have colour and size descriptors, as well as unknown attributes. The first six per-node features in Table 3 represent the goodness of attribute matches between the content nodes (object and landmark) of an ICG and the corresponding nodes in its parent UCGs. Two size-match features, one colour-match feature and one unknown feature for objects of ICGs at various ranks are among the most influential features.

The last row in Table 3 represents the goodness of lexical matches between the nodes in an ICG and the corresponding nodes in its parent UCGs. This feature is among the most influential for the objects of most of the top-10 ICGs.

To illustrate these features, let's return to the UCG-ICG matches in Figure 3 for the request "move the brown stool near the table" in the context of Figure 1(d). The score of the top-ranked ICG, viz. ICG-1, is significantly higher than that of ICG-2. Hence, the value of the first feature in Table 3 is 1. As mentioned above, stool-L is near table-1, yielding a high location match score for ICG-1. 50% of the UCG parents have a lexical match with the object in ICG-1, as "store" doesn't match any designation of stool-L; but 100% of the UCG parents have a lexical match with the landmark in ICG-1 (table-1). Due to the unknown attribute in the object of UCG-2, the maximum number of unknowns for the ICG-1 object is 1, and the percentage of UCG parents that have unknowns for the ICG-1 object is 50%; while 0% of UCG parents have unknowns for the ICG-1 landmark. Since the colour specified in UCG-1 matches the colour of stool-L, the maximum colour match for the object of ICG-1 is 1, but the percentage of UCG parents with a colour match for the ICG-1 object is 50%, as UCG-2 doesn't have a colour attribute.

6.1 Response Classification

We experimented with the classifiers considered in Section 4.1, except the RNN, using the 165 features described in Tables 2 and 3, instead of the manually-obtained ones.6 The RNN was omitted due to the above-described removal of requests, which disrupts the sequence. As before, we performed 10-fold cross-validation.

6 Applying Principal Components Analysis to reduce the number of features had no effect on the classifiers' performance.

Table 4 displays our results. The classifier with the best performance for a particular configuration of manually-tagged features also had the best performance for the corresponding configuration of automatically-extracted features. Surprisingly, overall performance with these features was significantly better (with p-value=0.01) than the performance obtained with the manually-tagged features, both for the responses given by 40 participants and for the responses provided by one person. In the former case, 3-Back responses had an adverse effect on performance, and in the latter case, it had no effect. The best performance for the 40-participant dataset was obtained with RF plus Gender and English nativeness, but the differences between the classifiers were not statistically significant. The per-class performance of this classifier appears in the fourth and fifth columns of Table 5. As for the manually-tagged features, the worst precision and recall were obtained for CONFIRM, but the performance for REPHRASE was only slightly worse than for the other classes.

Table 4: Performance with automatically-extracted features

  Classifier   Automatically-extracted Features             Precision   Recall
  RF                                                        0.73        0.74
  RF           + Gender & English nat.                      0.74        0.74
  DT           + Gender & English nat. + 3-Back responses   0.72        0.72
  RF1P                                                      0.93        0.92

Table 5: Per-class performance of the best classifier for manually- and automatically-extracted features

               Manually-tagged Features    Automatically-extracted Features
  Class        Precision    Recall         Precision    Recall
  DO           0.72         0.93           0.82         0.83
  CONFIRM      0.28         0.10           0.42         0.38
  CHOOSE       0.70         0.64           0.74         0.76
  REPHRASE     0.54         0.31           0.70         0.70
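The significance comparisons reported in this section can be reproduced with SciPy's Wilcoxon signed-rank test over paired per-fold results, roughly as follows; the fold scores below are placeholder numbers, not the paper's.

```python
from scipy.stats import wilcoxon

# Paired per-fold recall of the two feature sets (placeholder values):
# one entry per cross-validation fold, with the same folds for both.
manual_folds    = [0.66, 0.64, 0.69, 0.63, 0.67, 0.65, 0.66, 0.64, 0.68, 0.66]
automatic_folds = [0.75, 0.72, 0.76, 0.71, 0.74, 0.73, 0.75, 0.72, 0.77, 0.74]

stat, p_value = wilcoxon(manual_folds, automatic_folds)
print(f"Wilcoxon statistic={stat:.2f}, p-value={p_value:.3f}")
```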
7 Conclusion and Future Work

We have offered a corpus comprising requests for objects in physical spaces, and the responses given by people for these requests. We generated two datasets based on this corpus: a manually-tagged dataset, and a dataset which includes features that are automatically extracted from the output of an SLU module. These datasets were used in a classification-based approach for generating responses to spoken requests.

Our results show that, surprisingly, classifiers trained on the second dataset outperformed those trained on the first. As mentioned in Section 4, analysis of the data reveals that different users often provide different responses for requests that have identical manually-tagged features. For instance, three participants who were shown the following ASR outputs responded with DO, CONFIRM and REPHRASE (the option chosen by our classifier): (1) "get a blade in the rights of the disabled", (2) "get I played in the rights of the disabled", (3) "get I played in the right of the devil", and (4) "get a blade in the right of the devil". This discrepancy may be partially due to a mixture of individual ability to compensate for mis-heard utterances combined with risk-taking attitude — traits that may be related to the English nativeness and Gender features respectively, which improve performance. In light of this, we posit that additional features that reflect personal disposition could yield further improvements. This notion is reinforced by the significantly better classification performance for the responses obtained from a single user (albeit one familiar with the system) compared with the performance for the responses of 40 participants.

A complementary explanation for the worse classification performance obtained for the manually-tagged dataset is that this dataset encodes intelligibility, ambiguity and accuracy of descriptions in a general way, while the specific information encoded in the automatically-extracted dataset (i.e., lexical, colour, size and location match for each of the top-10 ICGs) is important for classification. The only aspect where the manual encoding is more informative than the automatic encoding pertains to phonetic similarity, which is one of the most influential features for this dataset. In the future, we will incorporate specific features about lexical, colour, size and location match and out-of-vocabulary words into the manually-generated tags, and phonetic similarity into the automatically-extracted features.

In terms of dialogue history, our results are inconclusive. Our hypothesis that dialogue history affects users' choices was confirmed (for three preceding requests) for the manually-tagged requests, but not for the automatically-tagged ones.

Finally, as noted in [Inouye and Biermann, 2005; Singh et al., 2002], users may be satisfied with responses that differ from those provided by human consultants. To test this idea, we propose to conduct a follow-up experiment, where participants will be asked to rate the suitability of responses generated by our best classifier.

Acknowledgements

This research was supported in part by grant DP120100103 from the Australian Research Council.
References

[Bahdanau et al., 2016] D. Bahdanau, J. Chorowski, D. Serdyuk, and Y. Bengio. End-to-end attention-based large vocabulary speech recognition. In ICASSP'2016 – Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4945–4949, Shanghai, China, 2016.

[Carlson, 1983] L. Carlson. Dialogue Games: An Approach to Discourse Analysis. D. Reidel Publishing Company, Dordrecht, Holland, Boston, 1983.

[Chollet, 2017] F. Chollet. Keras. https://github.com/fchollet/keras, 2017.

[Chorowski et al., 2015] J.K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. In C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 577–585. Curran Associates, Inc., 2015.

[Dhingra et al., 2017] B. Dhingra, L. Li, X. Li, J. Gao, Y.N. Chen, F. Ahmed, and L. Deng. Towards end-to-end reinforcement learning of dialogue agents for information access. In ACL'17 – Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, 2017.

[Funakoshi et al., 2012] K. Funakoshi, M. Nakano, T. Tokunaga, and R. Iida. A unified probabilistic approach to referring expressions. In SIGDIAL'2012 – Proceedings of the 13th SIGdial Meeting on Discourse and Dialogue, pages 237–246, Seoul, South Korea, 2012.

[Gašić and Young, 2014] M. Gašić and S.J. Young. Gaussian processes for POMDP-based dialogue manager optimization. IEEE/ACM Transactions on Audio, Speech & Language Processing, 22(1):28–40, 2014.

[Hochreiter and Schmidhuber, 1997] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Horvitz et al., 2003] E. Horvitz, C. Kadie, T. Paek, and D. Hovel. Models of attention in computing and communication: From principles to applications. Communications of the ACM, 46(3):52–57, 2003.

[Inouye and Biermann, 2005] B. Inouye and A. Biermann. An algorithm that continuously seeks minimum length dialogs. In Proceedings of the 4th IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, pages 62–67, Edinburgh, Scotland, 2005.

[Kingma and Ba, 2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Lange and Suendermann-Oeft, 2014] P. Lange and D. Suendermann-Oeft. Tuning Sphinx to outperform Google's speech recognition API. In ESSV2014 – Proceedings of the Conference on Electronic Speech Signal Processing, Dresden, Germany, 2014.

[Lemon, 2010] O. Lemon. Learning what to say and how to say it: Joint optimisation of spoken dialogue management and natural language generation. Computer Speech and Language, 25(2):210–221, 2010.

[Li et al., 2016] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. In EMNLP2016 – Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Austin, Texas, 2016.

[Liao et al., 2006] W. Liao, W. Zhang, Z. Zhu, Q. Ji, and W.D. Gray. Toward a decision-theoretic framework for affect recognition and user assistance. International Journal of Human-Computer Studies, 64:847–873, 2006.

[Mesnil et al., 2015] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D.Z. Hakkani-Tür, X. He, L.P. Heck, G. Tur, D. Yu, and G. Zweig. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech & Language Processing, 23(3):530–539, 2015.

[Moratz and Tenbrink, 2006] R. Moratz and T. Tenbrink. Spatial reference in linguistic human-robot interaction: Iterative, empirically supported development of a model of projective relations. Spatial Cognition & Computation: An Interdisciplinary Journal, 6(1):63–107, 2006.

[Mrkšic et al., 2016] N. Mrkšic, Ó.S. Diarmuid, T.H. Wen, B. Thomson, and S.J. Young. Neural belief tracker: Data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777v1, 2016.

[Paek and Horvitz, 2000] T. Paek and E. Horvitz. Conversation as action under uncertainty. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 455–464, Stanford, California, 2000.

[Prakash et al., 2016] A. Prakash, C. Brockett, and P. Agrawal. Emulating human conversations using convolutional neural network-based IR. In Proceedings of the Neu-IR16 SIGIR Workshop on Neural Information Retrieval, Pisa, Italy, 2016.
[Serban et al., 2016] I.V. Serban, T. Klinger, G. Tesauro, K. Talamadupula, B. Zhou, Y. Bengio, and A. Courville. Multiresolution recurrent neural networks: An application to dialogue response generation. arXiv preprint arXiv:1606.00776v1, 2016.

[Singh et al., 2002] S. Singh, D. Litman, M. Kearns, and M. Walker. Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research, 16:105–133, 2002.

[Sowa, 1984] J.F. Sowa. Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Reading, MA, 1984.

[Sugiura et al., 2009] K. Sugiura, N. Iwahashi, H. Kashioka, and S. Nakamura. Bayesian learning of confidence measure function for generation of utterances and motions in object manipulation dialogue task. In Proceedings of Interspeech 2009, pages 2483–2486, Brighton, United Kingdom, 2009.

[Theano Development Team, 2016] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, 2016.

[Trafton et al., 2005] J.G. Trafton, N.L. Cassimatis, M.D. Bugajska, D.P. Brock, F.E. Mintz, and A.C. Schultz. Enabling effective human-robot interaction using perspective-taking in robots. IEEE Transactions on Systems, Man and Cybernetics – Part A: Systems and Humans, 35(4):460–470, 2005.

[Wen et al., 2015] T.H. Wen, M. Gašić, N. Mrkšic, P. Hao Su, D. Vandyke, and S.J. Young. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In EMNLP2015 – Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1711–1721, Lisbon, Portugal, 2015.

[Williams and Young, 2007] J.D. Williams and S. Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2):393–422, 2007.

[Williams and Zweig, 2016] J.D. Williams and G. Zweig. End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning. arXiv preprint arXiv:1606.01269, 2016.

[Yang et al., 2016] X. Yang, Y.N. Chen, D. Hakkani-Tür, P. Gao, and L. Deng. End-to-end joint learning of natural language understanding and dialogue manager. arXiv preprint arXiv:1612.00913v1, 2016.

[Young et al., 2007] S. Young, J. Schatzmann, B. Thomson, K. Weilhammer, and H. Ye. The hidden information state dialogue manager: A real-world POMDP-based system. In NAACL-HLT 2007 – Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics, Demonstration Program, pages 27–28, Rochester, New York, 2007.

[Young et al., 2013] S.J. Young, M. Gašić, B. Thomson, and J. Williams. POMDP-based statistical spoken dialogue systems: a review. Proceedings of the IEEE, 101(5):1160–1179, 2013.

[Zhao and Eskenazi, 2016] T. Zhao and M. Eskenazi. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In SIGDIAL'2016 – Proceedings of the 17th SIGdial Meeting on Discourse and Dialogue, pages 1–10, Los Angeles, California, 2016.

[Zukerman and Partovi, 2017] I. Zukerman and A. Partovi. Improving the understanding of spoken referring expressions through syntactic-semantic and contextual-phonetic error correction. Computer Speech and Language, 2017.

[Zukerman et al., 2015] I. Zukerman, S.N. Kim, Th. Kleinbauer, and M. Moshtaghi. Employing distance-based semantics to interpret spoken referring expressions. Computer Speech and Language, pages 154–185, 2015.