Performance Predictors for Conversational Fashion Recommendation

Maria Vlachou 1, Craig Macdonald 1
1 University of Glasgow, UK

Abstract
In Conversational Recommendation Systems (CRS), a user can provide natural language feedback on suggested items, which the recommender uses to produce improved suggestions. Therefore, the success of a user's conversation with the CRS is determined by how well the system is able to interpret the user's feedback and the quality of the recommendations. Knowing whether a conversation is likely to be successful may allow the CRS to adjust accordingly - for instance, changing its retrieval strategy, or asking a clarifying question. Existing work on Query Performance Prediction (QPP) has examined a number of predictors that indicate the effectiveness of a search engine's ranking in response to a query. Inspired by existing work in QPP, we propose a framework for Conversational Performance Prediction (CPP) that aims to predict conversation failures by considering the recommendation ranking at different turns of a conversation, either one turn at a time, or by considering multiple consecutive turns. In this regard, we adapt post-retrieval predictors to address the multi-turn nature of the CRS task. We conduct our analysis on the Shoes and FashionIQ Shirts & Dresses datasets. In particular, as a ground truth, we measure conversation difficulty by the effectiveness of the ranking at a given turn of the conversation. Overall, we find some promise in score-based retrieval predictors for CPP, obtaining medium strength correlations with conversation difficulty - for instance, observing a Spearman's ρ of 0.423 on the Shoes dataset, which is comparable to correlations observed for standard QPP predictors on adhoc search tasks.

1. Introduction

Traditionally, Recommender Systems (RS) help users to find items of interest on the basis of user feedback in terms of ratings, clicks or reviews.
In contrast, Conversational Recommendation Systems (CRS), such as personal digital assistants [1], have facilitated more complex recommendation settings by suggesting items in response to voice or (natural language) chat interactions. In particular, a CRS allows a multi-turn dialogue with users and aims to assist them with achieving a number of task-oriented goals [2]. Indeed, at each turn users can provide their feedback or critique [3], which helps the system to improve recommendations [4].

One important aspect of natural language-based CRS is that they allow users to explore the range of available options and elicit their preferences. For example, Bursztyn et al. [7] created a multi-modal system, where users navigate in a setting of limited options, such as finding a restaurant near their location. In this setting, users start exploring an initial set of restaurants and have the opportunity to see their details by clicking through the options, while they are asked about the reasons for any negative feedback they provide. Another example of user exploration is MusicBot [8, 9], a music chatbot that first collects users' preferences and then makes suggestions based on different techniques of critiquing the song recommendations. In our work, we are focused upon conversational fashion image recommendation [5, 6, 10], an example of which is shown in Figure 1. In this task, the user has a target item in mind, and provides textual feedback (critiques) to direct the system towards retrieving images of fashion products that are more similar to their perceived target item.

Figure 1: Example of dialog-based recommendation in CRS. Pictures and dialogues from the Shoes dataset [5, 6].

4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, September 18-23 2022, Seattle, WA, USA.

However, not all conversations may lead to a satisfying outcome for the user.
This can be easily quantified in offline evaluation scenarios, where the CRS is evaluated across a pre-defined number of turns. For example, Yu et al. [11] found that, although users had the option to explore a list of options for a number of turns, the system was unable to find a relevant recommendation by turn 7, which might mean that the algorithm was still exploring the space. Also, in Wu et al. [10], the target item was found by the system at rank 1 in only 42% of conversations after (a maximum of) 10 turns. Therefore, exploration might result in an increased number of turns, which on the one hand might mean more engaged users [8], but at the same time suggests that conversations might often fail (i.e., the target item is not found). In this regard, we are interested in identifying indicators that can detect when this happens - for example, a conversation could fail because the system is unable to find the target item, or because the target item is not available.

In what follows, inspired by existing work on Query Performance Prediction (QPP) (e.g., [12, 13, 14, 15]), we aim to predict conversational failures by identifying specific indicators that are correlated with failure. In particular, we aim to determine the quality of multi-turn critiquing-based CRS recommendation by proposing predictors that consider the multi-turn aspect of conversational recommendation. The proposed predictors address characteristics of the retrieval scores of the top-recommended items and can predict poor performance across a shorter or longer number of turns in the conversation, which we call prediction horizons. In summary, this work makes the following contributions: (i) We propose a framework for Conversational Performance Prediction (CPP), which extends the existing work on QPP to a conversational recommendation setting; (ii) We show how to adapt the QPP evaluation methodology to a multi-turn conversational setting, which allows us to evaluate CPP predictors for both short- and long-term prediction horizons; (iii) We evaluate some of our proposed predictors on the Shoes [5, 6] dataset and the Fashion IQ Dresses and Shirts categories [16], using a state-of-the-art user simulator [6]. The rest of the paper is structured as follows: Section 2 presents the existing research on QPP, including pre- and post-retrieval predictors, as well as their probabilistic interpretation; Section 3 describes the conversational image recommendation task; Section 4 outlines our new proposed framework and predictors; Section 5 describes our experimental setup; Sections 6 & 7 present our results and provide concluding remarks.

m.vlachou.1@research.gla.ac.uk (M. Vlachou); craig.macdonald@glasgow.ac.uk (C. Macdonald)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

2. Related Work

In order to predict why a conversation with a CRS might fail, we need to identify indicators that show when the user is unable to find the target item during the interaction. In this regard, we are inspired by existing work from Query Performance Prediction (QPP), which we discuss in Section 2.1; later, in Section 2.2, we discuss applications of QPP in conversational contexts.

2.1. Query Performance Prediction

Traditionally, QPP is used to predict the effectiveness of a search results page returned in response to a query, in the absence of human relevance judgments [15]. It has applications in selective retrieval approaches [17, 18] and query features for learning-to-rank [19], to name but a few. Query performance predictors are generally grouped into pre-retrieval and post-retrieval predictors, which we discuss further below.

2.1.1. Pre-retrieval Query Performance Predictors

Pre-retrieval predictors are used to estimate the performance of queries before the retrieval stage, and are therefore independent of the search performed and the ranked list of results [14]. This means that pre-retrieval predictors base their predictions on properties of query terms or corpus-based statistics [12, 13, 14, 20, 21, 22]. Examples of pre-retrieval predictors that describe the statistical properties of the query terms or the corpus include the query length (number of non-stop words in the query), the standard deviation of the inverse document frequency of the query terms, the simplified query clarity score (SCS), which measures the occurrence of a query term in the query relative to its occurrence in the collection, and AvICTF, which considers the overall informativeness of the query terms using the collection model [23]. Another class of pre-retrieval predictors refers to linguistic features of the queries, such as syntactic complexity (distance between syntactically linked words) and word polysemy (number of semantic classes a word belongs to) [22]. Overall, with limited information available before retrieval commences, pre-retrieval predictors are widely considered less accurate for performance prediction than post-retrieval predictors [14].

2.1.2. Post-retrieval Query Performance Predictors

On the other hand, post-retrieval predictors are applied on the list of the top-ranked retrieved documents, and therefore use the relevance scores or the (textual) contents of the returned items. A first group of post-retrieval predictors examines the difference of the result list from the corpus, or the focus of the result list. For example, Clarity [12] measures the focus of the resulting ranking with respect to the corpus using the KL divergence between their respective language models, while the Weighted Information Gain (WIG) corresponds to the difference between the average retrieval score of the result list and that of the corpus [24]. A second group considers the distribution of the retrieval scores of the top-ranked items. Predictors in this group include Normalized Query Commitment (NQC) [25] (the standard deviation of the retrieval scores in the result list). The standard deviation is considered to be negatively correlated with the amount of query drift (the non-related information in the result list) [26]. This group also includes the modeling of retrieval scores: the top-ranked items can be modeled as a mixture of distributions corresponding to relevant and non-relevant items [27]. Another related predictor is autocorrelation [28], which assumes that documents whose vector space embeddings are closely related receive similar scores, and therefore, closely related scores would indicate similar performance.

A third group of post-retrieval predictors refers to the relation of the top-ranked retrieval scores with a particular reference list. Recently, a more generalised approach for estimating the effectiveness of a ranking was proposed, based on the assumption that high association with pseudo-effective reference lists and low association with pseudo-ineffective lists indicates effectiveness [29]. One example is the utility estimation framework (UEF) [30], which estimates the utility of a given ranking with respect to how much it represents an underlying information need [31]. The utility is estimated by the expected similarity between a given document ranking and those induced by estimates of relevance language models (these rankings are assumed to be representative of the information need) [32]. A similar predictor to the UEF approach is query feedback (QF) [24], which measures the overlap of top items between the result list and a reference list retrieved from the corpus using a language model induced from the result list. Autocorrelation [28] can also fall under this category, if we compare the result list of the original retrieval scores with a reference list that contains either a perturbed version of the scores diffused in space, or a list with the averaged values from multiple retrievals for the same query. Lastly, rank-biased overlap (RBO) [33], an inter-list similarity measure of the expected average overlap between two rankings, can also be applied to the QPP task.

Finally, we note that some recent QPP work (e.g. [29, 34]) has focused upon probabilistic frameworks for QPP, which can integrate both pre-retrieval and post-retrieval predictors. However, many of the underlying intuitions encapsulated by these frameworks are already addressed in the previously described predictors.

2.2. Query Performance Prediction in Conversational Search

Natural language-based conversational systems allow users to express complex feedback through a dialogue, thus resulting in more natural interactions [35]. To be able to predict the likelihood of success of a conversation, we need to consider the salient aspects of the conversational setting, such as the users' feedback and the iterative turn-based nature of the interaction process.

However, while QPP has been widely explored for (single turn) queries in search settings, the area of conversational search or recommendation has seen much less work. For example, one recent work uses the predicted effectiveness of the top-retrieved documents - specifically, extracted features such as noun phrases or named entities - to decide when to generate clarifying questions [36]. Indeed, clarifications are useful for both the user and the system [37, 38, 39]. Also, Roitman et al. [40] examined a constrained retrieval setting, namely the interaction with a conversational assistant, where the assistant needs to decide whether the provided answer could be accepted. The authors built a classifier that determines the answer quality by adapting some existing QPPs to the answer level (using the score of the top item, which is provided as the answer).

However, QPP for conversational recommendation has not been addressed. In particular, we are interested in creating a prediction framework for identifying poorly performing or failed conversations in a recommendation setting. We postulate that these predictors can be useful in several use cases, for instance knowing when to ask for clarifications, or when the user's target item cannot be found. Towards achieving this goal, we explore score-based predictors, adapting them to the multi-turn nature of the task. In the next section, we define the CRS task; later, in Section 4, we define our CPP framework.

3. Conversational Image Recommendation

Figure 1 describes the context of dialog-based image recommendation in a CRS. At each interaction turn, the user provides a critique of the current recommendation (candidate item) back to the system, aimed at directing it towards the desired target item. More formally, at a given interaction turn k, the user provides textual feedback f_k on the current top-ranked candidate item i_{k,1}. Based on this feedback, the conversational recommendation system C() provides a new ranking, i.e.: C(i_{k,1}, f_k) → S_k, where S_k is a ranking of n items with corresponding descending retrieval scores s_1 ... s_n, i.e.: S_k = [⟨i_{k+1,1}, s_1⟩, ..., ⟨i_{k+1,n}, s_n⟩].

However, it is challenging to train and evaluate a natural language-based CRS. For training, reinforcement learning (RL) is widely used, as it allows optimising the recommendation model based on long-term rewards [41], i.e. based not just on retrieving the correct item in any current iteration, but also on retrieving it in later iterations. However, such a model needs to be trained while interacting with an environment, and obtaining many samples is hard when relying on real users [41, 42]. For evaluation, ideally human users are needed to judge the system's efficiency and user satisfaction [43]. Instead, user simulators are deployed as surrogates for human users, trained on relative caption data - a form of human-annotated dialogues on pairs of images. Recommendation models trained and evaluated using user simulators have been found to be correlated with human satisfaction [6].

Specifically, for the purposes of training a user simulator with human-annotated dialogues, Guo et al. [6] proposed the relative captioning task. In this task, human annotators recruited through crowdsourcing are placed in a context of online shopping, where the CRS acts as the shopping assistant and they play the role of the customer. During the process, annotators are presented with candidate recommended images of items, and are asked to provide single-instance critiques. In each interaction round, they are shown a given candidate item and, based on a given target item, they provide a critique on the current candidate item. These differences between the candidate and the target image are described with natural language phrases and form the relative captions. Hence, a relative captioning dataset contains tuples of the following form: ⟨i_t, i_c, tq_{c,t}⟩, where i_t is a representation of the target item (for instance an image), i_c is the current candidate item being presented to the user, and tq_{c,t} is the critique by the user on the candidate, intended to direct the system more towards the target. Relative captioning data can be used to train a user simulator, which is then deployed for training or evaluating a CRS [6, 10, 11, 16, 44, 45].

Using a user simulator for evaluation, the overall success of a CRS system can be reliably measured, in an offline Cranfield-like setting, by using ranking evaluation measures, such as NDCG, upon the ranked list of recommendations produced at each turn. From such an evaluation, it can be seen that even after 10 turns, some CRS models may not be able to identify the target item for some conversations. For this reason, making a prediction as to the likelihood of a user being satisfied with a conversation may have utility in improving the user experience. In the next section we introduce our proposal for conversational performance prediction for CRS.

4. Performance Prediction in Conversational Recommendation

Our aim for conversational performance prediction differs from existing approaches to QPP in a number of ways. While QPP focuses on estimating the relevance of a ranking to a given single query (single-turn), to predict the user's satisfaction with a conversation, we need to take into account the nature of the task, which is to consider the ranking quality across multiple turns. Another important difference is that many QPP techniques are based on textual queries and textual documents. In contrast, in our fashion-based CRS, the "units of retrieval" are images, with embedded representations - this precludes the use of textual content-based predictors. Furthermore, our "query units" are critiques, which are based on the retrieval of the previous turn. Therefore, it can be seen that there is no clear distinction between pre-retrieval and post-retrieval predictors, since what is considered post-retrieval for one turn could be seen as a pre-retrieval predictor for the following turn. For this reason, we propose a new framework for performance prediction in a conversational setting, in particular conversational fashion retrieval, which we describe in Section 4.1 below. Later, in Section 4.2, we describe the initial score-based predictors we adapt to this framework.

4.1. CPP Framework

We present a framework for Conversational Performance Prediction (CPP) applied to the domain of fashion recommendation for image retrieval [6, 16]. In this regard, we define recommendation success as the identification of the target image item by the system before a maximum number of turns is reached, which corresponds to a user being satisfied with the conversation. More formally, the CPP task can be described as a function of the form

CPP(F, S) → ℝ

where F is a sequence of feedback critiques f over 1 or more turns, and S is a sequence of result lists consisting of retrieval scores, over 1 or more turns.

This framework can be instantiated for single turns, or multiple turns. For instance, in a single-turn setting, we can instantiate the CPP task at a given turn k, i.e.:

CPP_single([f_k], [s_k]).

On the other hand, for two consecutive turns, k and k+1, prediction takes the following form:

CPP_consecutive([f_k, f_{k+1}], [s_k, s_{k+1}]).

Overall, from the above different formulations, it is clear that CPP is a distinct task from QPP that can be addressed by different families of predictors. In this initial work, we adapt one category of score-based QPP predictors into the CPP framework, which we discuss further below.

4.2. Score-based Predictors for CPP

In this work, we are inspired by post-retrieval predictors that study the distributions of retrieval scores and the use of reference lists, as introduced in Section 2.1.2. In particular, we have the following initial intuitions concerning successful interactions in the CRS task:

• For a single turn, if the score of the top-ranked item(s) is high, then the system has a clear representation of the user's desired item, and it can find item(s) that closely match that representation.
• In a successful conversation, the scores of the top-ranked item(s) will increase across multiple turns, as the system becomes more confident in its predictions.
• In a successful conversation, the retrieved items become more similar across turns, as the system becomes more confident in its predictions and focuses on the correct part of the item catalogue.

Adapting the notation of Section 4.1 to disregard the feedback sequences, we define a number of score-based CPPs, for single turns - in the form CPP([s_k]) - and for consecutive turns - CPP([s_k, s_{k+1}]). All predictors are described in Table 1. For instance, top-1 denotes the maximum score of any retrieved item, while mean denotes the average of the scores of the retrieved items. When applying these predictors, we also denote the turn k at which the predictor is calculated, i.e. top-1@k is the maximum score of any item retrieved in the ranking produced for turn k. In the remainder of this paper, we evaluate these predictors on several conversational fashion recommendation datasets.

Table 1
Proposed CPP predictors according to the number of turns involved.

  Single-turn                               Consecutive Turns
  Top-1 item score (maximum score)          Difference in maximum score
  Mean score of top-n items                 Overlap of top-ranked items
  Standard deviation (sd) of top-n items

5. Experimental Setup

We now experiment to address salient aspects of both the nature of the predictors (single-turn and consecutive-turn), as well as the accuracy of the predictors at different prediction horizons, i.e., at what point can a prediction be made, and how does it correspond to the effectiveness of the CRS, as measured at a later turn. In particular, we measure short-term horizons (i.e., can we predict the effectiveness of the next turn?); and long-term horizons (i.e., can we predict the effectiveness of the last turn?); as well as measuring the longevity of the prediction (i.e., how useful is an early prediction?). Focusing initially on single-turn predictors, our first research question is:

RQ1 Can we predict conversation performance with predictors based on retrieval scores of a single turn, in terms of (a) long-term and (b) short-term prediction, as well as (c) longevity?

Secondly, we consider the consecutive-turn predictors:

RQ2 Can we predict conversation performance with predictors based on (a) differences in retrieval scores between consecutive turns and (b) overlap in retrieved items of two consecutive turns?

To evaluate our CPP approaches, we use the Shoes dataset [5, 6], which contains one relative critique (describing relative differences between recommended and target image pairs) for pairs of shoe images, and the Dresses & Shirts categories of the Fashion IQ dataset [16], which contains two relative captions per candidate-target pair.

For a CRS, we apply a supervised GRU sequential recommendation model [6, 46], which is trained using triplet loss and uses the natural language feedback and the previous recommended images as input, thus maximizing short-term rewards. To train our recommendation model, we use a recently developed user simulator for dialog-based interactive image retrieval, based on the relative captioning task [6]. The GRU model is configured to retrieve 100 items at each turn.

In QPP, the accuracy of predictors is evaluated at the query level (a given query is easy or difficult compared to other queries in a set). Specifically, a ranking of queries by the effectiveness of a system, i.e., in terms of Mean Average Precision (the ground truth), is correlated with a ranking induced by a predictor. In contrast, we evaluate CPP predictors at the conversation level (across multiple dialog turns). Consequently, for the ground truth, we evaluate the effectiveness of each conversation at identifying the user's target item - more specifically, by considering the rank of the target item at a specific turn of the conversation. Following existing CRS work [6, 10, 11, 16, 44], we set the maximum number of turns to be 10.

In this regard, for our proposed single-turn predictors in Table 1, we use three different ground truth settings: the rank of the target item at the end of the conversation (turn 10); the rank of the target item during the conversation, i.e. at a given turn k; and the rank of the target item directly after the prediction is made (i.e. k+1 for a prediction at turn k). Through these different ground truth settings, we can measure CPP accuracy at both short-term and long-term horizons, as well as their longevity.

Finally, for quantifying the correlations, we report Spearman's ρ. Significance testing is achieved by examining the p-value associated with ρ, which indicates the probability of an uncorrelated ranking producing a Spearman correlation as high as that observed.¹

¹ See also https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

6. Results

In this section we report experiments for score-based CPP predictors, for single-turn (Section 6.1) and consecutive-turn (Section 6.2) scenarios.

Table 2
Results of single-turn predictors for short- and long-term prediction of the rank of target items at various turns. * denotes significant correlations; for Shoes, all correlations are significant, so * is omitted (p < 0.05). In the first group of columns, bold values denote the maximum correlation over all turns for the same predictor and the same ground truth value. For the other two sets of columns, bold values denote the highest performing predictor of the three examined single-turn predictors in the given evaluation setting for each turn - this is because comparison of correlation values across turns (rows) is not possible, since the ground truth changes for each row.

         Prediction at turn k             Prediction at turn 2             Prediction at turn k
         with rank@turn10                 with rank@turn k                 with rank@turn k+1
  k      top-1@k  mean@k   sd@k     k     top-1@k  mean@k   sd@k     k,k+1  top-1@k  mean@k   sd@k
Shoes
  2      -0.144   -0.141   -0.081   2     -0.405   -0.385   -0.059   2,3    -0.423   -0.413   -0.201
  3      -0.145   -0.145   -0.097   3     -0.423   -0.413   -0.201   3,4    -0.356   -0.355   -0.254
  4      -0.148   -0.148   -0.105   4     -0.357   -0.349   -0.183   4,5    -0.318   -0.317   -0.211
  5      -0.155   -0.153   -0.089   5     -0.314   -0.309   -0.177   5,6    -0.293   -0.292   -0.180
  6      -0.165   -0.165   -0.093   6     -0.270   -0.267   -0.163   6,7    -0.254   -0.254   -0.135
  7      -0.173   -0.173   -0.100   7     -0.230   -0.226   -0.140   7,8    -0.235   -0.234   -0.126
  8      -0.178   -0.177   -0.073   8     -0.213   -0.210   -0.136   8,9    -0.208   -0.207   -0.067
  9      -0.184   -0.183   -0.064   9     -0.175   -0.173   -0.1149  9,10   -0.183   -0.183   -0.064
  10     -0.183   -0.181   -0.026   10    -0.144   -0.141   -0.081
Dresses
  2      0.012    0.003    -0.036   2     -0.281*  -0.279*  -0.161*  2,3    -0.248*  -0.256*  -0.197*
  3      -0.017   -0.015   -0.004   3     -0.248*  -0.256*  -0.197*  3,4    -0.262*  -0.257*  -0.075*
  4      -0.045*  -0.047*  -0.014   4     -0.187*  -0.198*  -0.173*  4,5    -0.246*  -0.239*  -0.038
  5      -0.055*  -0.051*  -0.007   5     -0.128*  -0.140*  -0.137*  5,6    -0.206*  -0.198*  -0.008
  6      -0.063*  -0.063*  -0.041*  6     -0.079*  -0.092*  -0.102*  6,7    -0.172*  -0.168*  -0.034
  7      -0.069*  -0.072*  -0.033   7     -0.052*  -0.067*  -0.091*  7,8    -0.139*  -0.142*  -0.044*
  8      -0.075*  -0.076*  -0.021   8     -0.039   -0.051*  -0.072*  8,9    -0.103*  -0.101*  -0.000
  9      -0.073*  -0.071*  -0.018   9     -0.005   -0.018   -0.053*  9,10   -0.073*  -0.071*  -0.018
  10     -0.080*  -0.078*  0.003    10    0.0127   0.003    -0.036
Shirts
  2      -0.092*  -0.089*  -0.074*  2     -0.305*  -0.298*  -0.141*  2,3    -0.297*  -0.305*  -0.201*
  3      -0.124*  -0.119*  -0.033   3     -0.297*  -0.305*  -0.201*  3,4    -0.336*  -0.326*  -0.03*
  4      -0.145*  -0.137*  0.011    4     -0.264*  -0.273*  -0.192*  4,5    -0.323*  -0.308*  0.019
  5      -0.148*  -0.142*  -0.016   5     -0.228*  -0.231*  -0.157*  5,6    -0.305*  -0.293*  0.018
  6      -0.139*  -0.134*  -0.003   6     -0.198*  -0.206*  -0.155*  6,7    -0.248*  -0.238*  0.026
  7      -0.152*  -0.150*  -0.003   7     -0.166*  -0.168*  -0.122*  7,8    -0.203*  -0.196*  0.017
  8      -0.160*  -0.153*  0.031    8     -0.1346* -0.135*  -0.096*  8,9    -0.192*  -0.184*  0.049*
  9      -0.149*  -0.142*  0.003    9     -0.120*  -0.118*  -0.089*  9,10   -0.149*  -0.142*  0.003
  10     -0.147*  -0.138*  0.053*   10    -0.092*  -0.089*  -0.074*

6.1. RQ1 - Single-Turn Predictors

Table 2 shows the results for the three single-turn predictors, namely: the score of the top-ranked item at a given turn k (denoted top-1@k); the mean value of all top-ranked items in the recommendation list at a given turn (mean@k); and the standard deviation of the scores of all top-ranked items (sd@k).

The table is grouped into three sets of columns defining the prediction turn and the ground truth turn. Specifically, Prediction at turn k with rank@turn10 addresses long-term prediction; the middle group, Prediction at turn 2 with rank@turn k, addresses whether prediction at an early turn can help identify success at early or late turns; finally, the third group, Prediction at turn k with rank@turn k+1, addresses short-term prediction.

We first examine the first group of columns, which aims to determine the extent to which the overall conversation can be successfully predicted (i.e. the ground truth is the rank of the target item at turn 10). Overall, the correlations² are weak (-0.184 is the strongest observed for Shoes, and -0.160 for Shirts; Dresses is lower still at -0.080), yet significant (p < 0.05). This suggests the difficulty of the long-term prediction task. We do observe that correlations are relatively higher as the prediction turn increases - thus indicating that it is easier to predict performance at turn 10 using evidence of the ranking at turn 10. Finally, among the predictors, the maximum score at each turn, along with the mean score, exhibits higher correlations than the standard deviation. To answer RQ1(a), we cannot sufficiently predict long-term conversation performance using single-turn score-based predictors.

² In our analysis, we ignore the sign of the correlation - indeed, the observed correlations are negative, as our CRS system uses representation distances rather than similarities.

Turning next to the second group of columns, we observe stronger correlations. Indeed, the overall higher correlations suggest that predicting at turn 2 gives more accurate predictions, particularly when aiming to predict conversation performance at turn 2 or shortly thereafter. In particular, for the Shoes dataset, medium strength correlations of -0.423 are observed - these are in line with the best accuracy of some QPP predictors for adhoc search tasks [12, 25, 30, 24]. Correlations of -0.305 and -0.281 are observed for Shirts and Dresses, respectively.
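To make the evaluation methodology behind Table 2 concrete, the single-turn predictors and the conversation-level Spearman correlation can be sketched as follows. This is a minimal illustrative sketch rather than the authors' implementation: the function names and toy inputs are our own, and we assume that each conversation supplies the retrieval scores of its top-n recommended items at the prediction turn, together with the rank of the target item at the chosen ground-truth turn.

```python
import numpy as np
from scipy.stats import spearmanr

def single_turn_predictors(scores):
    """Score-based single-turn CPP predictors over one turn's retrieval scores."""
    scores = np.asarray(scores, dtype=float)
    return {
        "top-1": scores.max(),     # score of the top-ranked item (top-1@k)
        "mean": scores.mean(),     # mean score of the top-n items (mean@k)
        "sd": scores.std(ddof=1),  # standard deviation of the top-n scores (sd@k)
    }

def cpp_correlation(per_conv_scores, target_ranks, predictor="top-1"):
    """Spearman's rho between predictor values (one per conversation, computed
    at the prediction turn) and the ground-truth rank of the target item at the
    chosen horizon (e.g. turn 10 for long-term, turn k+1 for short-term)."""
    preds = [single_turn_predictors(s)[predictor] for s in per_conv_scores]
    rho, p_value = spearmanr(preds, target_ranks)
    return rho, p_value
```

Note that, because the CRS studied here scores items by embedding distance, the observed correlations are negative, and only the magnitude of ρ is interpreted.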
Among the predictors, top-1@k is again most successful Figure 2: Results of the difference in the top-1 ranked item on Shoes, but on Dresses and Shirts, where correlations (maximum score) between pairs of consecutive turns as a con- are lower, the overall picture is less clear across different secutive turn CPP predictor for each of the datasets. prediction horizons (i.e. as the ground truth π‘˜ is varied). For these datasets, mean is the most accurate for most values of π‘˜ β‰₯ 2. In general, when predicting conver- turn. In RQ2 below, we focus on short-term (next turn sation performance using single-turn retrieval scores, prediction), as the most promising CPP setting. prediction becomes less accurate as the longevity of the prediction increases, thus answering RQ1(c). 6.2. RQ2 - Consecutive-Turn predictors Finally, the last set of columns of the table shows the correlation of the scores of each turn π‘˜ (as a predictor) Figure 2 presents the results of our first consecutive-turn when the effectiveness of the following turn π‘˜ + 1 is used predictor, namely the difference in maximum score (top- as the ground truth (i.e. applying a short-term horizon). 1 item) for each pair of turns π‘˜, π‘˜ + 1 when predicting The scores of both the top-ranked item and the average the rank of the target item at turn π‘˜ + 1. Within the score of the top-ranked items at turn π‘˜ sufficiently pre- figure, each dataset is represented as a separate curve. dict the rank of turn π‘˜ +1, especially for early turns. This Considering the different datasets, for Shirts and Dresses, trend weakens as the number of turns increases, but the we observe a similar trend across turns, starting from observed correlations remain quite high for some cases. 
Figure 2: Results of the difference in the top-1 ranked item (maximum score) between pairs of consecutive turns as a consecutive-turn CPP predictor for each of the datasets.

6.2. RQ2 - Consecutive-Turn predictors

Figure 2 presents the results of our first consecutive-turn predictor, namely the difference in maximum score (top-1 item) for each pair of turns k, k+1 when predicting the rank of the target item at turn k+1. Within the figure, each dataset is represented as a separate curve. Considering the different datasets, for Shirts and Dresses, we observe a similar trend across turns, starting from a correlation of -0.18 (the maximum value obtained for this predictor) at turns 2-3, which gradually decreases as the number of turns increases. In contrast, Shoes does not achieve any correlation stronger than -0.016 at turns 3-4. Therefore, we observe only weak correlations for this predictor at short-term prediction, although some correlations are significant. To answer RQ2(a), using the scores of two consecutive turns does not sufficiently predict conversation performance, and is indeed generally less effective than the predictors examined in RQ1.

Next, we test our final predictor, which considers the overlap of top-ranked items (i.e., the size of the intersection) between consecutive turns. We considered various rank cutoff values for calculating the overlap, ranging from rank 5 to rank 1000, and all pairs of turns. Figure 3 reports the observed correlations (y-axis), where each pair of turns is a curve, and the x-axis is the rank cutoff at which the overlap is calculated. Recall that we expect that when the retrieved items are generally similar, this may be indicative that the CRS is reaching a stable conclusion about the likely relevant items. If this occurs at a later turn, we may be further confident in the likely positive performance of the system.

Figure 3: For each dataset ((a) Shoes, (b) Dresses, (c) Shirts), results for the overlap of top-ranked items as a consecutive-turn predictor for all pairs of turns k, k+1 for a number of rank cutoff values.

On analysing Figure 3, for Dresses & Shirts (Figure 3(b) & (c), respectively) - which are both FashionIQ datasets - we observe a strengthening trend in the correlations as we increase the rank cutoff value (more items are considered). This happens for all pairs of turns except the initial turn. In addition, the correlations are stronger for later turns than earlier turns, indicating that this predictor is more useful for later turns (as expected). Indeed, improved prediction at later turns is particularly notable, as this contrasts with our results in RQ1, where earlier prediction was more accurate.

On the other hand, for the Shoes dataset, the highest correlations are observed for turns 3-4 and 4-5, and for cutoff values of 50 and 100. The correlations for item overlap in Shoes are weaker than for the other two datasets, contrasting with the observations in RQ1 (where Shoes exhibited higher correlations for the single-turn predictors than Dresses or Shirts). We note that, as a CRS dataset, Shoes is "easier" than Dresses (e.g. the GRU model can attain a Mean Reciprocal Rank of 0.2 at turn 10 on Shoes, compared to 0.075 at turn 10 on Dresses [10]). We postulate that early single-turn prediction works well on Shoes, as more conversations are answered at earlier turns; in contrast, on Dresses, more critiques are required for successful conversations, and the overlap-based evidence later in the conversation is therefore more useful for prediction.

Overall, these results suggest some weak-medium correlations (up to -0.25 ρ) for the overlap-based consecutive-turn predictor, thereby answering RQ2(b).
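Both consecutive-turn predictors reduce to one-line computations per pair of turns. The following is a minimal sketch under our own naming assumptions (`top1_score_difference`, `topk_overlap` are illustrative names, not our experimental code):

```python
def top1_score_difference(scores_k, scores_k1):
    """Consecutive-turn CPP predictor: change in the maximum
    retrieval score between turn k and turn k+1."""
    return max(scores_k1) - max(scores_k)

def topk_overlap(items_k, items_k1, cutoff=50):
    """Consecutive-turn CPP predictor: size of the intersection of
    the top-`cutoff` item ids retrieved at turns k and k+1."""
    return len(set(items_k[:cutoff]) & set(items_k1[:cutoff]))
```

Per conversation, either function yields one value for a chosen pair of turns (k, k+1); correlating these values across conversations against the rank of the target item at turn k+1 produces the curves shown in Figures 2 and 3.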
7. Conclusions

We have presented a novel framework for conversational performance prediction (CPP) that aims to detect the factors that indicate effective performance by taking into account the multi-turn aspect of the task of conversational interactive image retrieval. In this regard, we proposed a number of predictors that can be used for both short-term and long-term prediction, and explored the retrieval scores and retrieved items of both a single turn and consecutive turns. We conducted our analyses on three widely-used relative captioning datasets for conversational recommendation systems (CRS) and examined the extent to which our proposed predictors are indicative of the ranking of the users' target items in the recommendation list.

In our analysis of the proposed single-turn predictors, we found that examining the score of the top-ranked items had a medium correlation with the effectiveness of the conversation, particularly the effectiveness at early turns. Indeed, we observed a Spearman's ρ of 0.423 on the Shoes dataset, which is comparable to correlations observed for standard QPP predictors on adhoc search tasks [12, 24, 25, 30]. However, these single-turn predictors became less useful at predicting the success of later turns. On the other hand, among our consecutive-turn predictors, simply examining the overlap of the retrieved lists had a weak-medium correlation with late-turn effectiveness on two out of our three datasets.

Overall, the weak-medium correlations observed for our simple unsupervised predictors of different families suggest that there is significant scope to extend this work, for instance by introducing supervised predictors. Moreover, our proposed framework for CPP is generalisable; for instance, we can also envisage predictors that examine aspects of the critiques (for instance, repeated critiques), or characteristics of the retrieved images (e.g. whether item colours or styles are varied). We leave these for future work. Furthermore, we also aim to extend our analyses to a classification task that aims to predict whether a conversation will fail, as well as testing the efficacy of interventions for failing conversations.

Finally, this study takes place in the context of user simulators for the evaluation of CRS; such user simulators are common in the training and evaluation of conversational systems. Logging the interactions of a deployed CRS would allow us to verify the results reported here.

Acknowledgments

Maria Vlachou's work was supported by the UKRI Centre for Doctoral Training in Socially Intelligent Artificial Agents, Grant number EP/S02266X/1.

References

[1] T. M. Brill, L. Munoz, R. J. Miller, Siri, Alexa, and other digital assistants: a study of customer satisfaction with artificial intelligence applications, Journal of Marketing Management 35 (2019) 1401–1436.
[2] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, ACM Computing Surveys 54 (2021) 1–36.
[3] F. N. Tou, M. D. Williams, R. Fikes, D. A. Henderson Jr, T. W. Malone, Rabbit: An intelligent database assistant, in: AAAI, 1982, pp. 314–318.
[4] L. Chen, P. Pu, Critiquing-based recommenders: survey and emerging trends, User Modeling and User-Adapted Interaction 22 (2012) 125–150.
[5] T. L. Berg, A. C. Berg, J. Shih, Automatic attribute discovery and characterization from noisy web data, in: Proc. ECCV, 2010, pp. 663–676.
[6] X. Guo, H. Wu, Y. Cheng, S. Rennie, G. Tesauro, R. Feris, Dialog-based interactive image retrieval, in: Proc. NeurIPS, 2018, pp. 678–688.
[7] V. S. Bursztyn, J. Healey, E. Koh, N. Lipka, L. Birnbaum, Developing a conversational recommendation system for navigating limited options, in: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–6.
[8] Y. Jin, W. Cai, L. Chen, N. N. Htun, K. Verbert, MusicBot: Evaluating critiquing-based music recommenders with conversational interaction, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 951–960.
[9] W. Cai, Y. Jin, L. Chen, Critiquing for music exploration in conversational recommender systems, in: 26th International Conference on Intelligent User Interfaces, 2021, pp. 480–490.
[10] Y. Wu, C. Macdonald, I. Ounis, Partially observable reinforcement learning for dialog-based interactive recommendation, in: Proc. RecSys, 2021, pp. 241–251.
[11] T. Yu, Y. Shen, H. Jin, A visual dialog augmented interactive recommender system, in: Proc. KDD, 2019, pp. 157–165.
[12] S. Cronen-Townsend, Y. Zhou, W. B. Croft, Predicting query performance, in: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, 2002, pp. 299–306.
[13] B. He, I. Ounis, Inferring query performance using pre-retrieval predictors, in: International symposium on string processing and information retrieval, Springer, 2004, pp. 43–54.
[14] C. Hauff, D. Hiemstra, F. de Jong, A survey of pre-retrieval query performance predictors, in: Proceedings of the 17th ACM conference on Information and knowledge management, 2008, pp. 1419–1420.
[15] D. Carmel, E. Yom-Tov, Estimating the query difficulty for information retrieval, Synthesis Lectures on Information Concepts, Retrieval, and Services 2 (2010) 1–89.
[16] H. Wu, Y. Gao, X. Guo, Z. Al-Halah, S. Rennie, K. Grauman, R. Feris, Fashion IQ: A new dataset towards retrieving images by natural language feedback, 2020. arXiv:1905.12794.
[17] J. Peng, C. Macdonald, B. He, I. Ounis, A study of selective collection enrichment for enterprise search, in: Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, 2009, pp. 1999–2002.
[18] S. Cronen-Townsend, Y. Zhou, W. B. Croft, A framework for selective query expansion, in: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, CIKM '04, 2004, pp. 236–237.
[19] C. Macdonald, R. L. Santos, I. Ounis, On the usefulness of query features for learning to rank, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, 2012, pp. 2559–2562.
[20] Y. Zhao, F. Scholer, Y. Tsegay, Effective pre-retrieval query performance prediction using similarity and variability evidence, in: European conference on information retrieval, Springer, 2008, pp. 52–64.
[21] F. Scholer, S. Garcia, A case for improved evaluation of query difficulty prediction, in: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, 2009, pp. 640–641.
[22] J. Mothe, L. Tanguy, Linguistic features to predict query difficulty, in: ACM Conference on Research and Development in Information Retrieval (SIGIR), Predicting Query Difficulty: Methods and Applications workshop, 2005, pp. 7–10.
[23] B. He, I. Ounis, Query performance prediction, Information Systems 31 (2006) 585–594.
[24] Y. Zhou, W. B. Croft, Query performance prediction in web search environments, in: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 2007, pp. 543–550.
[25] A. Shtok, O. Kurland, D. Carmel, Predicting query performance by query-drift estimation, in: Conference on the Theory of Information Retrieval, Springer, 2009, pp. 305–312.
[26] M. Mitra, A. Singhal, C. Buckley, Improving automatic query expansion, in: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 1998, pp. 206–214.
[27] R. Cummins, Document score distribution models for query performance inference and prediction, ACM Transactions on Information Systems (TOIS) 32 (2014) 1–28.
[28] F. Diaz, Performance prediction using spatial autocorrelation, in: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 2007, pp. 583–590.
[29] A. Shtok, O. Kurland, D. Carmel, Query performance prediction using reference lists, ACM Transactions on Information Systems (TOIS) 34 (2016) 1–34.
[30] A. Shtok, O. Kurland, D. Carmel, Using statistical decision theory and relevance models for query-performance prediction, in: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, 2010, pp. 259–266.
[31] J. Lafferty, C. Zhai, Document language models, query models, and risk minimization for information retrieval, in: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, 2001, pp. 111–119.
[32] V. Lavrenko, W. B. Croft, Relevance-based language models, in: ACM SIGIR Forum, volume 51, ACM New York, NY, USA, 2017, pp. 260–267.
[33] W. Webber, A. Moffat, J. Zobel, A similarity measure for indefinite rankings, ACM Transactions on Information Systems (TOIS) 28 (2010) 1–38.
[34] O. Kurland, A. Shtok, S. Hummel, F. Raiber, D. Carmel, O. Rom, Back to the roots: A probabilistic framework for query-performance prediction, in: Proceedings of the 21st ACM international conference on Information and knowledge management, 2012, pp. 823–832.
[35] J. Kang, K. Condiff, S. Chang, J. A. Konstan, L. Terveen, F. M. Harper, Understanding how people use natural language to ask for recommendations, in: Proc. RecSys, 2017, pp. 229–237.
[36] I. Sekulić, M. Aliannejadi, F. Crestani, Exploiting document-based features for clarification in conversational search, in: European Conference on Information Retrieval, Springer, 2022, pp. 413–427.
[37] M. Aliannejadi, H. Zamani, F. Crestani, W. B. Croft, Asking clarifying questions in open-domain information-seeking conversations, in: Proceedings of the 42nd international ACM SIGIR conference on Research and development in information retrieval, 2019, pp. 475–484.
[38] J. Kiesel, A. Bahrami, B. Stein, A. Anand, M. Hagen, Toward voice query clarification, in: The 41st international ACM SIGIR conference on research & development in information retrieval, 2018, pp. 1257–1260.
[39] H. Zamani, S. Dumais, N. Craswell, P. Bennett, G. Lueck, Generating clarifying questions for information retrieval, in: Proceedings of the Web Conference 2020, 2020, pp. 418–428.
[40] H. Roitman, S. Erera, G. Feigenblat, A study of query performance prediction for answer quality determination, in: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, 2019, pp. 43–46.
[41] W. Shi, K. Qian, X. Wang, Z. Yu, How to build user simulators to train RL-based dialog systems, arXiv preprint arXiv:1909.01388 (2019).
[42] X. Li, Z. C. Lipton, B. Dhingra, L. Li, J. Gao, Y.-N. Chen, A user simulator for task-completion dialogues, arXiv preprint arXiv:1612.05688 (2016).
[43] N. Tintarev, J. Masthoff, A survey of explanations in recommender systems, in: Proc. IEEE data engineering workshop, IEEE, 2007, pp. 801–810.
[44] Y. Wu, C. Macdonald, I. Ounis, Partially observable reinforcement learning for dialog-based interactive recommendation, in: Proceedings of ACM RecSys, 2021.
[45] Y. Wu, C. Macdonald, I. Ounis, Multimodal conversational fashion recommendation with positive and negative natural-language feedback, in: Proceedings of ACM Conversational User Interfaces, 2022.
[46] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, Session-based recommendations with recurrent neural networks, arXiv preprint arXiv:1511.06939 (2015).