Performance Predictors for Conversational Fashion Recommendation

Maria Vlachou 1, Craig Macdonald 1
1 University of Glasgow, UK

Abstract
In Conversational Recommendation Systems (CRS), a user can provide natural language feedback on suggested items, which the recommender uses to produce improved suggestions. Therefore, the success of a user's conversation with the CRS is determined by how well the system is able to interpret the user's feedback and the quality of the recommendations. Knowing whether a conversation is likely to be successful may allow the CRS to adjust accordingly - for instance, changing its retrieval strategy, or asking a clarifying question. Existing work on Query Performance Prediction (QPP) has examined a number of predictors that indicate the effectiveness of a search engine's ranking in response to a query. Inspired by existing work in QPP, we propose a framework for Conversational Performance Prediction (CPP) that aims to predict conversation failures by considering the recommendation ranking at different turns of a conversation, either one turn at a time, or by considering multiple consecutive turns. In this regard, we adapt post-retrieval predictors to address the multi-turn nature of the CRS task. We conduct our analysis on the Shoes and FashionIQ Shirts & Dresses datasets. In particular, as a ground truth, we measure conversation difficulty by the effectiveness of the ranking at a given turn of the conversation. Overall, we find some promise in score-based retrieval predictors for CPP, obtaining medium strength correlations with conversation difficulty - for instance, observing a Spearman's ρ of 0.423 on the Shoes dataset, which is comparable to correlations observed for standard QPP predictors on adhoc search tasks.

1. Introduction

Traditionally, Recommender Systems (RS) help users to find items of interest on the basis of user feedback in terms of ratings, clicks or reviews.
In contrast, Conversational Recommendation Systems (CRS), such as personal digital assistants [1], have facilitated more complex recommendation settings by suggesting items in response to voice or (natural language) chat interactions. In particular, a CRS allows a multi-turn dialogue with users and aims to assist them with achieving a number of task-oriented goals [2]. Indeed, at each turn users can provide their feedback or critique [3], which helps the system to improve recommendations [4].

One important aspect of natural language-based CRS is that they allow users to explore the range of available options and elicit their preferences. For example, Bursztyn et al. [7] created a multi-modal system, where users navigate in a setting of limited options, such as finding a restaurant near their location. In this setting, users start exploring an initial set of restaurants and have the opportunity to see their details by clicking through the options, while they are asked about the reasons for any negative feedback they provide. Another example of user exploration is MusicBot [8, 9], a music chatbot that first collects users' preferences and then makes suggestions based on different techniques of critiquing the song recommendations. In our work, we are focused upon conversational fashion image recommendation [5, 6, 10], an example of which is shown in Figure 1. In this task, the user has a target item in mind, and provides textual feedback (critiques) to direct the system towards retrieving images of fashion products that are more similar to their perceived target item.

Figure 1: Example of dialog-based recommendation in CRS. Pictures and dialogues from the Shoes dataset [5, 6].

4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, September 18-23 2022, Seattle, WA, USA.

However, not all conversations may lead to a satisfying outcome for the user.
This can be easily quantified in offline evaluation scenarios, where the CRS is evaluated across a pre-defined number of turns. For example, Yu et al. [11] found that, although users had the option to explore a list of options for a number of turns, the system was unable to find a relevant recommendation by turn 7, which might mean that the algorithm was still exploring the space. Also, in Wu et al. [10], the target item was found by the system at rank 1 in only 42% of conversations after (a maximum of) 10 turns. Therefore, exploration might result in an increased number of turns, which on the one hand might mean more engaged users [8], but at the same time suggests that conversations might often fail (i.e., the target item is not found). In this regard, we are interested in identifying indicators that can detect when this happens - for example, a conversation could fail because the system is unable to find the target item, or because the target item is not available.

In what follows, inspired by existing work on Query Performance Prediction (QPP) (e.g., [12, 13, 14, 15]), we aim to predict conversational failures by identifying specific indicators that are correlated with failure. In particular, we aim to determine the quality of multi-turn critiquing-based CRS recommendation by proposing predictors that consider the multi-turn aspect of conversational recommendation. The proposed predictors address characteristics of the retrieval scores of the top-recommended items and can predict poor performance across a shorter or longer number of turns in the conversation, which we call prediction horizons. In summary, this work makes the following contributions: (i) We propose a framework for Conversational Performance Prediction (CPP), which extends the existing work on QPP to a conversational recommendation setting; (ii) We show how to adapt the QPP evaluation methodology to a multi-turn conversational setting, which allows us to evaluate CPP predictors for both short- and long-term prediction horizons; (iii) We evaluate some of our proposed predictors on the Shoes [5, 6] dataset and the Fashion IQ Dresses and Shirts categories [16], using a state-of-the-art user simulator [6]. The rest of the paper is structured as follows: Section 2 presents the existing research on QPP, including pre- and post-retrieval predictors, as well as their probabilistic interpretation; Section 3 describes the conversational image recommendation task; Section 4 outlines our new proposed framework and predictors; Section 5 describes our experimental setup; Sections 6 & 7 present our results and provide concluding remarks.

m.vlachou.1@research.gla.ac.uk (M. Vlachou); craig.macdonald@glasgow.ac.uk (C. Macdonald)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

2. Related Work

In order to predict why a conversation with a CRS might fail, we need to identify indicators that show when the user is unable to find the target item during the interaction. In this regard, we are inspired by existing work from Query Performance Prediction (QPP), which we discuss in Section 2.1; later, in Section 2.2, we discuss applications of QPP in conversational contexts.

2.1. Query Performance Prediction

Traditionally, QPP is used to predict the effectiveness of a search results page returned in response to a query, in the absence of human relevance judgments [15]. It has applications in selective retrieval approaches [17, 18] and query features for learning-to-rank [19], to name but a few. Query performance predictors are generally grouped into pre-retrieval and post-retrieval predictors, which we discuss further below.

2.1.1. Pre-retrieval Query Performance Predictors

Pre-retrieval predictors are used to estimate the performance of queries before the retrieval stage, and are therefore independent of the search performed and the ranked list of results [14]. This means that pre-retrieval predictors base their predictions on properties of query terms or corpus-based statistics [12, 13, 14, 20, 21, 22]. Examples of pre-retrieval predictors that describe the statistical properties of the query terms or the corpus include the query length (number of non-stop words in the query), the standard deviation of the inverse document frequency of the query terms, the simplified query clarity score (SCS), which measures the occurrence of a query term in the query relative to its occurrence in the collection, and AvICTF, which considers the overall informativeness of the query terms using the collection model [23]. Another class of pre-retrieval predictors refers to linguistic features of the queries, such as syntactic complexity (distance between syntactically linked words) and word polysemy (number of semantic classes a word belongs to) [22]. Overall, with limited information available before retrieval commences, pre-retrieval predictors are widely considered less accurate for performance prediction than post-retrieval predictors [14].

2.1.2. Post-retrieval Query Performance Predictors

On the other hand, post-retrieval predictors are applied on the list of the top-ranked retrieved documents, and therefore use the relevance scores or the (textual) contents of the returned items. A first group of post-retrieval predictors examines the difference of the result list from the corpus, or the focus of the result list. For example, Clarity [12] measures the focus of the resulting ranking with respect to the corpus using the KL divergence between their respective language models, while the Weighted Information Gain (WIG) corresponds to the difference between the average retrieval score of the result list and that of the corpus [24]. A second group considers the distribution of the retrieval scores of the top-ranked items. Predictors in this group include Normalized Query Commitment (NQC) [25] (the standard deviation of the retrieval scores in the result list). The standard deviation is considered to be negatively correlated with the amount of query drift (the non-related information in the result list) [26]. This group also includes the modeling of retrieval scores: the top-ranked items can be modeled as a mixture of distributions corresponding to relevant and non-relevant items [27]. Another related predictor is autocorrelation [28], which assumes that documents whose vector space embeddings are closely related receive similar scores, and therefore, closely related scores would indicate similar performance.

A third group of post-retrieval predictors refers to the relation of the top-ranked retrieval scores with a particular reference list. Recently, a more generalised approach for estimating the effectiveness of a ranking was proposed, based on the assumption that high association with pseudo-effective reference lists and low association with pseudo-ineffective lists indicates effectiveness [29]. One example is the utility estimation framework (UEF) [30], which estimates the utility of a given ranking with respect to how much it represents an underlying information need [31]. The utility is estimated by the expected similarity between a given document ranking and those induced by estimates of relevance language models (these rankings are assumed to be representative of the information need) [32]. A similar predictor to the UEF approach is query feedback (QF) [24], which measures the overlap of top items between the result list and a reference list retrieved from the corpus using a language model induced from the result list. Autocorrelation [28] can also fall under this category, if we compare the result list of the original retrieval scores with a reference list that contains either a perturbed version of the scores diffused in space, or a list with the averaged values from multiple retrievals for the same query. Lastly, rank-biased overlap (RBO) [33], an inter-list similarity measure of the expected average overlap between two rankings, can also be applied to the QPP task.

Finally, we note that some recent QPP work (e.g. [29, 34]) has focused upon probabilistic frameworks for QPP, which can integrate both pre-retrieval and post-retrieval predictors. However, many of the underlying intuitions encapsulated by these frameworks are already addressed in the previously described predictors.

2.2. Query Performance Prediction in Conversational Search

Natural language-based conversational systems allow users to express complex feedback through a dialogue, thus resulting in more natural interactions [35]. To be able to predict the likelihood of success of a conversation, we need to consider the salient aspects of the conversational setting, such as the users' feedback and the iterative turn-based nature of the interaction process.

However, while QPP has been widely explored for (single turn) queries in search settings, the area of conversational search or recommendation has seen much less work. For example, one recent work uses the predicted effectiveness of the top-retrieved documents - specifically, extracted features such as noun phrases or named entities - to decide when to generate clarifying questions [36]. Indeed, clarifications are useful for both the user and the system [37, 38, 39]. Also, Roitman et al. [40] examined a constrained retrieval setting, namely the interaction with a conversational assistant, where the assistant needs to decide whether the provided answer could be accepted. The authors built a classifier that determines the answer quality by adapting some existing QPPs to the answer level (using the score of the top item, which is provided as the answer).

However, QPP for conversational recommendation has not been addressed. In particular, we are interested in creating a prediction framework for identifying poorly performing or failed conversations in a recommendation setting. We postulate that these predictors can be useful in several use cases, for instance knowing when to ask for clarifications, or when the user's target item cannot be found. Towards achieving this goal, we explore score-based predictors, adapting them to the multi-turn nature of the task. In the next section, we define the CRS task; later, in Section 4, we define our CPP framework.

3. Conversational Image Recommendation

Figure 1 describes the context of dialog-based image recommendation in a CRS. At each interaction turn, the user provides a critique of the current recommendation (candidate item) back to the system, aimed at directing it towards the desired target item. More formally, at a given interaction turn k, the user provides textual feedback f_k on the current top-ranked candidate item i_{k,1}. Based on this feedback, the conversational recommendation system C() provides a new ranking, i.e.: C(i_{k,1}, f_k) → S_k, where S_k is a ranking of n items with corresponding descending retrieval scores s_1 ... s_n, i.e.: S_k = [⟨i_{k+1,1}, s_1⟩, ..., ⟨i_{k+1,n}, s_n⟩].

However, it is challenging to train and evaluate a natural language-based CRS. For training, reinforcement learning (RL) is widely used, as it allows optimising the recommendation model based on long-term rewards [41], i.e. based not just on retrieving the correct item in any current iteration, but also on retrieving it in later iterations. However, such a model needs to be trained while interacting with an environment, and obtaining many samples is hard when relying on real users [41, 42]. For evaluation, ideally human users are needed to judge the system's efficiency and user satisfaction [43]. Instead, user simulators are deployed as surrogates for human users, trained on relative caption data - a form of human-annotated dialogues on pairs of images. Recommendation models trained and evaluated using user simulators have been found to be correlated with human satisfaction [6].

Specifically, for the purposes of training a user simulator with human-annotated dialogues, Guo et al. [6] proposed the relative captioning task. In this task, human annotators recruited through crowdsourcing are placed in a context of online shopping, where the CRS acts as the shopping assistant and they play the role of the customer. During the process, annotators are presented with candidate recommended images of items, and are asked to provide single-instance critiques. In each interaction round, they are shown a given candidate item and, based on a given target item, they provide a critique on the current candidate item. These differences between the candidate and the target image are described with natural language phrases and form the relative captions. Hence, a relative captioning dataset contains tuples of the following form: ⟨i_t, i_c, tq_{c,t}⟩, where i_t is a representation of the target item (for instance an image), i_c is the current candidate item being presented to the user, and tq_{c,t} is the critique by the user on the candidate, intended to direct the system more towards the target. Relative captioning data can be used to train a user simulator, which is then deployed for training or evaluating a CRS [6, 10, 11, 16, 44, 45].

Using a user simulator for evaluation, the overall success of a CRS system can be reliably measured, in an offline Cranfield-like setting, by using ranking evaluation measures, such as NDCG, upon the ranked list of recommendations produced at each turn. From such an evaluation, it can be seen that even after 10 turns, some CRS models may not be able to identify the target item for some conversations. For this reason, making a prediction as to the likelihood of a user being satisfied with a conversation may have utility in improving the user experience. In the next section we introduce our proposal for conversational performance prediction for CRS.

4. Performance Prediction in Conversational Recommendation

Our aim for conversational performance prediction differs from existing approaches to QPP in a number of ways. While QPP focuses on estimating the relevance of a ranking to a given single query (single-turn), to predict the user's satisfaction with a conversation, we need to take into account the nature of the task, which is to consider the ranking quality across multiple turns. Another important difference is that many QPP techniques are based on textual queries and textual documents. In contrast, in our fashion-based CRS, the "units of retrieval" are images, with embedded representations - this precludes the use of textual content-based predictors. Furthermore, our "query units" are critiques, which are based on the retrieval of the previous turn. Therefore, it can be seen that there is no clear distinction between pre-retrieval and post-retrieval predictors, since what is considered post-retrieval for one turn could be seen as a pre-retrieval predictor for the following turn. For this reason, we propose a new framework for performance prediction in a conversational setting, in particular conversational fashion retrieval, which we describe in Section 4.1 below. Later, in Section 4.2, we describe the initial score-based predictors we adapt to this framework.

4.1. CPP Framework

We present a framework for Conversational Performance Prediction (CPP) applied to the domain of fashion recommendation for image retrieval [6, 16]. In this regard, we define recommendation success as the identification of the target image item by the system before a maximum number of turns is reached, which corresponds to a user being satisfied with the conversation. More formally, the CPP task can be described as a function of the form

CPP(F, S) → ℝ

where F is a sequence of feedback critiques f over 1 or more turns, and S is a sequence of result lists consisting of retrieval scores, over 1 or more turns.

This framework can be instantiated for single turns, or multiple turns. For instance, in a single-turn setting, we can instantiate the CPP task at a given turn k, i.e.:

CPP_single([f_k], [s_k]).

On the other hand, for two consecutive turns, k and k+1, prediction takes the following form:

CPP_consecutive([f_k, f_{k+1}], [s_k, s_{k+1}]).

Overall, from the above different formulations, it is clear that CPP is a distinct task from QPP that can be addressed by different families of predictors. In this initial work, we adapt one category of score-based QPP predictors into the CPP framework, which we discuss further below.

4.2. Score-based Predictors for CPP

In this work, we are inspired by post-retrieval predictors that study the distributions of retrieval scores and the use of reference lists, as introduced in Section 2.1.2. In particular, we have the following initial intuitions concerning successful interactions in the CRS task:

• For a single turn, if the score of the top-ranked item(s) is high, then the system has a clear representation of the user's desired item, and it can find item(s) that closely match that representation.
• In a successful conversation, the scores of the top-ranked item(s) will increase across multiple turns, as the system becomes more confident in its predictions.
• In a successful conversation, the retrieved items become more similar across turns, as the system becomes more confident in its predictions and focuses on the correct part of the item catalogue.

Adapting the notation of Section 4.1 to disregard the feedback sequences, we define a number of score-based CPPs, for single turns - in the form CPP([s_k]) - and for consecutive turns - CPP([s_k, s_{k+1}]). All predictors are described in Table 1. For instance, top-1 denotes the maximum score of any retrieved item, while mean denotes the average of the scores of the retrieved items. When applying these predictors, we also denote the turn k at which the predictor is calculated, i.e. top-1@k is the maximum score of any item retrieved in the ranking produced for turn k. In the remainder of this paper, we evaluate these predictors on several conversational fashion recommendation datasets.

Table 1
Proposed CPP predictors according to the number of turns involved.

  Single-turn                               Consecutive Turns
  Top-1 item score (maximum score)          Difference in maximum score
  Mean score of top-n items                 Overlap of top-ranked items
  Standard deviation (sd) of top-n items

5. Experimental Setup

We now experiment to address salient aspects of both the nature of the predictors (single-turn and consecutive-turn), as well as the accuracy of the predictors at different prediction horizons, i.e., at what point can a prediction be made, and how does it correspond to the effectiveness of the CRS, as measured at a later turn. In particular, we measure short-term horizons (i.e., can we predict the effectiveness of the next turn?); and long-term horizons (i.e., can we predict the effectiveness of the last turn?); as well as measuring the longevity of the prediction (i.e., how useful is an early prediction?). Focusing initially on single-turn predictors, our first research question is:

RQ1 Can we predict conversation performance with predictors based on retrieval scores of a single turn, in terms of (a) long-term and (b) short-term prediction, as well as (c) longevity?

Secondly, we consider the consecutive-turn predictors:

RQ2 Can we predict conversation performance with predictors based on (a) differences in retrieval scores between consecutive turns and (b) overlap in retrieved items of two consecutive turns?

To evaluate our CPP approaches, we use the Shoes dataset [5, 6], which contains one relative critique (describing relative differences between recommended and target image pairs) for pairs of shoe images, and the Dresses & Shirts categories of the Fashion IQ dataset [16], which contains two relative captions per candidate-target pair.

For a CRS, we apply a supervised GRU sequential recommendation model [6, 46], which is trained using triplet loss and uses the natural language feedback and the previous recommended images as input, thus maximizing short-term rewards. To train our recommendation model, we use a recently developed user simulator for dialog-based interactive image retrieval, based on the relative captioning task [6]. The GRU model is configured to retrieve 100 items at each turn.

In QPP, the accuracy of predictors is evaluated at the query level (a given query is easy or difficult compared to other queries in a set). Specifically, a ranking of queries by the effectiveness of a system, i.e., in terms of Mean Average Precision (the ground truth), is correlated with a ranking induced by a predictor. In contrast, we evaluate CPP predictors at the conversation level (across multiple dialog turns). Consequently, for the ground truth, we evaluate the effectiveness of each conversation at identifying the user's target item - more specifically, by considering the rank of the target item at a specific turn of the conversation. Following existing CRS work [6, 10, 11, 16, 44], we set the maximum number of turns to be 10.

In this regard, for our proposed single-turn predictors in Table 1, we use three different ground truth settings: the rank of the target item at the end of the conversation (turn 10); the rank of the target item during the conversation, i.e. at a given turn k; and the rank of the target item directly after the prediction is made (i.e. k+1 for a prediction at turn k). Through these different ground truth settings, we can measure CPP accuracy at both short-term and long-term horizons, as well as their longevity.

Finally, for quantifying the correlations, we report Spearman's ρ. Significance testing is achieved by examining the p-value associated with ρ, which indicates the probability of an uncorrelated ranking producing a Spearman correlation as high as that observed.¹

¹ See also https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

6. Results

In this section we report experiments for score-based CPP predictors, for single-turn (Section 6.1) and consecutive-turn (Section 6.2) scenarios.

Table 2
Results of single-turn predictors for short- and long-term prediction of the rank of target items at various turns. * denotes significant correlations; for Shoes, all correlations are significant, so * is omitted (p < 0.05). In the first group of columns, bold values denote the maximum correlation over all turns for the same predictor and the same ground truth value. For the other two sets of columns, bold values denote the highest performing predictor of the three examined single-turn predictors in the given evaluation setting for each turn - this is because comparison of correlation values across turns (rows) is not possible, since the ground truth changes for each row.

         Prediction at turn k             Prediction at turn 2             Prediction at turn k
         with rank@turn10                 with rank@turn k                 with rank@turn k+1
  k      top-1@k  mean@k   sd@k     k     top-1@k  mean@k   sd@k     k,k+1  top-1@k  mean@k   sd@k
Shoes
  2      -0.144   -0.141   -0.081   2     -0.405   -0.385   -0.059   2,3    -0.423   -0.413   -0.201
  3      -0.145   -0.145   -0.097   3     -0.423   -0.413   -0.201   3,4    -0.356   -0.355   -0.254
  4      -0.148   -0.148   -0.105   4     -0.357   -0.349   -0.183   4,5    -0.318   -0.317   -0.211
  5      -0.155   -0.153   -0.089   5     -0.314   -0.309   -0.177   5,6    -0.293   -0.292   -0.180
  6      -0.165   -0.165   -0.093   6     -0.270   -0.267   -0.163   6,7    -0.254   -0.254   -0.135
  7      -0.173   -0.173   -0.100   7     -0.230   -0.226   -0.140   7,8    -0.235   -0.234   -0.126
  8      -0.178   -0.177   -0.073   8     -0.213   -0.210   -0.136   8,9    -0.208   -0.207   -0.067
  9      -0.184   -0.183   -0.064   9     -0.175   -0.173   -0.1149  9,10   -0.183   -0.183   -0.064
  10     -0.183   -0.181   -0.026   10    -0.144   -0.141   -0.081
Dresses
  2      0.012    0.003    -0.036   2     -0.281*  -0.279*  -0.161*  2,3    -0.248*  -0.256*  -0.197*
  3      -0.017   -0.015   -0.004   3     -0.248*  -0.256*  -0.197*  3,4    -0.262*  -0.257*  -0.075*
  4      -0.045*  -0.047*  -0.014   4     -0.187*  -0.198*  -0.173*  4,5    -0.246*  -0.239*  -0.038
  5      -0.055*  -0.051*  -0.007   5     -0.128*  -0.140*  -0.137*  5,6    -0.206*  -0.198*  -0.008
  6      -0.063*  -0.063*  -0.041*  6     -0.079*  -0.092*  -0.102*  6,7    -0.172*  -0.168*  -0.034
  7      -0.069*  -0.072*  -0.033   7     -0.052*  -0.067*  -0.091*  7,8    -0.139*  -0.142*  -0.044*
  8      -0.075*  -0.076*  -0.021   8     -0.039   -0.051*  -0.072*  8,9    -0.103*  -0.101*  -0.000
  9      -0.073*  -0.071*  -0.018   9     -0.005   -0.018   -0.053*  9,10   -0.073*  -0.071*  -0.018
  10     -0.080*  -0.078*  0.003    10    0.0127   0.003    -0.036
Shirts
  2      -0.092*  -0.089*  -0.074*  2     -0.305*  -0.298*  -0.141*  2,3    -0.297*  -0.305*  -0.201*
  3      -0.124*  -0.119*  -0.033   3     -0.297*  -0.305*  -0.201*  3,4    -0.336*  -0.326*  -0.03*
  4      -0.145*  -0.137*  0.011    4     -0.264*  -0.273*  -0.192*  4,5    -0.323*  -0.308*  0.019
  5      -0.148*  -0.142*  -0.016   5     -0.228*  -0.231*  -0.157*  5,6    -0.305*  -0.293*  0.018
  6      -0.139*  -0.134*  -0.003   6     -0.198*  -0.206*  -0.155*  6,7    -0.248*  -0.238*  0.026
  7      -0.152*  -0.150*  -0.003   7     -0.166*  -0.168*  -0.122*  7,8    -0.203*  -0.196*  0.017
  8      -0.160*  -0.153*  0.031    8     -0.1346* -0.135*  -0.096*  8,9    -0.192*  -0.184*  0.049*
  9      -0.149*  -0.142*  0.003    9     -0.120*  -0.118*  -0.089*  9,10   -0.149*  -0.142*  0.003
  10     -0.147*  -0.138*  0.053*   10    -0.092*  -0.089*  -0.074*

6.1. RQ1 - Single-Turn Predictors

Table 2 shows the results for the three single-turn predictors, namely: the score of the top-ranked item at a given turn k (denoted top-1@k); the mean value of all top-ranked items in the recommendation list at a given turn (mean@k); and the standard deviation of the scores of all top-ranked items (sd@k).

The table is grouped into three sets of columns defining the prediction turn and the ground truth turn. Specifically, Prediction at turn k with rank@turn10 addresses long-term prediction; the middle group, Prediction at turn 2 with rank@turn k, addresses whether prediction at an early turn can help identify success at early or late turns; finally, the third group, Prediction at turn k with rank@turn k+1, addresses short-term prediction.

We first examine the first group of columns, which aims to determine the extent to which the overall conversation can be successfully predicted (i.e. the ground truth is the rank of the target item at turn 10). Overall, the correlations² are weak (-0.184 is the strongest observed for Shoes, and -0.160 for Shirts; Dresses is lower still at -0.080), yet significant (p < 0.05). This suggests the difficulty of the long-term prediction task. We do observe that correlations are relatively higher as the prediction turn increases - thus indicating that it is easier to predict performance at turn 10 using evidence of the ranking at turn 10. Finally, among the predictors, the maximum score at each turn, along with the mean score, exhibits higher correlations than the standard deviation. To answer RQ1(a), we cannot sufficiently predict long-term conversation performance using single-turn score-based predictors.

² In our analysis, we ignore the sign of the correlation - indeed, the observed correlations are negative, as our CRS system uses representation distances rather than similarities.

Turning next to the second group of columns, we observe stronger correlations. Indeed, the overall higher correlations suggest that predicting at turn 2 gives more accurate predictions, particularly when aiming to predict conversation performance at turn 2 or shortly thereafter. In particular, for the Shoes dataset, medium strength correlations of -0.423 are observed - these are in line with the best accuracy of some QPP predictors for adhoc search tasks [12, 25, 30, 24]. Correlations of -0.305 and -0.281 are observed for Shirts and Dresses, respectively.
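To make the evaluation methodology behind Table 2 concrete, the single-turn predictors and the conversation-level Spearman correlation can be sketched as follows. This is a minimal illustrative sketch rather than the authors' implementation: the function names and toy inputs are our own, and we assume that each conversation supplies the retrieval scores of its top-n recommended items at the prediction turn, together with the rank of the target item at the chosen ground-truth turn.

```python
import numpy as np
from scipy.stats import spearmanr

def single_turn_predictors(scores):
    """Score-based single-turn CPP predictors over one turn's retrieval scores."""
    scores = np.asarray(scores, dtype=float)
    return {
        "top-1": scores.max(),     # score of the top-ranked item (top-1@k)
        "mean": scores.mean(),     # mean score of the top-n items (mean@k)
        "sd": scores.std(ddof=1),  # standard deviation of the top-n scores (sd@k)
    }

def cpp_correlation(per_conv_scores, target_ranks, predictor="top-1"):
    """Spearman's rho between predictor values (one per conversation, computed
    at the prediction turn) and the ground-truth rank of the target item at the
    chosen horizon (e.g. turn 10 for long-term, turn k+1 for short-term)."""
    preds = [single_turn_predictors(s)[predictor] for s in per_conv_scores]
    rho, p_value = spearmanr(preds, target_ranks)
    return rho, p_value
```

Note that, because the CRS studied here scores items by embedding distance, the observed correlations are negative, and only the magnitude of ρ is interpreted.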
Among the predictors, top-1@k is again most successful Figure 2: Results of the difference in the top-1 ranked item on Shoes, but on Dresses and Shirts, where correlations (maximum score) between pairs of consecutive turns as a con- are lower, the overall picture is less clear across different secutive turn CPP predictor for each of the datasets. prediction horizons (i.e. as the ground truth π‘˜ is varied). For these datasets, mean is the most accurate for most values of π‘˜ β‰₯ 2. In general, when predicting conver- turn. In RQ2 below, we focus on short-term (next turn sation performance using single-turn retrieval scores, prediction), as the most promising CPP setting. prediction becomes less accurate as the longevity of the prediction increases, thus answering RQ1(c). 6.2. RQ2 - Consecutive-Turn predictors Finally, the last set of columns of the table shows the correlation of the scores of each turn π‘˜ (as a predictor) Figure 2 presents the results of our first consecutive-turn when the effectiveness of the following turn π‘˜ + 1 is used predictor, namely the difference in maximum score (top- as the ground truth (i.e. applying a short-term horizon). 1 item) for each pair of turns π‘˜, π‘˜ + 1 when predicting The scores of both the top-ranked item and the average the rank of the target item at turn π‘˜ + 1. Within the score of the top-ranked items at turn π‘˜ sufficiently pre- figure, each dataset is represented as a separate curve. dict the rank of turn π‘˜ +1, especially for early turns. This Considering the different datasets, for Shirts and Dresses, trend weakens as the number of turns increases, but the we observe a similar trend across turns, starting from observed correlations remain quite high for some cases. 
Figure 2: Results of the difference in the top-1 ranked item (maximum score) between pairs of consecutive turns as a consecutive-turn CPP predictor for each of the datasets.

6.2. RQ2 - Consecutive-Turn predictors

Figure 2 presents the results of our first consecutive-turn predictor, namely the difference in maximum score (top-1 item) for each pair of turns k, k+1 when predicting the rank of the target item at turn k+1. Within the figure, each dataset is represented as a separate curve. Considering the different datasets, for Shirts and Dresses, we observe a similar trend across turns, starting from a correlation of -0.18 (the maximum value obtained for this predictor) at turns 2-3, which gradually decreases as the number of turns increases. In contrast, Shoes does not achieve any correlation stronger than -0.016 at turns 3-4. Therefore, we observe only weak correlations for this predictor at short-term prediction, although some correlations are significant. To answer RQ2(a), using the scores of two consecutive turns does not sufficiently predict conversation performance, and is indeed generally less effective than the predictors examined in RQ1.

Next, we test our final predictor, which considers the overlap of top-ranked items (i.e., the size of the intersection) between consecutive turns. We considered various rank cutoff values for calculating the overlap, ranging from rank 5 to rank 1000, and all pairs of turns. Figure 3 reports the observed correlations (y-axis), where each pair of turns is a curve, and the x-axis is the rank cutoff at which the overlap is calculated. Recall that we expect that when the retrieved items are generally similar, this may be indicative that the CRS is reaching a stable conclusion about the likely relevant items. If this occurs at a later turn, we may be further confident in the likely positive performance of the system.

Figure 3: For each dataset ((a) Shoes, (b) Dresses, (c) Shirts), results for the overlap of top-ranked items as a consecutive-turn predictor for all pairs of turns k, k+1 for a number of rank cutoff values.

On analysing Figure 3, for Dresses & Shirts (Figure 3(b) & (c), respectively) - which are both FashionIQ datasets - we observe a strengthening trend in the correlations as we increase the rank cutoff value (more items are considered). This happens for all pairs of turns except the initial turn. In addition, the correlations are stronger for later turns than earlier turns, indicating that this predictor is more useful for later turns (as expected). Indeed, improved prediction at later turns is particularly notable, as this contrasts with our results in RQ1, where earlier prediction was more accurate.

On the other hand, for the Shoes dataset, the highest correlations are observed for turns 3-4 and 4-5, and for cutoff values of 50 and 100. The correlations for item overlap in Shoes are weaker than for the other two datasets, contrasting with the observations in RQ1 (where Shoes exhibited higher correlations for the single-turn predictors than Dresses or Shirts). We note that, as a CRS dataset, Shoes is "easier" than Dresses (e.g. the GRU model can attain a Mean Reciprocal Rank of 0.2 at turn 10 on Shoes, compared to 0.075 at turn 10 on Dresses [10]). We postulate that early single-turn prediction works well on Shoes, as more conversations are answered at earlier turns; in contrast, on Dresses, more critiques are required for successful conversations, and the overlap-based evidence later in the conversation is therefore more useful for prediction.

Overall, these results suggest some weak-medium correlations (up to -0.25 ρ) for the overlap-based consecutive-turn predictor, thereby answering RQ2(b).
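Both consecutive-turn predictors reduce to one-line computations per pair of turns. The following is a minimal sketch under our own naming assumptions (`top1_score_difference`, `topk_overlap` are illustrative names, not our experimental code):

```python
def top1_score_difference(scores_k, scores_k1):
    """Consecutive-turn CPP predictor: change in the maximum
    retrieval score between turn k and turn k+1."""
    return max(scores_k1) - max(scores_k)

def topk_overlap(items_k, items_k1, cutoff=50):
    """Consecutive-turn CPP predictor: size of the intersection of
    the top-`cutoff` item ids retrieved at turns k and k+1."""
    return len(set(items_k[:cutoff]) & set(items_k1[:cutoff]))
```

Per conversation, either function yields one value for a chosen pair of turns (k, k+1); correlating these values across conversations against the rank of the target item at turn k+1 produces the curves shown in Figures 2 and 3.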
7. Conclusions

We have presented a novel framework for conversational performance prediction (CPP) that aims to detect the factors that indicate effective performance by taking into account the multi-turn aspect of the task of conversational interactive image retrieval. In this regard, we proposed a number of predictors that can be used for both short-term and long-term prediction, and explored the retrieval scores and retrieved items of both a single turn and consecutive turns. We conducted our analyses on three widely-used relative captioning datasets for conversational recommendation systems (CRS) and examined the extent to which our proposed predictors are indicative of the ranking of the users' target items in the recommendation list.

In our analysis of the proposed single-turn predictors, we found that examining the score of the top-ranked items had a medium correlation with the effectiveness of the conversation, particularly the effectiveness at early turns. Indeed, we observed a Spearman's ρ of 0.423 on the Shoes dataset, which is comparable to correlations observed for standard QPP predictors on adhoc search tasks [12, 24, 25, 30]. However, these single-turn predictors became less useful at predicting the success of later turns. On the other hand, among our consecutive-turn predictors, simply examining the overlap of the retrieved lists had a weak-medium correlation with late-turn effectiveness on two out of our three datasets.

Overall, the weak-medium correlations observed for our simple unsupervised predictors of different families suggest that there is significant scope to extend this work, for instance by introducing supervised predictors. Moreover, our proposed framework for CPP is generalisable; for instance, we can also envisage predictors that examine aspects of the critiques (for instance, repeated critiques), or characteristics of the retrieved images (e.g. whether item colours or styles are varied). We leave these for future work. Furthermore, we also aim to extend our analyses to a classification task that aims to predict whether a conversation will fail, as well as testing the efficacy of interventions for failing conversations.

Finally, this study takes place in the context of user simulators for the evaluation of CRS; such user simulators are common in the training and evaluation of conversational systems. Logging the interactions of a deployed CRS would allow us to verify the results reported here.

Acknowledgments

Maria Vlachou's work was supported by the UKRI Centre for Doctoral Training in Socially Intelligent Artificial Agents, Grant number EP/S02266X/1.

References

[1] T. M. Brill, L. Munoz, R. J. Miller, Siri, Alexa, and other digital assistants: a study of customer satisfaction with artificial intelligence applications, Journal of Marketing Management 35 (2019) 1401–1436.
[2] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, ACM Computing Surveys 54 (2021) 1–36.
[3] F. N. Tou, M. D. Williams, R. Fikes, D. A. Henderson Jr, T. W. Malone, Rabbit: An intelligent database assistant, in: AAAI, 1982, pp. 314–318.
[4] L. Chen, P. Pu, Critiquing-based recommenders: survey and emerging trends, User Modeling and User-Adapted Interaction 22 (2012) 125–150.
[5] T. L. Berg, A. C. Berg, J. Shih, Automatic attribute discovery and characterization from noisy web data, in: Proc. ECCV, 2010, pp. 663–676.
[6] X. Guo, H. Wu, Y. Cheng, S. Rennie, G. Tesauro, R. Feris, Dialog-based interactive image retrieval, in: Proc. NeurIPS, 2018, pp. 678–688.
[7] V. S. Bursztyn, J. Healey, E. Koh, N. Lipka, L. Birnbaum, Developing a conversational recommendation system for navigating limited options, in: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–6.
[8] Y. Jin, W. Cai, L. Chen, N. N. Htun, K. Verbert, MusicBot: Evaluating critiquing-based music recommenders with conversational interaction, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 951–960.
[9] W. Cai, Y. Jin, L. Chen, Critiquing for music exploration in conversational recommender systems, in: 26th International Conference on Intelligent User Interfaces, 2021, pp. 480–490.
[10] Y. Wu, C. Macdonald, I. Ounis, Partially observable reinforcement learning for dialog-based interactive recommendation, in: Proc. RecSys, 2021, pp. 241–251.
[11] T. Yu, Y. Shen, H. Jin, A visual dialog augmented interactive recommender system, in: Proc. KDD, 2019, pp. 157–165.
[12] S. Cronen-Townsend, Y. Zhou, W. B. Croft, Predicting query performance, in: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, 2002, pp. 299–306.
[13] B. He, I. Ounis, Inferring query performance using pre-retrieval predictors, in: International symposium on string processing and information retrieval, Springer, 2004, pp. 43–54.
[14] C. Hauff, D. Hiemstra, F. de Jong, A survey of pre-retrieval query performance predictors, in: Proceedings of the 17th ACM conference on Information and knowledge management, 2008, pp. 1419–1420.
[15] D. Carmel, E. Yom-Tov, Estimating the query difficulty for information retrieval, Synthesis Lectures on Information Concepts, Retrieval, and Services 2 (2010) 1–89.
[16] H. Wu, Y. Gao, X. Guo, Z. Al-Halah, S. Rennie, K. Grauman, R. Feris, Fashion IQ: A new dataset towards retrieving images by natural language feedback, 2020. arXiv:1905.12794.
[17] J. Peng, C. Macdonald, B. He, I. Ounis, A study of selective collection enrichment for enterprise search, in: Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, 2009, pp. 1999–2002.
[18] S. Cronen-Townsend, Y. Zhou, W. B. Croft, A framework for selective query expansion, in: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, CIKM '04, 2004, pp. 236–237.
[19] C. Macdonald, R. L. Santos, I. Ounis, On the usefulness of query features for learning to rank, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, 2012, pp. 2559–2562.
[20] Y. Zhao, F. Scholer, Y. Tsegay, Effective pre-retrieval query performance prediction using similarity and variability evidence, in: European conference on information retrieval, Springer, 2008, pp. 52–64.
[21] F. Scholer, S. Garcia, A case for improved evaluation of query difficulty prediction, in: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, 2009, pp. 640–641.
[22] J. Mothe, L. Tanguy, Linguistic features to predict query difficulty, in: ACM Conference on Research and Development in Information Retrieval (SIGIR), Predicting Query Difficulty: Methods and Applications workshop, 2005, pp. 7–10.
[23] B. He, I. Ounis, Query performance prediction, Information Systems 31 (2006) 585–594.
[24] Y. Zhou, W. B. Croft, Query performance prediction in web search environments, in: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 2007, pp. 543–550.
[25] A. Shtok, O. Kurland, D. Carmel, Predicting query performance by query-drift estimation, in: Conference on the Theory of Information Retrieval, Springer, 2009, pp. 305–312.
[26] M. Mitra, A. Singhal, C. Buckley, Improving automatic query expansion, in: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 1998, pp. 206–214.
[27] R. Cummins, Document score distribution models for query performance inference and prediction, ACM Transactions on Information Systems (TOIS) 32 (2014) 1–28.
[28] F. Diaz, Performance prediction using spatial autocorrelation, in: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 2007, pp. 583–590.
[29] A. Shtok, O. Kurland, D. Carmel, Query performance prediction using reference lists, ACM Transactions on Information Systems (TOIS) 34 (2016) 1–34.
[30] A. Shtok, O. Kurland, D. Carmel, Using statistical decision theory and relevance models for query-performance prediction, in: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, 2010, pp. 259–266.
[31] J. Lafferty, C. Zhai, Document language models, query models, and risk minimization for information retrieval, in: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, 2001, pp. 111–119.
[32] V. Lavrenko, W. B. Croft, Relevance-based language models, in: ACM SIGIR Forum, volume 51, ACM New York, NY, USA, 2017, pp. 260–267.
[33] W. Webber, A. Moffat, J. Zobel, A similarity measure for indefinite rankings, ACM Transactions on Information Systems (TOIS) 28 (2010) 1–38.
[34] O. Kurland, A. Shtok, S. Hummel, F. Raiber, D. Carmel, O. Rom, Back to the roots: A probabilistic framework for query-performance prediction, in: Proceedings of the 21st ACM international conference on Information and knowledge management, 2012, pp. 823–832.
[35] J. Kang, K. Condiff, S. Chang, J. A. Konstan, L. Terveen, F. M. Harper, Understanding how people use natural language to ask for recommendations, in: Proc. RecSys, 2017, pp. 229–237.
[36] I. Sekulić, M. Aliannejadi, F. Crestani, Exploiting document-based features for clarification in conversational search, in: European Conference on Information Retrieval, Springer, 2022, pp. 413–427.
[37] M. Aliannejadi, H. Zamani, F. Crestani, W. B. Croft, Asking clarifying questions in open-domain information-seeking conversations, in: Proceedings of the 42nd international ACM SIGIR conference on Research and development in information retrieval, 2019, pp. 475–484.
[38] J. Kiesel, A. Bahrami, B. Stein, A. Anand, M. Hagen, Toward voice query clarification, in: The 41st international ACM SIGIR conference on research & development in information retrieval, 2018, pp. 1257–1260.
[39] H. Zamani, S. Dumais, N. Craswell, P. Bennett, G. Lueck, Generating clarifying questions for information retrieval, in: Proceedings of the Web Conference 2020, 2020, pp. 418–428.
[40] H. Roitman, S. Erera, G. Feigenblat, A study of query performance prediction for answer quality determination, in: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, 2019, pp. 43–46.
[41] W. Shi, K. Qian, X. Wang, Z. Yu, How to build user simulators to train RL-based dialog systems, arXiv preprint arXiv:1909.01388 (2019).
[42] X. Li, Z. C. Lipton, B. Dhingra, L. Li, J. Gao, Y.-N. Chen, A user simulator for task-completion dialogues, arXiv preprint arXiv:1612.05688 (2016).
[43] N. Tintarev, J. Masthoff, A survey of explanations in recommender systems, in: Proc. IEEE data engineering workshop, IEEE, 2007, pp. 801–810.
[44] Y. Wu, C. Macdonald, I. Ounis, Partially observable reinforcement learning for dialog-based interactive recommendation, in: Proceedings of ACM RecSys, 2021.
[45] Y. Wu, C. Macdonald, I. Ounis, Multimodal conversational fashion recommendation with positive and negative natural-language feedback, in: Proceedings of ACM Conversational User Interfaces, 2022.
[46] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, Session-based recommendations with recurrent neural networks, arXiv preprint arXiv:1511.06939 (2015).