Multi-Aspect Reviewed-Item Retrieval via LLM Query Decomposition and Aspect Fusion

Anton Korikov¹,∗,†, George Saad¹,†, Ethan Baron¹, Mustafa Khan¹, Manav Shah¹ and Scott Sanner¹
¹ University of Toronto, Toronto, Canada

SIGIR'24 Workshop on Information Retrieval's Role in RAG Systems, July 18, 2024, Washington D.C., USA.
∗ Corresponding author. † These authors contributed equally.
anton.korikov@mail.utoronto.ca (A. Korikov); g.saad@mail.utoronto.ca (G. Saad); mr.khan@mail.utoronto.ca (M. Khan)
ORCID: 0009-0003-4487-9504 (A. Korikov); 0009-0000-3549-9874 (G. Saad); 0009-0004-2461-5760 (E. Baron); 0009-0008-3622-7270 (M. Khan); 0009-0008-4728-0771 (M. Shah); 0000-0001-7984-8394 (S. Sanner)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
While user-generated product reviews often contain large quantities of information, their utility in addressing natural language product queries has been limited, with a key challenge being the need to aggregate information from multiple low-level sources (reviews) to a higher item level during retrieval. Existing methods for reviewed-item retrieval (RIR) typically take a late fusion (LF) approach which computes query-item scores by simply averaging the top-K query-review similarity scores for an item. However, we demonstrate that for multi-aspect queries and multi-aspect items, LF is highly sensitive to the distribution of aspects covered by reviews in terms of aspect frequency and the degree of aspect separation across reviews. To address these LF failures, we propose several novel aspect fusion (AF) strategies which include Large Language Model (LLM) query extraction and generative reranking. Our experiments show that for imbalanced review corpora, AF can improve over LF by a MAP@10 increase from 0.36 ± 0.04 to 0.52 ± 0.04, while achieving equivalent performance for balanced review corpora.

Keywords: Dense retrieval, query decomposition, multi-aspect retrieval, LLM reranking, late fusion

1. Introduction

User-generated reviews are an abundant and rich source of data that has the potential to be used to improve the retrieval of reviewed items such as products, services, or destinations. However, a challenge of using review data for retrieval is that information has to be aggregated across multiple (low-level) reviews to a (higher) item level during retrieval. Recent work [1], defining this Reviewed-Item Retrieval setting as RIR, showed that state-of-the-art results could be achieved by using a bi-encoder to aggregate review information to an item level in a process called late fusion (LF). As opposed to aggregating review information to an item level before query-scoring (early fusion), LF first computes query-review similarity to avoid losing information before scoring, and then averages the top-K query-review similarity scores to get a query-item similarity score. Recently, LF has been implemented by retrieval augmented generation (RAG) driven conversational recommendation (ConvRec) systems for generative recommendation, explanation, and interactive question answering [2].

In this paper, we extend RIR to a multi-aspect retrieval setting, formulating what we call multi-aspect RIR (MA-RIR). In this problem, our goal is to retrieve relevant items for a multi-aspect query by using the reviews of multi-aspect items. Specifically, for an item with multiple aspects, we assume that each review describes at least one, and up to all, of the item's aspects.

As our primary contributions:

• We formulate the MA-RIR problem and identify failure modes of LF under imbalanced review-aspect distributions, considering imbalances due to both aspect frequency and the degree of aspect separation across reviews.
• We propose several novel aspect fusion strategies, which include LLM query extraction and reranking, to address failures of LF review-score aggregation on imbalanced multi-aspect review distributions.
• We leverage a recently released multi-aspect retrieval dataset, Recipe-MPR [3], with ground-truth query- and item-aspect labels to generate four multi-aspect review distributions with various aspect balance properties, and numerically evaluate the effect of review-aspect balance on MA-RIR.
• Our simulations show that for imbalanced data, Aspect Fusion can improve over LF by a MAP@10 increase from 0.36 ± 0.04 to 0.52 ± 0.04, while achieving equivalent performance for balanced data.
• We show that LLM reranking in both cross-encoder and zero-shot (ZS) listwise reranking settings can provide some improvements when given a large enough number of reviews, but risks decreasing performance when not enough reviews are provided.

2. Background

2.1. Neural IR

Given a set of documents 𝒟 and a query q ∈ 𝒬, an IR task IR⟨𝒟, q⟩ is to assign a similarity score S_{q,d} ∈ ℝ between the query and each document d ∈ 𝒟 and return a ranked list of top-scoring documents. The standard first-stage neural IR method [4] for a large corpus is to first use a bi-encoder g(·): 𝒬 ∪ 𝒟 → ℝ^m to map a query q and document d to their respective embeddings g(q) = z_q and g(d) = z_d. A similarity function f(·,·): ℝ^m × ℝ^m → ℝ, such as the dot product, is then used to compute a query-document score S_{q,d} = f(z_q, z_d). For web-scale corpora, exact similarity search for the top query-document scores is typically impractical, so approximate similarity search algorithms [5] are used instead.
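The first-stage scoring pipeline just described can be sketched in a few lines. This is a minimal NumPy illustration with toy 2-D vectors standing in for the bi-encoder embeddings g(q) and g(d) (a real system would use a trained encoder such as TAS-B and, at scale, approximate nearest-neighbor search rather than exhaustive scoring):

```python
import numpy as np

def rank_documents(z_q: np.ndarray, Z_d: np.ndarray, k: int) -> list:
    """Return indices of the top-k documents by dot-product similarity.

    z_q : (m,) query embedding; Z_d : (n, m) document embeddings.
    """
    scores = Z_d @ z_q                       # S_{q,d} = f(z_q, z_d), dot product
    return np.argsort(-scores)[:k].tolist()  # indices sorted by descending score

# Toy embeddings standing in for bi-encoder outputs.
z_q = np.array([1.0, 0.0])
Z_d = np.array([[0.9, 0.1],   # doc 0: close to the query
                [0.0, 1.0],   # doc 1: orthogonal to the query
                [0.5, 0.5]])  # doc 2: in between
print(rank_documents(z_q, Z_d, k=2))  # [0, 2]
```

The same scoring function is reused below for query-review and aspect-review similarities; only the choice of what is embedded changes.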
2.2. Reviewed-Item Retrieval

2.2.1. Problem Formulation

Information retrieval across two-level data structures was previously studied by Zhang and Balog [6]. Specifically, Zhang and Balog define the Object Retrieval problem, where (high-level) objects are described by multiple (low-level) documents. Given a query, the task is to retrieve high-level objects by using information in the low-level documents.

To investigate a special case of object retrieval where the goal is retrieving items (e.g., products, destinations) based on their reviews, Abdollah Pour et al. [1] recently proposed the Reviewed-Item Retrieval (RIR) problem. In the RIR⟨ℐ, 𝒟, q⟩ problem, there is a set of items ℐ, where every item i is a high-level object. Each item is described by a set of reviews (i.e., "low-level documents") 𝒟_i ⊂ 𝒟, and the r'th review of item i is d_{i,r} ∈ 𝒟_i. The main difference between RIR and Object Retrieval is that in RIR a low-level document d_{i,r} cannot describe more than one high-level object i, while Object Retrieval allows for more general two-level structures. Given a query q ∈ 𝒬 and a score S_{q,i} between q and each item i, the goal of RIR is to retrieve a ranked list L_q of top-K_I scoring items:

  L_q = (i_1, …, i_{K_I})  s.t.  i_1 ∈ arg max_i {S_{q,i}},   S_{q,i_k} ≥ S_{q,i_{k+1}}  ∀ i_k ∈ L_q.

2.2.2. Fusion

To get a query-item score S_{q,i} using an item's review set 𝒟_i, review information needs to be aggregated to an item level: this process is called fusion. Two alternatives exist for fusion [6]: if low-level information is aggregated before a query is used for scoring, it is called Early Fusion (EF) — in contrast, if the aggregation occurs after query-scoring, it is called Late Fusion (LF). For EF in RIR, Abdollah Pour et al. [1] experiment with mean-pooling and contrastive learning methods to create an item embedding z_i ∈ ℝ^m from review embeddings {z_d}_{d ∈ 𝒟_i}. They then directly compute the similarity between z_i and a query embedding z_q as the query-item score S_{q,i} = f(z_q, z_i). For LF in RIR, these authors first compute query-review similarity scores S_{q,d_{i,r}} = f(z_q, z_{d_{i,r}}). They then aggregate these scores into a query-item score S_{q,i} by averaging the top K_R query-review scores for each item:

  S_{q,i} = (1/K_R) ∑_{r=1}^{K_R} S_{q,d_{i,r}}.    (1)

Numerical evaluations performed for EF and LF for RIR demonstrate that EF has significantly worse performance than LF [1], and Abdollah Pour et al. conjecture that EF performs worse because it loses fine-grained review information before query-scoring. In contrast, by delaying fusion, LF preserves review-level information during query-scoring. Due to these findings, we do not study EF for MA-RIR; rather, we focus on developing Aspect Fusion as an extension of LF, discussed next.

3. Multi-Aspect Reviewed-Item Retrieval

3.1. Multi-Aspect Queries

This paper focuses on retrieving relevant items using their reviews for a multi-aspect query, such as "Can I have a meatball recipe that doesn't take too long?". We define a query aspect to be a sub-span of a multi-aspect query that represents a distinct topic (or facet) in the query, for instance the sub-spans "meatball" and "doesn't take too long" in the previous sentence. While there is ambiguity in identifying which sub-spans, if any, in a query should be considered aspects, this sub-span based definition is a simple way to represent aspects and is conducive to overlap-based evaluations of aspect extraction such as intersection-over-union (IOU). Formally, we denote the set of aspects in query q as 𝒜_q^query, where the j'th query aspect is a_{q,j}^query ∈ 𝒜_q^query. In this work, multi-aspect queries are assumed to be logical AND queries over all aspects, though an aspect itself can represent other logical operators such as XOR (e.g., a query aspect may be "chicken or beef"). Finally, we assume all query aspects are equally important — a further discussion of weighted multi-aspect retrieval can be found in Section 7.

3.2. Multi-Aspect Reviewed-Items

In addition to considering multi-aspect queries, we also consider multi-aspect items described by reviews. For instance, a multi-aspect item that is relevant to the multi-aspect query example above might be a recipe titled "Beef meatballs cooked in canned soup, ready in 25 minutes". However, since our goal is to isolate the properties of review-based retrieval, we assume that no such natural language (NL) item-level description is available. Instead, we assume that the item's aspects are described in reviews. Obviously, item-level descriptions (e.g., titles) are often available in practice, so a prime direction for future work is fusion across multiple levels of NL data during reviewed-item retrieval.

Figure 1: Two extremes of item aspect distributions, showing reviews for an item with aspects "meatballs" and "ready in 25 minutes": a) Fully overlapping (top) — each review mentions all item aspects. b) Fully disjoint with imbalanced aspect frequency (bottom) — no review mentions more than one aspect, and some aspects are mentioned much more frequently than others.

Examples of reviews describing the item in the previous paragraph, which has aspects "meatballs" and "ready in 25 minutes", are shown in Figure 1. In this paper, we assume that a review d_{i,r} must mention at least one item aspect a_{i,j}^item ∈ 𝒜_i^item and could mention up to all item aspects. Formally, the distribution of item aspects across reviews can be defined with a bipartite aspect distribution graph 𝒢 = {𝒟, 𝒜^item, ℰ}, where an edge (d_{i,r}, a_{i,j}^item) ∈ ℰ exists if review d_{i,r} ∈ 𝒟_i mentions aspect a_{i,j}^item ∈ 𝒜_i^item. We also let 𝒜_i^{rel,q} ⊆ 𝒜_i^item represent the set of item aspects that are relevant to a query and should be considered during retrieval. We define the MA-RIR⟨𝒜, ℰ, 𝒟, q⟩ problem as the task of retrieving a ranked list of relevant multi-aspect items L_q for a multi-aspect query q, where 𝒜 = 𝒜^item ∪ 𝒜^query.

3.3. Multi-Aspect Review Distributions

As we will demonstrate with numerical simulations on LLM-generated review data, understanding review distributions in terms of aspect frequency and degree of aspect separation between reviews is key to designing successful MA-RIR techniques. Figure 1 shows two extremes of aspect distributions that are among the distributions we explore in our experiments.

3.3.1. Fully Overlapping Distributions

Figure 1a) shows a fully overlapping aspect distribution where each review mentions all aspects — in this case, the bipartite graph 𝒢 (see the RHS of Figure 1) is fully connected for item i_1. This is the most balanced review-aspect distribution possible for an item, and, because of this "perfect" aspect balance, we postulate that aspect-agnostic retrieval approaches such as standard LF will perform competitively on such distributions.

3.3.2. Degree of Separation and Aspect Frequency

In contrast to the case of perfect review-aspect balance, Figure 1b) shows an extreme case of aspect imbalance. Firstly, one aspect is mentioned much more frequently than another — this is an aspect frequency imbalance. Secondly, each review mentions only one aspect — this is a maximal degree of separation of aspects across reviews (fully disjoint). Mathematically, 𝒢 has |𝒜_{i_1}^item| (disjoint) star components where some stars have a significantly higher degree than others. In the next section, we discuss the negative effects of imbalanced review-aspect distributions on LF performance on MA-RIR, and propose aspect fusion as a method for mitigating these negative effects.

4. Aspect Fusion for MA-RIR

4.1. Desiderata of Aspect Fusion

Recall that LF computes a query-item similarity score by averaging the top K_R query-review similarity scores using Equation (1). For MA-RIR, we propose two desiderata for the aspect distribution in the top K_R reviews during fusion.

Desideratum 1: Since we assume multi-aspect queries are AND queries, if an item contains 𝒜_i^{rel,q} relevant aspects for query q, the K_R reviews used for LF should mention all 𝒜_i^{rel,q} of those relevant aspects.

Desideratum 2: As mentioned in Section 3.1, we also assume all query aspects are equally important, which implies that aspect frequency should be identical for all 𝒜_i^{rel,q} aspects in the top K_R retrieved reviews.

In a fully overlapping distribution (Figure 1a) where each review mentions each aspect, both Desiderata 1 and 2 are guaranteed to be satisfied by any subset of item reviews. We thus argue that standard LF should be sufficient when reviews fully overlap in aspects, and focus on developing Aspect Fusion methods that address the failures of LF for imbalanced review-aspect distributions.

4.2. Failures of LF under Review-Aspect Imbalance

Standard LF will fail to achieve Desiderata 1 and 2 for review-aspect distributions with at least some degree of disjointedness and aspect frequency imbalance under the following assumptions.

Aspect Popularity Bias: Aspects that are reviewed more frequently are more likely to be mentioned in the top K_R reviews.

Embedding Bias: The non-isotropic nature of the embedding space [7] biases retrieval towards one aspect. Consider two equally sized and fully disjoint review subsets 𝒟_i^j ⊂ 𝒟_i and 𝒟_i^k ⊂ 𝒟_i in which reviews mention only a single aspect a_{i,j}^rel ∈ 𝒜_i^{rel,q} or a_{i,k}^rel ∈ 𝒜_i^{rel,q}, respectively, for some item i. If query-review similarity scores tend to be higher when a review describes aspect a_{i,j}^rel as opposed to aspect a_{i,k}^rel, LF will be more likely to select reviews from review set 𝒟_i^j for the top K_R fused reviews. For example, in Figure 1b), the reviews describing cooking time might be more likely to score higher with the full query than reviews describing "meatballs".

4.3. Aspect Fusion

To address these failures of LF on imbalanced data, we introduce several methods for Aspect Fusion, which explicitly utilize the multi-aspect nature of reviews during fusion to address multi-aspect queries.

4.3.1. Aspect Extraction

To extract aspects from queries, we propose to use few-shot (FS) prompting with an LLM. Though the number of query aspects is typically not known a priori, since we study multi-aspect queries, our proposed prompt (Figure 10 in the Appendix) asks that at least two non-overlapping sub-spans of the query be extracted as aspects. We represent the set of extracted query aspects for query q as 𝒜_q^ext and let A_q^e = |𝒜_q^ext|.

4.3.2. Aspect-Item Scoring

The key to Aspect Fusion is directly computing aspect-review similarity scores S_{a,d_{i,r}}, as opposed to similarity scores between reviews and a monolithic query, since the latter can be negatively impacted by review-aspect distribution imbalance. Aspect similarity scores are computed by separately embedding each extracted aspect a ∈ 𝒜_q^ext as z_a = g(a) and calculating S_{a,d_{i,r}} = f(z_a, z_{d_{i,r}}). Then, aspect-item scores S_{a,i} ∈ ℝ are obtained by aggregating the top K_R aspect-review scores via Eq. (1) with aspect-review scores instead of query-review scores. For each extracted aspect a, the top-K_I scoring items are ordered into a list

  L_a = (i_1, …, i_{K_I})  s.t.  i_1 ∈ arg max_i {S_{a,i}},   S_{a,i_k} ≥ S_{a,i_{k+1}}  ∀ i_k ∈ L_a.

Figure 2: a) Top: In (Monolithic) LF, the full query is scored against all reviews, and the top K_R query-review scores are averaged for each item to produce a query-item score. b) Bottom: Aspect Fusion extracts aspects (i.e., query sub-spans) from a query, performs LF with each aspect, and aggregates the resulting top-K_I item lists (i.e., one list per extracted aspect) into a final list.

Figure 2b) demonstrates aspect-item scoring and how it can alleviate the biases of standard LF. In this figure, the red and green points are embeddings of the reviews of item i describing aspect a_{i,1}^item and a_{i,2}^item, respectively — both these aspects are assumed to be relevant to the query. Though the former aspect is more frequent, an equal number (K_R) of reviews for each aspect will be used during score fusion — as long as the aspect review embeddings are similar enough to the relevant query aspect embedding, and the total number of reviews for an aspect is at least K_R. In contrast, Figure 2a) shows how standard (monolithic) LF will take a biased review sample of the first aspect since it is more frequently mentioned by reviews and z_q happens to be closer to those review embeddings. To differentiate between LF for RIR as proposed by Abdollah Pour et al. and Aspect Fusion, we will refer to LF as Monolithic LF since it uses the full query.

4.3.3. Aspect-Item Score Fusion

After aspect-item scoring, we must aggregate the A_q^e top-K_I item lists for each aspect {L_a}_{a ∈ 𝒜_q^ext} into a single ranked list of top-K_I items for the query, L_q. We examine six aggregation strategies, which can be categorized as four score aggregation methods and two rank aggregation methods. The score-based variants convert the A_q^e aspect-item scores into a query-item score S_{q,i} using

1. AMean: Arithmetic mean
2. GMean: Geometric mean
3. HMean: Harmonic mean
4. Min: Minimum

to return the final ranked list L_q. The two rank-based list aggregation methods are:

1. Borda: Borda count
2. R-R: Round-robin (interleaved) merge.

In Borda count, the score for a given item i is calculated as ∑_{j=1}^{A_q^e} (K_I − rank_i^{L_{a_j}} + 1), where rank_i^{L_{a_j}} is the rank of item i in list L_{a_j}. In a round-robin merge of A_q^e lists, elements from each list are merged in a cyclic order, and when a conflict arises with a particular item, that item is skipped and the merge continues from the same list.

4.4. LLM Reranking

In addition to Aspect Fusion, we also introduce an LLM reranking step for MA-RIR — to the best of our knowledge, LLM reranking has not been previously studied in a reviewed-item setting. Our goal is to understand whether LLMs in cross-encoder (CE) or ZS listwise [8] settings can fuse reviews of multi-aspect items for effective reranking. After a list L_q of top-K_I items is returned from the first stage, K_R reviews for each item need to be given to the LLM for what we call fusion-during-reranking. For Monolithic LF, these K_R reviews are simply the K_R reviews used for LF. For Aspect Fusion, since K_R reviews were used for fusion with each aspect, we propose to perform a round-robin merge of the top-K_R review lists for each aspect in order to preserve a balanced distribution of reviews across aspects.

For a CE, reviews are simply concatenated and cross-encoded with the query. For listwise reranking, our prompt provides the LLM with the query, the initial ranked list of item IDs, reviews for each item, and instructions to order the items based on relevance to the query — the full listwise reranking prompt is in Figure 11 in the Appendix.

5. Experimental Method

We perform simulations on generated review data to study the effect of aspect balance across reviews and test our hypothesis that Aspect Fusion is more robust to aspect imbalances than Monolithic LF. While using synthetic data exposes our results to biases from the data generation process, we are able to generate synthetic review distributions with far greater control than would have been possible several years ago, before the advent of LLMs. We specifically design experiments to study the performance of Aspect Fusion vs. Monolithic LF under the presence of aspect imbalance, both in the form of disjointedness of aspects across reviews and imbalanced aspect frequencies.

In order to perform our experimentation, we need a dataset that has (a) multi-aspect queries and items, (b) GT aspect labels, and (c) item reviews. To the best of our knowledge, there is no existing dataset with all of these properties. However, the recently-released Recipe-MPR dataset [3] includes properties (a) and (b). We leverage this dataset and generate item reviews using GPT-4.
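The per-aspect scoring and aggregation strategies of Sections 4.3.2–4.3.3 can be sketched as follows. This is a minimal illustration under toy similarity values, not the authors' implementation; `borda_fuse` is a hypothetical helper showing the Borda count over per-aspect item lists:

```python
import numpy as np

def late_fusion_score(sims, k_r: int) -> float:
    """Eq. (1): average the top-K_R similarity scores for one item."""
    return float(np.sort(np.asarray(sims))[::-1][:k_r].mean())

def aspect_fusion_score(aspect_sims, k_r: int, agg: str = "amean") -> float:
    """Fuse per-aspect LF scores S_{a,i} into one query-item score S_{q,i}."""
    s = np.array([late_fusion_score(a, k_r) for a in aspect_sims])
    if agg == "amean":   # arithmetic mean
        return float(s.mean())
    if agg == "gmean":   # geometric mean (assumes positive scores)
        return float(np.exp(np.log(s).mean()))
    if agg == "hmean":   # harmonic mean
        return float(len(s) / (1.0 / s).sum())
    if agg == "min":     # minimum
        return float(s.min())
    raise ValueError(f"unknown aggregation: {agg}")

def borda_fuse(ranked_lists, k_i: int) -> list:
    """Borda count: item i accumulates (K_I - rank + 1) over the aspect lists."""
    scores = {}
    for lst in ranked_lists:
        for rank, item in enumerate(lst, start=1):
            scores[item] = scores.get(item, 0) + (k_i - rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:k_i]

# Toy aspect-review similarities for one item with two extracted aspects,
# e.g. "meatball" (four reviews) and "doesn't take too long" (one review).
sims = [np.array([0.9, 0.9, 0.9, 0.9]), np.array([0.8])]
s_qi = aspect_fusion_score(sims, k_r=1, agg="amean")          # ≈ 0.85
top = borda_fuse([["A", "B", "C"], ["B", "A", "C"]], k_i=3)
```

Note how, with K_R = 1, the rare aspect contributes equally to the fused score even though it has only one review — exactly Desideratum 2.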
Table 1: Distribution of ground truth (GT) and LLM-extracted aspects for Recipe-MPR queries and items.

  # of Aspects          1    2    3    4   5  6  7  8
  Items (GT)           76  282   72   29  10  1  2  1
  Queries (GT)          0  294  103   14   0  0  0  0
  Queries (Extracted)   0    2  342   55  12  0  0  0

Figure 3: Monolithic LF versus Aspect Fusion with AMean aggregation. Both methods perform similarly on the fully overlapping dataset, but Aspect Fusion performs significantly better than Monolithic LF for the fully disjoint dataset and K_R < 30. For the fully disjoint dataset, Aspect Fusion drops in performance for K_R > 10 because when K_R exceeds the number of reviews per aspect, scoring is based on reviews that are irrelevant to the given aspect. This decline in performance does not apply in the fully overlapping case.

5.1. Data Generation

We create four datasets for our experiments based on the Recipe-MPR dataset and our new LLM-generated reviews. Firstly, the fully overlapping dataset includes 20 reviews per item, which each mention all of the aspects of the item. Secondly, the fully disjoint dataset includes 10 reviews for each aspect of a given item. We also modify the fully disjoint dataset to create two datasets with imbalanced aspect frequencies. In the one rare aspect dataset, we remove all but one of the reviews for a randomly-selected aspect of each item. In the one popular aspect dataset, we keep all ten reviews for only one randomly-selected aspect of each item, and keep only one review for the other aspects.

In order to generate reviews, the GT aspects for each correct item in Recipe-MPR were used to prompt GPT-4. The total number of items for which there were GT aspects is 473. The distribution of the number of aspects per query and item is shown in Table 1. On average, each item has 2.2 aspects. The prompts we used to generate the reviews are included in the Appendix.

Recipe-MPR contains logical AND queries with ground truth (GT) labels for the query aspects. Refer to Subsection 3.1 for an example of a query q and its GT aspects 𝒜_q^query. Since the focus of this paper is on MA-RIR, we only included the 411 queries whose associated correct item had at least two aspects. For each of these queries, we used two-shot examples to have GPT-4 extract "at least two non-overlapping spans" representing the relevant aspects in the query.

5.2. Experimental Details

For our query and review embeddings, we used TAS-B [9]. For the listwise reranking experiments, we used the gpt-3.5-turbo-16k model. For the CE reranking experiments, the model used was ms-marco-MiniLM-L-12-v2¹.

¹ https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2

6. Experimental Results

RQ1: Is Aspect Fusion helpful when item aspects are discussed disjointly across reviews?

Table 2 lists the mean average precision at 10 (MAP@10) and recall@10 (Re@10) of the stage 1 dense retrieval for various settings of K_R. The table is broken up according to whether the disjoint or overlapping reviews are used. Throughout this paper, we show results for K_I = 10. In our experiments we noticed that varying K_I led to minor changes in the results. For completeness, we report results for K_I = 5 in the Appendix.

We see that for the fully overlapping dataset, Aspect Fusion is approximately equivalent to the Monolithic LF approach, while for the fully disjoint dataset, the Aspect Fusion score aggregation approaches (arithmetic mean, harmonic mean, and geometric mean) offer a significant improvement in performance compared to the Monolithic LF approach. This pattern offers empirical evidence that Aspect Fusion is better suited to disjoint aspect distributions than Monolithic LF. More specifically, this suggests that Monolithic LF is not symmetrical across aspects, and fails to consider information from each of the aspects in a balanced way.

Additionally, for the fully disjoint dataset, the performance of the aspect-based approach suffers for K_R > 10. This can be explained by the fact that when K_R exceeds the number of disjoint reviews available for a given aspect (10 in this data), the aspect-based methods will score items based on reviews that are irrelevant to a given aspect. This could result in correct items receiving low scores for some aspects. We conclude that Aspect Fusion should use K_R ≤ R_i^{a,min}, where R_i^{a,min} is the smallest number of reviews for an item i for an aspect, in order to avoid this performance drop.

Furthermore, the fact that the score aggregation methods outperform the rank-based aggregation methods (R-R and Borda) offers evidence that the embedding similarity scores contain significant information about how well an item's reviews align with a given query aspect, above and beyond that item's rank relative to the other candidate items. Considering the simplicity and strong performance of AMean score aggregation, we focus on this Aspect Fusion method in the remaining results below.

RQ2: How does review aspect frequency imbalance affect Monolithic LF and Aspect Fusion?

Table 3 shows the performance of the stage 1 dense retrieval for the balanced frequency (fully disjoint) dataset and the two datasets with imbalance in the review aspect frequency. These results are also presented visually in Figure 4.
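For reference, the MAP@10 and Re@10 values reported in these tables can be computed per query as follows. This is a standard sketch, not the authors' evaluation code; in Recipe-MPR each query typically has a single correct item, so the relevant set is usually a singleton:

```python
def average_precision_at_k(ranked: list, relevant: set, k: int = 10) -> float:
    """AP@k: precision at each rank holding a relevant item, averaged over
    min(|relevant|, k). MAP@k is the mean of AP@k over all queries."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k) if relevant else 0.0

def recall_at_k(ranked: list, relevant: set, k: int = 10) -> float:
    """Fraction of relevant items appearing in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0

# With one correct item ranked 2nd: AP@10 = 1/2 and Re@10 = 1.
print(average_precision_at_k(["x", "gold", "y"], {"gold"}))  # 0.5
print(recall_at_k(["x", "gold", "y"], {"gold"}))             # 1.0
```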
Table 2: LF versus Aspect Fusion with six aggregation functions for both the Fully Disjoint and Fully Overlapping datasets, with 95% error margins in parentheses. Each cell is MAP@10 / Re@10.

Fully disjoint
  K_R       1                   2                   5                   10                  15                  30
  Mono LF   .41(.04)/.67(.05)   .41(.04)/.69(.04)   .42(.04)/.69(.04)   .41(.04)/.70(.04)   .43(.04)/.70(.04)   .41(.04)/.68(.05)
  AMean     .56(.04)/.77(.04)   .56(.04)/.77(.04)   .56(.04)/.78(.04)   .56(.04)/.76(.04)   .51(.04)/.75(.04)   .16(.03)/.24(.04)
  Borda     .38(.04)/.56(.05)   .39(.04)/.59(.05)   .38(.04)/.58(.05)   .37(.04)/.57(.05)   .35(.04)/.53(.05)   .13(.03)/.21(.04)
  GMean     .57(.04)/.77(.04)   .56(.04)/.77(.04)   .56(.04)/.78(.04)   .56(.04)/.77(.04)   .51(.04)/.75(.04)   .16(.03)/.24(.04)
  HMean     .57(.04)/.77(.04)   .57(.04)/.77(.04)   .57(.04)/.78(.04)   .56(.04)/.77(.04)   .52(.04)/.75(.04)   .16(.03)/.24(.04)
  Min       .43(.04)/.62(.05)   .01(.01)/.03(.02)   .01(.01)/.03(.02)   .01(.01)/.03(.02)   .01(.01)/.03(.02)   .01(.01)/.03(.02)
  R-R       .21(.03)/.66(.05)   .21(.03)/.66(.05)   .21(.03)/.66(.05)   .21(.03)/.66(.05)   .19(.02)/.66(.05)   .06(.02)/.22(.04)

Fully overlapping
  K_R       1                   2                   5                   10                  15                  30
  Mono LF   .50(.04)/.73(.04)   .51(.04)/.75(.04)   .51(.04)/.74(.04)   .51(.04)/.75(.04)   .50(.04)/.75(.04)   .50(.04)/.75(.04)
  AMean     .52(.04)/.74(.04)   .52(.04)/.76(.04)   .52(.04)/.77(.04)   .52(.04)/.75(.04)   .52(.04)/.75(.04)   .53(.04)/.74(.04)
  Borda     .33(.04)/.51(.05)   .33(.04)/.52(.05)   .34(.04)/.52(.05)   .34(.04)/.53(.05)   .33(.04)/.52(.05)   .33(.04)/.51(.05)
  GMean     .52(.04)/.75(.04)   .52(.04)/.76(.04)   .52(.04)/.77(.04)   .52(.04)/.76(.04)   .52(.04)/.76(.04)   .53(.04)/.75(.04)
  HMean     .52(.04)/.75(.04)   .52(.04)/.76(.04)   .52(.04)/.76(.04)   .52(.04)/.76(.04)   .53(.04)/.76(.04)   .52(.04)/.76(.04)
  Min       .32(.04)/.52(.05)   .01(.01)/.03(.02)   .01(.01)/.03(.02)   .01(.01)/.03(.02)   .01(.01)/.03(.02)   .01(.01)/.03(.02)
  R-R       .17(.02)/.61(.05)   .17(.02)/.60(.05)   .17(.02)/.61(.05)   .18(.02)/.63(.05)   .18(.02)/.63(.05)   .17(.02)/.62(.05)

Table 3: Effect of aspect frequency imbalance on Monolithic LF and Aspect Fusion. "Balanced frequency" refers to the fully disjoint dataset where all item aspects have the same number of reviews. The values in parentheses indicate the 95% error margin. Each cell is MAP@10 / Re@10.

  K_R                           1                   2                   5                   10                  15                  30
  Balanced Frequency, AMean     .56(.04)/.77(.04)   .56(.04)/.77(.04)   .56(.04)/.78(.04)   .56(.04)/.76(.04)   .51(.04)/.75(.04)   .16(.03)/.24(.04)
  Balanced Frequency, Mono LF   .41(.04)/.67(.05)   .41(.04)/.69(.04)   .42(.04)/.69(.04)   .41(.04)/.70(.04)   .43(.04)/.70(.04)   .41(.04)/.68(.05)
  One Popular Aspect, AMean     .52(.04)/.73(.04)   .43(.04)/.68(.05)   .33(.04)/.58(.05)   .27(.04)/.52(.05)   .02(.01)/.03(.02)   .01(.01)/.03(.02)
  One Popular Aspect, Mono LF   .36(.04)/.62(.05)   .32(.04)/.60(.05)   .28(.04)/.53(.05)   .26(.04)/.49(.05)   .27(.04)/.51(.05)   .27(.04)/.52(.05)
  One Rare Aspect, AMean        .52(.04)/.74(.04)   .45(.04)/.67(.05)   .36(.04)/.58(.05)   .34(.04)/.54(.05)   .15(.03)/.23(.04)   .06(.02)/.09(.03)
  One Rare Aspect, Mono LF      .39(.04)/.65(.05)   .36(.04)/.64(.05)   .32(.04)/.57(.05)   .30(.04)/.53(.05)   .32(.04)/.55(.05)   .29(.04)/.51(.05)

Figure 4: Effect of aspect frequency. Aspect Fusion performs better than Monolithic LF for low values of K_R, but suffers for higher values of K_R. This pattern is explained in the discussion of RQ1.

Figure 5: Aspect Fusion with GT vs. extracted query aspects with fully disjoint reviews. Although GT query aspects perform better, Aspect Fusion still offers an improvement over Monolithic LF with extracted query aspects.

Note that this imbalance can only be analyzed for the case where the reviews cover disjoint, rather than overlapping, aspects. Based on our conclusion above, we focus on the results for K_R = 1 in this section, since for the datasets with imbalanced review aspect frequency, R_i^{a,min} = 1. We see that there is a significant decrease in performance for all methods when aspect frequency imbalance is introduced. This result suggests that balance in reviews across aspects is helpful for both Monolithic LF and Aspect Fusion.

Furthermore, for K_R = 1, the performance of Monolithic LF decreases more than that of the Aspect Fusion methods when aspect frequency imbalance is introduced. For example, the MAP@10 of Monolithic LF decreased from 0.41 to 0.36 on the one popular aspect dataset, representing a 12% drop, compared to a 7% drop for the Aspect Fusion approach. This suggests Aspect Fusion methods may be more robust to aspect frequency imbalance.

Lastly, we note that the performance of Monolithic LF decreases as K_R grows large, which occurs because any relevant item aspects that are infrequently reviewed (there is only 1 review for rare aspects in these datasets) will contribute less and less to the query-item score with an increase in K_R.
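This dilution effect follows directly from Eq. (1). In the toy example below (hypothetical similarity values), an item has one on-aspect review among nine off-aspect ones, mirroring the one rare aspect setting where R_i^{a,min} = 1; once K_R exceeds the number of on-aspect reviews, low off-aspect scores are averaged in:

```python
import numpy as np

def lf_score(sims, k_r):
    """Eq. (1): mean of the top-K_R similarity scores."""
    return float(np.sort(np.asarray(sims))[::-1][:k_r].mean())

# One high-scoring on-aspect review, nine low-scoring off-aspect reviews.
sims = [0.9] + [0.1] * 9
print(lf_score(sims, k_r=1))  # 0.9  -- only the on-aspect review is used
print(lf_score(sims, k_r=5))  # ~0.26 -- off-aspect reviews dilute the score
```

Keeping K_R ≤ R_i^{a,min}, as recommended in RQ1, avoids this dilution for the aspect-based methods.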
𝐾𝑅                                       1                    2                    5                    10                   15                   30
Dataset            Method   Reranker  MAP@10    Re@10      MAP@10    Re@10      MAP@10    Re@10      MAP@10    Re@10      MAP@10    Re@10      MAP@10    Re@10
Fully disjoint     AMean    CE        .36 (.04) .77 (.04)  .51 (.04) .77 (.04)  .53 (.04) .78 (.04)  .53 (.04) .76 (.04)  .53 (.04) .75 (.04)  .16 (.03) .24 (.04)
                            LW        .40 (.04) .77 (.04)  .45 (.04) .77 (.04)  .53 (.04) .78 (.04)  .55 (.04) .76 (.04)  .52 (.04) .75 (.04)  .16 (.03) .24 (.04)
                            No        .56 (.04) .77 (.04)  .56 (.04) .77 (.04)  .56 (.04) .78 (.04)  .56 (.04) .76 (.04)  .51 (.04) .75 (.04)  .16 (.03) .24 (.04)
                   Mono LF  CE        .35 (.04) .67 (.05)  .38 (.04) .69 (.04)  .40 (.04) .69 (.04)  .44 (.04) .70 (.04)  .47 (.04) .70 (.04)  .47 (.04) .68 (.05)
                            LW        .33 (.04) .67 (.05)  .34 (.04) .69 (.04)  .38 (.04) .69 (.04)  .40 (.04) .70 (.04)  .42 (.04) .70 (.04)  .46 (.04) .68 (.05)
                            No        .41 (.04) .67 (.05)  .41 (.04) .69 (.04)  .42 (.04) .69 (.04)  .41 (.04) .70 (.04)  .43 (.04) .70 (.04)  .41 (.04) .68 (.05)
Fully overlapping  AMean    CE        .51 (.04) .74 (.04)  .52 (.04) .76 (.04)  .52 (.04) .77 (.04)  .51 (.04) .75 (.04)  .48 (.04) .75 (.04)  .50 (.04) .74 (.04)
                            LW        .43 (.04) .74 (.04)  .48 (.04) .76 (.04)  .55 (.04) .77 (.04)  .53 (.04) .75 (.04)  .53 (.04) .75 (.04)  .52 (.04) .74 (.04)
                            No        .52 (.04) .74 (.04)  .52 (.04) .76 (.04)  .52 (.04) .77 (.04)  .52 (.04) .75 (.04)  .52 (.04) .75 (.04)  .53 (.04) .74 (.04)
                   Mono LF  CE        .50 (.04) .73 (.04)  .50 (.04) .75 (.04)  .50 (.04) .74 (.04)  .50 (.04) .75 (.04)  .48 (.04) .75 (.04)  .50 (.04) .75 (.04)
                            LW        .45 (.04) .73 (.04)  .47 (.04) .75 (.04)  .52 (.04) .74 (.04)  .53 (.04) .75 (.04)  .53 (.04) .75 (.04)  .52 (.04) .75 (.04)
                            No        .50 (.04) .73 (.04)  .51 (.04) .75 (.04)  .51 (.04) .74 (.04)  .51 (.04) .75 (.04)  .50 (.04) .75 (.04)  .50 (.04) .75 (.04)

Figure 6: Comparison of reranking methods. Performance generally increases as more reviews are included in the LLM input; using too few reviews can hurt performance.

Figure 7: Ranks of correct items after Stage 1 Monolithic LF (x axis) and Stage 2 Cross-Encoder reranking (y axis) on the Fully Disjoint dataset. Circle size is proportional to position frequency, and the center of mass is shown in red. For 𝐾𝑅 = 1, most of the mass lies above the diagonal line, meaning the reranker has worsened performance. On the other hand, for 𝐾𝑅 = 30, most of the mass lies below the diagonal line, meaning that the reranker has improved performance.

RQ3: How does the use of extracted query aspects instead of GT query aspects affect Aspect Fusion?

Table 4 shows the same results as Table 2 except for the case of the extracted query aspects. These results are also presented visually in Figure 5. At 𝐾𝑅 = 1, while the MAP@10 of Aspect Fusion drops from 0.56 with GT aspects to 0.46 with extracted aspects, it remains higher than the 0.41 MAP@10 of Monolithic LF. This result implies that Aspect Fusion is useful even when GT query aspects are unknown.

RQ4: Are LLMs effective MA-RIR rerankers?

Table 5 summarizes the performance of the listwise² and cross-encoder rerankers. We see there is a beneficial effect from increasing the number of reviews 𝐾𝑅 given to the language model for both CE and listwise reranking. Specifically, for reranking Monolithic LF on the fully disjoint dataset, listwise MAP@10 improves from 0.33 at 𝐾𝑅 = 1 to 0.46 at 𝐾𝑅 = 30. Similarly, CE MAP@10 improves from 0.35 at 𝐾𝑅 = 1 to 0.47 at 𝐾𝑅 = 30. We conjecture this large increase in MAP@10 with 𝐾𝑅 is due to the quadratic nature of cross-attention across input text.

² Approximately 1% of queries had only 9 items returned by the listwise reranker instead of 10; this was an error in generative retrieval.

Since Aspect Fusion did best with low 𝐾𝑅 values, a possible reason that we did not observe any benefits of LLM reranking for Aspect Fusion is that 𝐾𝑅 was not high enough. Also, while some reranking settings showed 2nd stage MAP@10 increases over 1st stage values (such as 𝐾𝑅 = 30 reranking of Monolithic LF for fully disjoint data), when too few reviews were given to the reranker, the second stage sometimes made performance worse, such as at 𝐾𝑅 = 1.

Figure 7 shows a heatmap of the ranks assigned to the correct items by the stage 1 retriever and stage 2 reranker. An effective reranker would consistently improve the ranks for the correct item, and this would result in the center of mass lying below the diagonal. We see that this is indeed the case for a high value of 𝐾𝑅, but is not the case for a low value of 𝐾𝑅. The raw values underlying this figure are provided in the Appendix.

7. Related Work

7.1. Multi-level Retrieval

The most relevant work to ours is that on RIR by Abdollah Pour et al. [1], which formulates the RIR problem and studies EF and LF approaches. In addition to LF with an off-the-shelf bi-encoder such as TAS-B, the authors also contrastively fine-tune an encoder for LF and show performance improvements over off-the-shelf LF. Extending their contrastive learning approach to MA-RIR Aspect Fusion is a natural direction for future work. As mentioned in Section 2.2.1, Zhang and Balog [6] have previously studied the Object Fusion problem, which allows for more general two-level structures than RIR (in which a low-level document cannot describe more than one high-level object). However, they did not study neural techniques or multi-aspect retrieval, which are key to our work.

7.2. Multi-aspect Retrieval

In addition to releasing Recipe-MPR, which was used to generate review distributions in this work, Zhang et al. [3] use the queries and items in Recipe-MPR in a multi-aspect question-answering setting, and find that FS GPT-3 listwise prompting achieves far superior accuracy to all other methods. However, it is computationally infeasible to use such listwise prompting methods for first-stage retrieval. Kong et al. [10] consider multiple aspects when calculating relevance scores in dense retrieval, but assume documents and queries contain a fixed number of aspects from known categories. Similarly, the label aggregation method of Kang et al. [11] explicitly deals with multiple query aspects, but also assumes a fixed number of known categories. Another method, Multi-Aspect Dense Retrieval (MADRM) [10], learns early fusion embeddings of documents and queries by extracting and then aggregating their aspects, and reports improvements over Monolithic LF baselines. DORIS-MAE [12] presents a dataset that deconstructs complex queries into hierarchies of aspects and sub-aspects. Unlike our aspect extraction approach, which extracts aspects from queries using few-shot prompting with an LLM, DORIS-MAE predefines these aspects and their corresponding topic hierarchy for both queries and document corpora. Finally, some recent works study multi-aspect LLM-driven conversational recommendation [13], including work on preference elicitation over multiple aspects [14] and knowledge-graph-based topic-guided chatbots [15].

8. Conclusions

By extending reviewed-item retrieval (RIR) to a setting with multi-aspect queries and items, we were able to both theoretically and empirically demonstrate the failure modes of Monolithic Late Fusion (LF) when there is an imbalance in how aspects are distributed across reviews. Specifically, since Monolithic LF is aspect-agnostic, it is subject to a frequency bias in its review selection towards more popular aspects. Furthermore, the disjointness of aspects across reviews can induce a selection bias towards certain aspects if monolithic multi-aspect query embeddings are closer to review embeddings for those aspects.

To address these failure modes, we propose Aspect Fusion as a robust MA-RIR method for imbalanced review distributions. Using the recently released Recipe-MPR dataset, specifically designed to study multi-aspect retrieval, we design four generated datasets that allow us to empirically test the effects of review imbalances from aspect frequency and disjointness. Our experiments show that Aspect Fusion is much more robust to non-uniform review variations than Monolithic LF, outperforming the latter with a 44% MAP@10 increase on some distributions.

References

[1] M. M. Abdollah Pour, P. Farinneya, A. Toroghi, A. Korikov, A. Pesaranghader, T. Sajed, M. Bharadwaj, B. Mavrin, S. Sanner, Self-supervised contrastive BERT fine-tuning for fusion-based reviewed-item retrieval, in: European Conference on Information Retrieval, Springer, 2023, pp. 3–17. doi:10.1007/978-3-031-28244-7_1.
[2] S. Kemper, J. Cui, K. Dicarlantonio, K. Lin, D. Tang, A. Korikov, S. Sanner, Retrieval-augmented conversational recommendation with prompt-based semi-structured natural language state tracking, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '24, Association for Computing Machinery, New York, NY, USA, 2024. doi:10.1145/3626772.3657670.
[3] H. Zhang, A. Korikov, P. Farinneya, M. M. Abdollah Pour, M. Bharadwaj, A. Pesaranghader, X. Y. Huang, Y. X. Lok, Z. Wang, N. Jones, S. Sanner, Recipe-MPR: A test collection for evaluating multi-aspect preference-based natural language retrieval, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 2744–2753. doi:10.1145/3539618.3591880.
[4] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3982–3992. doi:10.18653/v1/D19-1410.
[5] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7 (2021) 535–547. doi:10.1109/TBDATA.2019.2921572.
[6] S. Zhang, K. Balog, Design patterns for fusion-based object retrieval, in: European Conference on Information Retrieval, Springer, 2017, pp. 684–690. doi:10.1007/978-3-319-56608-5_66.
[7] K. Ethayarajh, How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 55–65. URL: https://aclanthology.org/D19-1006. doi:10.18653/v1/D19-1006.
[8] X. Ma, X. Zhang, R. Pradeep, J. Lin, Zero-shot listwise document reranking with a large language model, 2023. arXiv:2305.02156.
[9] S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, A. Hanbury, Efficiently teaching an effective dense retriever with balanced topic aware sampling, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 113–122.
[10] W. Kong, S. Khadanga, C. Li, S. K. Gupta, M. Zhang, W. Xu, M. Bendersky, Multi-aspect dense retrieval, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 3178–3186. doi:10.1145/3534678.3539137.
[11] C. Kang, X. Wang, Y. Chang, B. Tseng, Learning to rank with multi-aspect relevance for vertical search, in: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM '12, Association for Computing Machinery, New York, NY, USA, 2012, pp. 453–462. doi:10.1145/2124295.2124350.
[12] J. Wang, K. Wang, X. Wang, P. Naidu, L. Bergen, R. Paturi, DORIS-MAE: Scientific document retrieval using multi-level aspect-based queries, in: Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Curran Associates Inc., Red Hook, NY, USA, 2024. doi:10.5555/3666122.3667790.
[13] Y. Deldjoo, Z. He, J. McAuley, A. Korikov, S. Sanner, A. Ramisa, R. Vidal, M. Sathiamoorthy, A. Kasirzadeh, S. Milano, A review of modern recommender systems using generative models (Gen-RecSys), in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '24, Barcelona, Spain, 2024.
[14] D. E. Austin, A. Korikov, A. Toroghi, S. Sanner, Bayesian optimization with LLM-based acquisition functions for natural language preference elicitation, in: Proceedings of the 18th ACM Conference on Recommender Systems, RecSys '24, 2024.
[15] K. Zhou, Y. Zhou, W. X. Zhao, X. Wang, J.-R. Wen, Towards topic-guided conversational recommender system, arXiv preprint arXiv:2010.04125 (2020).

A. Appendix

A.1. LLM Prompts

We provide the prompts used for overlapping review generation, disjoint review generation, query aspect extraction, and listwise reranking in Figures 8, 9, 10, and 11, respectively.

Figure 8: Overlapping Review Generation Prompt Used with GPT-4

Figure 9: Disjoint Review Generation Prompt Used with GPT-4

Figure 10: Query Aspect Extraction Prompt Used with GPT-4

Figure 11: Generic Listwise Reranking Prompt Used with GPT-3.5

Table 6: Stage 1 retriever performance for various aggregation functions and settings of 𝐾𝑅, with 𝐾𝐼 = 5. All methods except Mono LF include Aspect Fusion. The values in parentheses indicate the 95% error margin.
𝐾𝑅                            1                    2                    5                    10                   15                   30
Dataset            Method   MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5
Fully disjoint     AMean    .55 (.04) .71 (.04)  .55 (.04) .70 (.04)  .55 (.04) .70 (.04)  .55 (.04) .69 (.04)  .50 (.04) .67 (.05)  .16 (.03) .21 (.04)
                   Borda    .37 (.04) .49 (.05)  .38 (.04) .49 (.05)  .37 (.04) .48 (.05)  .36 (.04) .45 (.05)  .34 (.04) .44 (.05)  .13 (.03) .18 (.04)
                   GMean    .56 (.04) .71 (.04)  .55 (.04) .71 (.04)  .55 (.04) .71 (.04)  .55 (.04) .69 (.04)  .50 (.04) .67 (.05)  .16 (.03) .21 (.04)
                   HMean    .56 (.04) .71 (.04)  .56 (.04) .71 (.04)  .56 (.04) .71 (.04)  .55 (.04) .69 (.04)  .51 (.04) .68 (.05)  .16 (.03) .21 (.04)
                   Min      .42 (.04) .54 (.05)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)
                   Mono LF  .39 (.04) .56 (.05)  .39 (.04) .57 (.05)  .40 (.04) .57 (.05)  .40 (.04) .58 (.05)  .42 (.04) .61 (.05)  .39 (.04) .58 (.05)
                   R-R      .25 (.03) .53 (.05)  .25 (.03) .53 (.05)  .25 (.03) .54 (.05)  .26 (.03) .54 (.05)  .22 (.03) .47 (.05)  .07 (.02) .15 (.03)
Fully overlapping  AMean    .51 (.04) .67 (.05)  .51 (.04) .68 (.05)  .51 (.04) .67 (.05)  .50 (.04) .66 (.05)  .51 (.04) .66 (.05)  .52 (.04) .66 (.05)
                   Borda    .32 (.04) .45 (.05)  .32 (.04) .45 (.05)  .33 (.04) .44 (.05)  .33 (.04) .44 (.05)  .32 (.04) .43 (.05)  .32 (.04) .44 (.05)
                   GMean    .51 (.04) .67 (.05)  .51 (.04) .68 (.05)  .51 (.04) .66 (.05)  .50 (.04) .65 (.05)  .51 (.04) .66 (.05)  .51 (.04) .66 (.05)
                   HMean    .51 (.04) .66 (.05)  .51 (.04) .68 (.05)  .51 (.04) .66 (.05)  .51 (.04) .66 (.05)  .52 (.04) .66 (.05)  .51 (.04) .66 (.05)
                   Min      .31 (.04) .44 (.05)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)
                   Mono LF  .49 (.04) .66 (.05)  .50 (.04) .66 (.05)  .50 (.04) .66 (.05)  .50 (.04) .67 (.05)  .49 (.04) .65 (.05)  .48 (.04) .65 (.05)
                   R-R      .20 (.03) .50 (.05)  .22 (.03) .50 (.05)  .22 (.03) .50 (.05)  .21 (.03) .50 (.05)  .21 (.03) .48 (.05)  .22 (.03) .49 (.05)

Table 7: Stage 1 retriever performance by review aspect frequency and settings of 𝐾𝑅, with 𝐾𝐼 = 5. “Balanced” refers to the fully disjoint dataset where all item aspects have the same number of reviews. The values in parentheses indicate the 95% error margin.

𝐾𝑅                             1                    2                    5                    10                   15                   30
Dataset             Method   MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5
Balanced frequency  AMean    .55 (.04) .71 (.04)  .55 (.04) .70 (.04)  .55 (.04) .70 (.04)  .55 (.04) .69 (.04)  .50 (.04) .67 (.05)  .16 (.03) .21 (.04)
                    Mono LF  .39 (.04) .56 (.05)  .39 (.04) .57 (.05)  .40 (.04) .57 (.05)  .40 (.04) .58 (.05)  .42 (.04) .61 (.05)  .39 (.04) .58 (.05)
One popular aspect  AMean    .51 (.04) .65 (.05)  .42 (.04) .58 (.05)  .31 (.04) .45 (.05)  .25 (.04) .39 (.05)  .01 (.01) .02 (.01)  .01 (.01) .02 (.01)
                    Mono LF  .35 (.04) .50 (.05)  .30 (.04) .46 (.05)  .26 (.04) .40 (.05)  .24 (.04) .38 (.05)  .25 (.04) .38 (.05)  .25 (.04) .38 (.05)
One rare aspect     AMean    .51 (.04) .65 (.05)  .43 (.04) .59 (.05)  .35 (.04) .48 (.05)  .33 (.04) .45 (.05)  .15 (.03) .20 (.04)  .06 (.02) .08 (.03)
                    Mono LF  .37 (.04) .53 (.05)  .34 (.04) .51 (.05)  .30 (.04) .45 (.05)  .28 (.04) .43 (.05)  .30 (.04) .45 (.05)  .28 (.04) .40 (.05)

A.2. Results for 𝐾𝐼 = 5

In the main body we showed various results of experiments where 𝐾𝐼 was set to 10. We found that varying 𝐾𝐼 within this order of magnitude had a very small effect on the results, and therefore did not include findings for other settings of 𝐾𝐼 above. For completeness, in this section we duplicate the preceding tables but use 𝐾𝐼 = 5 instead of 𝐾𝐼 = 10. See Tables 6, 7, 8, and 9 for these results.

A.3. Data for Figure 7

In Figure 7, we show the number of queries for which the correct item was ranked in a certain position by the stage 1 retriever and stage 2 reranker. The underlying data for this figure is shown in Table 10.

Table 8: Stage 1 retriever performance by whether labelled GT or extracted query aspects are used, with 𝐾𝐼 = 5. The values in parentheses indicate the 95% error margin.

𝐾𝑅                                  1                    2                    5                    10                   15                   30
                         Method   MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5
Extracted query aspects  AMean    .45 (.04) .60 (.05)  .46 (.04) .63 (.05)  .46 (.04) .62 (.05)  .45 (.04) .61 (.05)  .44 (.04) .62 (.05)  .14 (.03) .20 (.04)
                         Mono LF  .39 (.04) .56 (.05)  .39 (.04) .57 (.05)  .40 (.04) .57 (.05)  .40 (.04) .58 (.05)  .42 (.04) .61 (.05)  .39 (.04) .58 (.05)
GT query aspects         AMean    .55 (.04) .71 (.04)  .55 (.04) .70 (.04)  .55 (.04) .70 (.04)  .55 (.04) .69 (.04)  .50 (.04) .67 (.05)  .16 (.03) .21 (.04)
                         Mono LF  .39 (.04) .56 (.05)  .39 (.04) .57 (.05)  .40 (.04) .57 (.05)  .40 (.04) .58 (.05)  .42 (.04) .61 (.05)  .39 (.04) .58 (.05)

Table 9: Stage 2 reranker performance by reranking method and setting of 𝐾𝑅, with 𝐾𝐼 = 5. “No” refers to the case where no reranking is applied, and is equivalent to the stage 1 results. The values in parentheses indicate the 95% error margin.

𝐾𝑅                                       1                    2                    5                    10                   15                   30
Dataset            Method   Reranker  MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5
Fully disjoint     AMean    CE        .41 (.04) .71 (.04)  .52 (.04) .70 (.04)  .53 (.04) .70 (.04)  .53 (.04) .69 (.04)  .53 (.04) .67 (.05)  .17 (.03) .21 (.04)
                            LW        .44 (.04) .71 (.04)  .48 (.04) .70 (.04)  .53 (.04) .70 (.04)  .55 (.04) .69 (.04)  .53 (.04) .67 (.05)  .16 (.03) .21 (.04)
                            No        .55 (.04) .71 (.04)  .55 (.04) .70 (.04)  .55 (.04) .70 (.04)  .55 (.04) .69 (.04)  .50 (.04) .67 (.05)  .16 (.03) .21 (.04)
                   Mono LF  CE        .35 (.04) .56 (.05)  .37 (.04) .57 (.05)  .38 (.04) .57 (.05)  .41 (.04) .58 (.05)  .46 (.04) .61 (.05)  .46 (.04) .58 (.05)
                            LW        .33 (.04) .56 (.05)  .34 (.04) .57 (.05)  .37 (.04) .57 (.05)  .37 (.04) .58 (.05)  .44 (.04) .61 (.05)  .43 (.04) .58 (.05)
                            No        .39 (.04) .56 (.05)  .39 (.04) .57 (.05)  .40 (.04) .57 (.05)  .40 (.04) .58 (.05)  .42 (.04) .61 (.05)  .39 (.04) .58 (.05)
Fully overlapping  AMean    CE        .51 (.04) .67 (.05)  .51 (.04) .68 (.05)  .51 (.04) .67 (.05)  .50 (.04) .66 (.05)  .49 (.04) .66 (.05)  .51 (.04) .66 (.05)
                            LW        .48 (.04) .67 (.05)  .50 (.04) .68 (.05)  .52 (.04) .67 (.05)  .51 (.04) .66 (.05)  .53 (.04) .66 (.05)  .53 (.04) .66 (.05)
                            No        .51 (.04) .67 (.05)  .51 (.04) .68 (.05)  .51 (.04) .67 (.05)  .50 (.04) .66 (.05)  .51 (.04) .66 (.05)  .52 (.04) .66 (.05)
                   Mono LF  CE        .50 (.04) .66 (.05)  .50 (.04) .66 (.05)  .50 (.04) .66 (.05)  .50 (.04) .67 (.05)  .48 (.04) .65 (.05)  .49 (.04) .65 (.05)
                            LW        .47 (.04) .66 (.05)  .48 (.04) .66 (.05)  .52 (.04) .66 (.05)  .53 (.04) .67 (.05)  .51 (.04) .65 (.05)  .51 (.04) .65 (.05)
                            No        .49 (.04) .66 (.05)  .50 (.04) .66 (.05)  .50 (.04) .66 (.05)  .50 (.04) .67 (.05)  .49 (.04) .65 (.05)  .48 (.04) .65 (.05)

Table 10: Ranks assigned to the correct items for the stage 1 retriever and stage 2 CE reranker with AMean aggregation Aspect Fusion.

                       Stage 2 correct item rank
Stage 1 rank      1    2    3    4    5    6    7    8    9   10
𝐾𝑅 = 1
  1              68   19   12    9    2    2    4    2    0    0
  2              12   20    8    1    3    4    1    3    0    0
  3               5    3    8    5    2    2    2    1    1    1
  4               1    2    2    4    1    4    2    0    1    0
  5               1    1    1    2    3    0    2    2    1    1
  6               0    0    0    1    0    0    1    1    1    1
  7               3    3    2    2    1    3    1    1    3    0
  8               1    0    2    0    1    0    1    0    4    0
  9               0    0    0    0    1    0    1    0    2    0
  10              0    1    0    1    1    1    2    1    0    1
𝐾𝑅 = 30
  1              83   21    5    5    0    2    1    0    0    0
  2              29   12    2    1    2    2    3    1    1    0
  3              12    5    3    3    2    0    1    2    0    0
  4               7    2    1    1    1    3    1    0    0    0
  5               7    2    7    5    0    0    1    1    0    0
  6               5    1    3    1    0    2    2    0    0    0
  7               3    1    1    1    0    1    0    0    0    1
  8               4    0    0    1    0    2    1    0    0    0
  9               0    2    1    0    1    1    0    1    0    0
  10              1    2    1    0    0    0    1    1    0    0
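For concreteness, the score-based aggregation strategies compared in Tables 6 and 9 (AMean, GMean, HMean, Min) can be sketched in pure Python. This is a minimal illustration under our own naming, not the authors' implementation: the cosine-similarity helper, function names, and toy vectors below are ours, and the rank-based aggregators (Borda, R-R) are omitted. Each query aspect is first scored against an item's reviews by late fusion (averaging the top-𝐾𝑅 similarities), and the per-aspect scores are then combined into a single query-item score:

```python
from math import sqrt, prod

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def aspect_score(aspect_emb, review_embs, k_r):
    """Late fusion for one query aspect: mean of the top-K_R aspect-review similarities."""
    sims = sorted((cosine(aspect_emb, r) for r in review_embs), reverse=True)
    return sum(sims[:k_r]) / min(k_r, len(sims))

# Score-based aggregators from Tables 6 and 9; GMean and HMean assume
# positive scores, and the rank-based Borda / R-R variants are omitted.
AGGREGATORS = {
    "AMean": lambda s: sum(s) / len(s),                   # arithmetic mean
    "GMean": lambda s: prod(s) ** (1.0 / len(s)),         # geometric mean
    "HMean": lambda s: len(s) / sum(1.0 / x for x in s),  # harmonic mean
    "Min":   min,                                         # worst-matched aspect
}

def aspect_fusion_score(aspect_embs, review_embs, k_r, agg="AMean"):
    """Fuse per-aspect late-fusion scores into a single query-item score."""
    per_aspect = [aspect_score(a, review_embs, k_r) for a in aspect_embs]
    return AGGREGATORS[agg](per_aspect)
```

Under this structure, a rare aspect covered by a single review keeps full weight in the per-aspect average, but raising 𝐾𝑅 dilutes its late-fusion score with weaker similarities; the collapse of Min in Table 6 for 𝐾𝑅 ≥ 2 is likewise consistent with a single weakly matched aspect determining the whole score.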