<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multi-Aspect Reviewed-Item Retrieval via LLM Query Decomposition and Aspect Fusion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anton Korikov</string-name>
          <email>anton.korikov@mail.utoronto.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Saad</string-name>
          <email>g.saad@mail.utoronto.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ethan Baron</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mustafa Khan</string-name>
          <email>mr.khan@mail.utoronto.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manav Shah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Scott Sanner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SIGIR'24 Workshop on Information Retrieval's Role in RAG Systems</institution>
          ,
          <addr-line>July</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Toronto</institution>
          ,
          <addr-line>Toronto</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>pect Fusion can improve over LF by</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>While user-generated product reviews often contain large quantities of information, their utility in addressing natural language product queries has been limited, with a key challenge being the need to aggregate information from multiple low-level sources (reviews) to a higher item level during retrieval. Existing methods for reviewed-item retrieval (RIR) typically take a late fusion (LF) approach which computes query-item scores by simply averaging the top-K query-review similarity scores for an item. However, we demonstrate that for multi-aspect queries and multi-aspect items, LF is highly sensitive to the distribution of aspects covered by reviews in terms of aspect frequency and the degree of aspect separation across reviews. To address these LF failures, we propose several novel aspect fusion (AF) strategies which include Large Language Model (LLM) query extraction and generative reranking. Our experiments show that for imbalanced review corpora, AF can improve over LF by a MAP@10 increase from 0.36 ± 0.04 to 0.52 ± 0.04, while achieving equivalent performance for balanced review corpora.</p>
      </abstract>
      <kwd-group>
        <kwd>Dense retrieval</kwd>
        <kwd>query decomposition</kwd>
        <kwd>multi-aspect retrieval</kwd>
        <kwd>LLM reranking</kwd>
        <kwd>late fusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        User-generated reviews are an abundant and rich source
of data that has the potential to be used to improve the
retrieval of reviewed-items such as products, services, or
destinations. However, a challenge of using review data
for retrieval is that information has to be aggregated across
multiple (low-level) reviews to a (higher) item-level during
retrieval. Recent work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], defining this Reviewed-Item
Retrieval setting as RIR, showed that state-of-the-art results
could be achieved by using a bi-encoder to aggregate review
information to an item-level in a process called late fusion
(LF). As opposed to aggregating review information to an
item-level before query-scoring (early fusion), LF first
computes query-review similarity to avoid losing information
before scoring, and then averages the top- query-review
similarity scores to get a query-item similarity score.
Recently, LF has been implemented by retrieval augmented
generation (RAG) driven conversational recommendation
(ConvRec) systems for generative recommendation,
explanation, and interactive question answering [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>In this paper, we extend RIR to a multi-aspect retrieval
setting, formulating what we call multi-aspect RIR (MA-RIR).
In this problem, our goal is to retrieve relevant items for
a multi-aspect query by using the reviews of multi-aspect
items. Specifically, for an item with multiple aspects, we
assume that each review describes at least one, and up to
all, of the item’s aspects.</p>
      <sec id="sec-1-1">
        <title>As our primary contributions:</title>
        <p>• We formulate the MA-RIR problem and identify failure modes of LF under imbalanced review-aspect distributions, considering imbalances due to both aspect frequency and the degree of aspect separation across reviews.</p>
        <p>• We propose several novel aspect fusion strategies, which include LLM query extraction and reranking, to address failures of LF review-score aggregation on imbalanced multi-aspect review distributions.</p>
        <p>
          • We leverage a recently released multi-aspect retrieval dataset, Recipe-MPR [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], with ground-truth query- and item-aspect labels to generate four multi-aspect review distributions with various aspect balance properties, and numerically evaluate the effect of review-aspect balance on MA-RIR.
        </p>
        <p>• Our simulations show that for imbalanced data, Aspect Fusion can improve over LF by a MAP@10 increase from 0.36 ± 0.04 to 0.52 ± 0.04, while achieving equivalent performance for balanced data.</p>
        <p>• We show that LLM reranking in both cross-encoder and zero-shot (ZS) listwise reranking settings can provide some improvements when given a large enough number of reviews, but risks decreasing performance when not enough reviews are provided.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. Neural IR</title>
        <p>
          Given a set of documents $\mathcal{D}$ and a query $q \in \mathcal{Q}$, an IR task $\langle \mathcal{Q}, \mathcal{D} \rangle$ is to assign a similarity score $s_{q,d} \in \mathbb{R}$ between the query and each document $d \in \mathcal{D}$ and return a ranked list of top scoring documents. The standard first-stage neural IR method [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for a large corpus is to first use a bi-encoder $f(\cdot) : \mathcal{Q} \cup \mathcal{D} \rightarrow \mathbb{R}^n$ to map a query $q$ and document $d$ to their respective embeddings $f(q) = \mathbf{z}_q$ and $f(d) = \mathbf{z}_d$. A similarity function $\mathrm{sim}(\cdot, \cdot) : \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}$, such as the dot product, is then used to compute a query-document score $s_{q,d} = \mathrm{sim}(\mathbf{z}_q, \mathbf{z}_d)$. For web-scale corpora, exact similarity search for the top query-document scores is typically impractical, so approximate similarity search algorithms [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] are used instead.
        </p>
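        <p>As a concrete illustration of this first stage, the minimal Python sketch below scores documents against a query with an off-the-shelf bi-encoder and the dot product. It is not the authors' code; the TAS-B checkpoint name is simply one plausible choice, matching the retriever used later in this paper.</p>
        <preformat>
# Minimal bi-encoder scoring sketch (illustrative, not the paper's code).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")

query = "meatball recipe that doesn't take too long"
docs = [
    "These meatballs were tender and flavorful.",
    "Ready in 25 minutes, perfect for a weeknight.",
]

z_q = model.encode(query, convert_to_tensor=True)    # query embedding z_q
z_d = model.encode(docs, convert_to_tensor=True)     # document embeddings z_d
scores = util.dot_score(z_q, z_d)                    # s_{q,d} = sim(z_q, z_d)
ranked = scores.squeeze(0).argsort(descending=True)  # indices of top documents
        </preformat>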
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Reviewed-Item Retrieval</title>
        <sec id="sec-2-2-1">
          <title>2.2.1. Problem Formulation</title>
          <p>
            Information retrieval across two-level data structures was previously studied by Zhang and Balog [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. Specifically, Zhang and Balog define the Object Retrieval problem, where (high-level) objects are described by multiple (low-level) documents. Given a query, the task is to retrieve high-level objects by using information in the low-level documents.
          </p>
          <p>
            To investigate a special case of object retrieval where the goal is retrieving items (e.g., products, destinations) based on their reviews, Abdollah Pour et al. [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] recently proposed the Reviewed-Item Retrieval (RIR) problem. In the RIR $\langle \mathcal{I}, \mathcal{R}, \mathcal{Q} \rangle$ problem, there is a set of items $\mathcal{I}$, where every item $i$ is a high-level object. Each item is described by a set of reviews (i.e., "low-level documents") $\mathcal{R}_i \subset \mathcal{R}$, and the $j$'th review of item $i$ is $r_{i,j} \in \mathcal{R}_i$. The main difference between RIR and Object Retrieval is that in RIR a low-level document $r_{i,j}$ cannot describe more than one high-level object $i$, while Object Retrieval allows for more general two-level structures. Given a query $q \in \mathcal{Q}$ and a score $s_{q,i}$ between $q$ and each item $i$, the goal of RIR is to retrieve a ranked list of top-$K$ scoring items: $L_q = (i_1, ..., i_K)$ s.t. $i_1 \in \arg\max_i \{s_{q,i}\}$ and $s_{q,i_k} \geq s_{q,i_{k+1}}$.
          </p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Fusion</title>
          <p>
            To get a query-item score $s_{q,i}$ using an item's review set $\mathcal{R}_i$, review information needs to be aggregated to an item level: this process is called fusion. Two alternatives exist for fusion [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]: if low-level information is aggregated before a query is used for scoring, it is called Early Fusion (EF); in contrast, if the aggregation occurs after query-scoring, it is called Late Fusion (LF).
          </p>
          <p>
            For EF in RIR, Abdollah Pour et al. [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] experiment with mean-pooling and contrastive learning methods to create an item embedding $\mathbf{z}_i \in \mathbb{R}^n$ from the review embeddings $\{\mathbf{z}_{r_{i,j}}\}_{r_{i,j} \in \mathcal{R}_i}$. They then directly compute the similarity between $\mathbf{z}_i$ and a query embedding $\mathbf{z}_q$ as the query-item score $s_{q,i} = \mathrm{sim}(\mathbf{z}_q, \mathbf{z}_i)$.
          </p>
          <p>For LF in RIR, these authors first compute query-review similarity scores $s_{q,r_{i,j}} = \mathrm{sim}(\mathbf{z}_q, \mathbf{z}_{r_{i,j}})$. They then aggregate these scores into a query-item score $s_{q,i}$ by averaging the top $K_r$ query-review scores for each item:</p>
          <p>$$s_{q,i} = \frac{1}{K_r} \sum_{j=1}^{K_r} s_{q,r_{i,j}} \qquad (1)$$</p>
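          <p>A minimal sketch of the LF computation in Equation (1), assuming review embeddings and a query embedding have already been produced by the bi-encoder (the data layout and names here are illustrative):</p>
          <preformat>
import numpy as np

def late_fusion_score(query_emb, review_embs, k_r=10):
    """Eq. (1): average the top-k_r query-review dot products for one item."""
    scores = review_embs @ query_emb   # s_{q,r} for every review of the item
    top_k = np.sort(scores)[-k_r:]     # keep the k_r highest scores
    return float(top_k.mean())         # query-item score s_{q,i}

# Usage: rank items by fused score, given a dict mapping each item id to
# an array of its review embeddings (hypothetical data layout).
# fused = {i: late_fusion_score(z_q, embs) for i, embs in item_reviews.items()}
          </preformat>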
          <p>
            Numerical evaluations performed for EF and LF for RIR demonstrate that EF has significantly worse performance than LF [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ], and Abdollah Pour et al. conjecture that EF performs worse because it loses fine-grained review information before query-scoring. In contrast, by delaying fusion, LF preserves review-level information during query-scoring. Due to these findings, we do not study EF for MA-RIR; rather, we focus on developing Aspect Fusion as an extension of LF, discussed next.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Multi-Aspect Reviewed Item</title>
    </sec>
    <sec id="sec-4">
      <title>Retrieval</title>
      <sec id="sec-4-1">
        <title>3.1. Multi-Aspect Queries</title>
        <p>This paper focuses on retrieving relevant items using their reviews for a multi-aspect query, such as "Can I have a meatball recipe that doesn't take too long?". We define a query aspect to be a sub-span of a multi-aspect query that represents a distinct topic (or facet) in the query, for instance the sub-spans "meatball" and "doesn't take too long" in the previous sentence. While there is ambiguity in identifying which sub-spans, if any, in a query should be considered aspects, this sub-span based definition is a simple way to represent aspects and is conducive to overlap-based evaluations of aspect extraction such as intersection-over-union (IOU). Formally, we denote the set of aspects in query $q$ as $\mathcal{A}_q^{query}$, where the $l$'th query aspect is $a_{q,l}^{query} \in \mathcal{A}_q^{query}$.</p>
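        <p>For instance, a simple token-level IOU between an extracted sub-span and a ground-truth sub-span can be computed as in the sketch below; this is an illustrative metric implementation under the stated token-set assumption, not necessarily the paper's exact evaluation code:</p>
        <preformat>
def span_iou(pred_span, gold_span):
    """Token-level intersection-over-union between two aspect sub-spans."""
    pred = set(pred_span.lower().split())
    gold = set(gold_span.lower().split())
    union = pred.union(gold)
    if len(union) == 0:
        return 0.0
    return len(pred.intersection(gold)) / len(union)

print(span_iou("doesn't take too long", "take too long"))  # 3/4 = 0.75
        </preformat>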
        <p>In this work, multi-aspect queries are assumed to be logical
AND queries for all aspects, though an aspect itself can
represent other logical operators such as XOR (e.g. a query
aspect may be “chicken or beef ”). Finally, we assume all
query aspects are equally important — a further discussion
of weighted multi-aspect retrieval can be found in Section
7.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Multi-Aspect Reviewed-Items</title>
        <p>In addition to considering multi-aspect queries, we also consider multi-aspect items described by reviews. For instance, a multi-aspect item that is relevant to the multi-aspect query example above might be a recipe titled "Beef meatballs cooked in canned soup, ready in 25 minutes". However, since our goal is to isolate the properties of review-based retrieval, we assume that no such natural language (NL) item-level description is available. Instead, we assume that the item's aspects are described in reviews. Obviously, item-level descriptions (e.g. titles) are often available in practice, so a prime direction for future work is fusion across multiple levels of NL data during reviewed-item retrieval.</p>
        <p>Examples of reviews describing the item in the previous paragraph, which has aspects "meatballs" and "ready in 25 minutes", are shown in Figure 1. In this paper, we assume that a review $r_{i,j}$ must mention at least one item aspect $a_{i,k}^{item} \in \mathcal{A}_i^{item}$ and could mention up to all item aspects. Formally, the distribution of item aspects across reviews can be defined with a bipartite aspect distribution graph $G = \{\mathcal{R}, \mathcal{A}^{item}, \mathcal{E}\}$ where an edge $(r_{i,j}, a_{i,k}^{item}) \in \mathcal{E}$ exists if review $r_{i,j} \in \mathcal{R}_i$ mentions aspect $a_{i,k}^{item} \in \mathcal{A}_i^{item}$. We also let $\mathcal{A}^{rel,q} \subseteq \mathcal{A}^{item}$ represent the set of item-aspects that are relevant to a query and should be considered during retrieval. We define the MA-RIR $\langle \mathcal{I}, \mathcal{E}, \mathcal{Q}, \mathcal{A} \rangle$ problem as the task of retrieving a ranked list of relevant multi-aspect items $L_q$ for a multi-aspect query $q$, where $\mathcal{A} = \mathcal{A}^{item} \cup \mathcal{A}^{query}$.</p>
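        <p>To make the aspect distribution graph concrete, the short sketch below represents $G$ as a set of edges and computes the two imbalance statistics discussed next, aspect frequency and degree of separation; the toy data and names are illustrative assumptions.</p>
        <preformat>
from collections import Counter

# Edges of the bipartite graph G: (review id, item aspect) pairs meaning
# that the review mentions the aspect (toy data for one item).
edges = [
    ("r1", "meatballs"), ("r2", "ready in 25 minutes"),
    ("r3", "ready in 25 minutes"), ("r4", "ready in 25 minutes"),
]

# Aspect frequency: how often each aspect is mentioned across reviews.
aspect_freq = Counter(aspect for _, aspect in edges)

# Degree of separation: mean number of aspects mentioned per review;
# a value of 1.0 means aspects are fully disjoint across reviews.
aspects_per_review = Counter(review for review, _ in edges)
mean_aspects = sum(aspects_per_review.values()) / len(aspects_per_review)

print(aspect_freq)   # Counter({'ready in 25 minutes': 3, 'meatballs': 1})
print(mean_aspects)  # 1.0: fully disjoint, with frequency imbalance
        </preformat>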
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Multi-Aspect Review Distributions</title>
        <p>As we will demonstrate with numerical simulations on LLM-generated review data, understanding review distributions in terms of aspect frequency and degree of aspect separation between reviews is key to designing successful MA-RIR techniques. Figure 1 shows two extremes of aspect distributions that are among the distributions we explore in our experiments.</p>
        <sec id="sec-4-3-1">
          <title>3.3.1. Fully Overlapping Distributions</title>
          <p>Figure 1a) shows a fully overlapping distribution, in which every review mentions every item aspect: all aspects appear with identical frequency, and there is no separation of aspects across reviews.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>3.3.2. Degree of Separation and Aspect Frequency</title>
          <p>In contrast to the case of perfect review-aspect balance, Figure 1b) shows an extreme case of aspect imbalance. Firstly, one aspect is mentioned much more frequently than another: this is an aspect frequency imbalance. Secondly, each review mentions only one aspect: this is a maximal degree of separation of aspects across reviews (fully disjoint). Mathematically, $G$ has $|\mathcal{A}_i^{item}|$ (disjoint) star components, where some stars have a significantly higher degree than others. In the next section, we discuss the negative effects of imbalanced review-aspect distributions on LF performance on MA-RIR, and propose aspect fusion as a method for mitigating these negative effects.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Aspect Fusion for MA-RIR</title>
      <sec id="sec-5-1">
        <title>4.1. Desiderata of Aspect Fusion</title>
        <p>Recall that LF computes a query-item similarity score by averaging the top $K_r$ query-review similarity scores using Equation (1). For MA-RIR, we propose two desiderata for the aspect distribution in the top $K_r$ reviews during fusion.</p>
        <p>Desideratum 1: Since we assume multi-aspect queries are AND queries, if an item contains $n_{rel,q}$ relevant aspects for query $q$, the $K_r$ reviews used for LF should mention all $n_{rel,q}$ of those relevant aspects.</p>
        <p>Desideratum 2: As mentioned in Section 3.1, we also assume all query aspects are equally important, which implies that aspect frequency should be identical for all $n_{rel,q}$ aspects in the top $K_r$ retrieved reviews.</p>
        <p>In a fully overlapping distribution (Figure 1a) where each review mentions each aspect, both Desiderata 1 and 2 are guaranteed to be satisfied by any subset of item reviews. We thus argue that standard LF should be sufficient when reviews fully overlap in aspects, and focus on developing Aspect Fusion methods that address the failures of LF for imbalanced review-aspect distributions.</p>
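        <p>Given ground-truth review-aspect labels, both desiderata can be checked programmatically, as in this illustrative sketch (the data layout and names are assumptions, not the paper's code):</p>
        <preformat>
from collections import Counter

def check_desiderata(top_reviews, review_aspects, relevant_aspects):
    """Desideratum 1: every relevant aspect appears among the fused reviews.
    Desideratum 2: all relevant aspects appear with identical frequency."""
    counts = Counter(a for r in top_reviews
                     for a in review_aspects[r] if a in relevant_aspects)
    covers_all = set(counts) == set(relevant_aspects)          # Desideratum 1
    balanced = covers_all and len(set(counts.values())) == 1   # Desideratum 2
    return covers_all, balanced
        </preformat>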
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Failures of LF under Review-Aspect</title>
      </sec>
      <sec id="sec-5-3">
        <title>Imbalance</title>
        <p>Standard LF will fail to achieve Desiderata 1 and 2 for
reviewaspect distributions with at least some degree of
disjointedness and aspect frequency imbalance under the following
assumptions.</p>
        <p>Aspect Popularity Bias Aspects that are reviewed more
frequently are more likely to be mentioned in the top  
reviews.</p>
        <p>
          The non-isotropic nature of the embedding space [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] biases retrieval towards one aspect. Consider two equally sized and fully disjoint review subsets $\mathcal{R}_i^a \subset \mathcal{R}_i$ and $\mathcal{R}_i^b \subset \mathcal{R}_i$, in which reviews mention only a single aspect, $a^{rel} \in \mathcal{A}^{rel,q}$ or $b^{rel} \in \mathcal{A}^{rel,q}$, respectively, for some item $i$. If query-review similarity scores tend to be higher when a review describes aspect $a^{rel}$ as opposed to aspect $b^{rel}$, LF will be more likely to select reviews from review set $\mathcal{R}_i^a$ for the top $K_r$ fused reviews. For example, in Figure 1b), the reviews describing cooking time might be more likely to score higher with the full query than reviews describing "meatballs".
        </p>
      </sec>
      <sec id="sec-5-4">
        <title>4.3. Aspect Fusion</title>
        <p>To address these failures of LF on imbalanced data, we introduce several methods for Aspect Fusion, which explicitly utilize the multi-aspect nature of reviews during fusion to address multi-aspect queries.</p>
        <sec id="sec-5-4-1">
          <title>4.3.1. Aspect Extraction</title>
          <p>To extract aspects from queries, we propose to use few-shot (FS) prompting with an LLM. Though the number of query aspects is typically not known a priori, since we study multi-aspect queries, our proposed prompt (Figure 10 in the Appendix) asks that at least two non-overlapping sub-spans of the query be extracted as aspects. We represent the set of extracted query aspects for query $q$ as $\mathcal{A}_q^{ext}$ and let $n_e = |\mathcal{A}_q^{ext}|$.</p>
        </sec>
        <sec id="sec-5-4-2">
          <title>4.3.2. Aspect-Item Scoring</title>
          <p>The key to Aspect Fusion is directly computing aspect-review similarity scores $s_{a,r_{i,j}}$, as opposed to similarity scores between reviews and a monolithic query, since the latter can be negatively impacted by review-aspect distribution imbalance. Aspect similarity scores are computed by separately embedding each extracted aspect $a \in \mathcal{A}_q^{ext}$ as $\mathbf{z}_a = f(a)$ and calculating $s_{a,r_{i,j}} = \mathrm{sim}(\mathbf{z}_a, \mathbf{z}_{r_{i,j}})$. Then, aspect-item scores $s_{a,i} \in \mathbb{R}$ are obtained by aggregating the top $K_r$ aspect-review scores via Eq. (1), with aspect-review scores in place of query-review scores. For each extracted aspect $a$, the top-$K$ scoring items are ordered into a list $L_a = (i_1, ..., i_K)$ s.t. $i_1 \in \arg\max_i \{s_{a,i}\}$ and $s_{a,i_k} \geq s_{a,i_{k+1}}$. Figure 2a) shows how standard (monolithic) LF will take a biased review sample of the first aspect, since it is more frequently mentioned by reviews and $\mathbf{z}_q$ happens to be closer to those review embeddings. To differentiate between LF for RIR as proposed by Abdollah Pour et al. and Aspect Fusion, we will refer to LF as Monolithic LF, since it uses the full query.</p>
          <p>After aspect-item scoring, we must aggregate the $n_e$ top-$K$ item lists for each aspect, $\{L_a\}_{a \in \mathcal{A}_q^{ext}}$, into a single ranked list of top-$K$ items for the query, $L_q$. We examine six aggregation strategies, which can be categorized as four score aggregation methods and two rank aggregation methods, all illustrated in the sketch after this list. The score-based variants convert the $n_e$ aspect-item scores into a query-item score $s_{q,i}$ using:</p>
          <p>1. AMean: Arithmetic mean
2. GMean: Geometric mean
3. HMean: Harmonic mean
4. Min: Minimum</p>
          <p>to return the final ranked list $L_q$. The two rank-based list aggregation methods are:</p>
          <p>1. Borda: Borda count
2. R-R: Round-robin (interleaved) merge</p>
          <p>In Borda count, the score for a given item $i$ is calculated as $\sum_{a=1}^{n_e} (K - \mathrm{rank}_a^i + 1)$, where $\mathrm{rank}_a^i$ is the rank of item $i$ in list $L_a$. In a round-robin merge of $n_e$ lists, elements from each list are merged in a cyclic order, and when a conflict arises with a particular item, that item is skipped and the merge continues from the same list.</p>
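          <p>The following Python sketch illustrates these six aggregation strategies under stated assumptions: aspect_scores maps each extracted aspect to a per-item score dict, all names are illustrative, and GMean assumes positive scores.</p>
          <preformat>
import numpy as np

def aggregate_scores(aspect_scores, method="amean"):
    """Convert the n_e aspect-item scores into one query-item score per item.
    aspect_scores: dict aspect -> dict item -> score (every item scored)."""
    items = next(iter(aspect_scores.values())).keys()
    fused = {}
    for i in items:
        s = np.array([aspect_scores[a][i] for a in aspect_scores])
        fused[i] = {"amean": s.mean(),
                    "gmean": np.exp(np.log(s).mean()),  # assumes s positive
                    "hmean": len(s) / (1.0 / s).sum(),
                    "min": s.min()}[method]
    return sorted(fused, key=fused.get, reverse=True)   # ranked item list L_q

def borda(lists, k):
    """Borda count over the n_e per-aspect top-k item lists."""
    points = {}
    for ranked in lists:
        for rank, i in enumerate(ranked):              # rank is 0-indexed, so
            points[i] = points.get(i, 0) + (k - rank)  # this is k - rank + 1
    return sorted(points, key=points.get, reverse=True)

def round_robin(lists):
    """Interleave the per-aspect lists cyclically; a duplicate item is
    skipped and the merge continues from the same list."""
    iters = [iter(l) for l in lists]
    merged, seen = [], set()
    while iters:
        for it in list(iters):
            for i in it:
                if i not in seen:   # skip items already merged
                    seen.add(i)
                    merged.append(i)
                    break           # took one new item; move to the next list
            else:
                iters.remove(it)    # this list is exhausted
    return merged
          </preformat>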
        </sec>
      </sec>
      <sec id="sec-5-5">
        <title>4.4. LLM Reranking</title>
        <p>
          In addition to Aspect Fusion, we also introduce an LLM reranking step for MA-RIR; to the best of our knowledge, LLM reranking has not been previously studied in a reviewed-item setting. Our goal is to understand whether LLMs in cross-encoder (CE) or ZS listwise [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] settings can fuse reviews of multi-aspect items for effective reranking.
        </p>
        <sec id="sec-5-5-1">
          <title>After a list</title>
          <p>of top   items is returned from the first
stage,   reviews for each item need to be given to the LLM
for what we call fusion-during-reranking. For Monolithic LF,
these   reviews are simply the   reviews used for LF. For
Aspect Fusion, since   reviews were used for fusion with
each aspect, we propose to perform a round-robin merge of
a balanced distribution of reviews across aspects.
the top   review lists for each aspect in order to preserve</p>
          <p>For a CE, reviews are simply concatenated and
crossencoded with the query. For listwise reranking, our prompt
provides the LLM with the query, initial ranked list of item
IDs, reviews for each item, and instructions to order the
items based on relevance to the query — the full listwise
reranking prompt is in Figure 11 in the Appendix.</p>
        </sec>
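        <p>A hedged sketch of how such a listwise prompt could be assembled is shown below; the wording is an illustrative assumption, not the paper's verbatim prompt (which is in Figure 11), and all names are hypothetical.</p>
        <preformat>
def build_listwise_prompt(query, ranked_item_ids, reviews_by_item, k_r=10):
    """Assemble a ZS listwise reranking prompt (illustrative wording)."""
    lines = [f"Query: {query}",
             f"Initial ranking: {ranked_item_ids}",
             "Order the following items by relevance to the query."]
    for item_id in ranked_item_ids:
        lines.append(f"Item {item_id} reviews:")
        # For Aspect Fusion, reviews_by_item[item_id] holds a round-robin
        # merge of the per-aspect review lists, keeping aspects balanced.
        for review in reviews_by_item[item_id][:k_r]:
            lines.append(f"- {review}")
    lines.append("Return the item IDs, most relevant first.")
    return "\n".join(lines)
        </preformat>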
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Experimental Method</title>
      <p>We perform simulations on generated review data to study the effect of aspect balance across reviews and test our hypothesis that Aspect Fusion is more robust to aspect imbalances than Monolithic LF. While using synthetic data exposes our results to biases from the data generation process, we are able to generate synthetic review distributions with far greater control than would have been possible several years ago, before the advent of LLMs. We specifically design experiments to study the performance of Aspect Fusion vs. Monolithic LF under the presence of aspect imbalance, both in the form of disjointedness of aspects across reviews and imbalanced aspect frequencies.</p>
      <p>
        In order to perform our experimentation, we need a dataset that has (a) multi-aspect queries and items, (b) ground-truth (GT) aspect labels, and (c) item reviews. To the best of our knowledge, there is no existing dataset with all of these properties. However, the recently-released Recipe-MPR dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] includes properties (a) and (b). We leverage this dataset and generate item reviews using GPT-4. We create four datasets for our experiments based on the Recipe-MPR dataset and our new LLM-generated reviews.
      </p>
      <p>Firstly, the fully overlapping dataset includes 20 reviews
per item, which each mention all of the aspects of the item.
Secondly, the fully disjoint dataset includes 10 reviews for
each aspect of a given item. We also modify the fully
disjoint dataset to create two datasets with imbalanced aspect
frequencies. In the one rare aspect dataset, we remove all
but one of the reviews for a randomly-selected aspect of
each item. In the one popular aspect dataset, we keep all ten
reviews for only one randomly-selected aspect of each item,
and keep only one review for the other aspects.</p>
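      <p>A hedged sketch of how the four distributions can be derived for one item from its generated reviews; the data layout, variable names, and random selection below are illustrative assumptions, not the paper's code:</p>
      <preformat>
import random

def build_distributions(per_aspect_reviews, overlapping_reviews):
    """per_aspect_reviews: dict aspect -> list of 10 generated reviews;
    overlapping_reviews: 20 reviews that each mention all item aspects."""
    fully_overlapping = list(overlapping_reviews)   # 20 reviews per item
    fully_disjoint = [r for revs in per_aspect_reviews.values() for r in revs]

    aspects = list(per_aspect_reviews)
    rare = random.choice(aspects)        # keep only 1 review for this aspect
    one_rare = [r for a in aspects
                for r in (per_aspect_reviews[a][:1] if a == rare
                          else per_aspect_reviews[a])]

    popular = random.choice(aspects)     # keep all 10 only for this aspect
    one_popular = [r for a in aspects
                   for r in (per_aspect_reviews[a] if a == popular
                             else per_aspect_reviews[a][:1])]
    return fully_overlapping, fully_disjoint, one_rare, one_popular
      </preformat>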
      <p>In order to generate reviews, the GT aspects for each
correct item in Recipe-MPR were used to prompt GPT-4.
The total number of items for which there were GT aspects
is 473. The distribution of the number of aspects per query
and item is shown in Table 1. On average, each item has 2.2
aspects. The prompts we used to generate the reviews are
included in the Appendix.</p>
      <p>Recipe-MPR contains logical AND queries with ground
truth (GT) labels for the query aspects. Refer to subsection
3.1 for an example of a query  and its GT aspects,   query.
Since the focus of this paper is on MA-RIR, we only included
the 411 queries whose associated correct item had at least
two aspects. For each of these queries, we used two-shot
examples to have GPT-4 extract “at least two non-overlapping
spans” representing the relevant aspects in the query.</p>
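      <p>The sketch below shows what such a two-shot extraction call might look like; the prompt wording and example queries are illustrative assumptions (the paper's actual prompt is in Figure 10), while the OpenAI client usage is the library's standard chat completion call.</p>
      <preformat>
from openai import OpenAI

client = OpenAI()
prompt = (
    "Extract at least two non-overlapping spans representing the relevant "
    "aspects in the query.\n"
    "Query: vegetarian pasta that kids will like\n"
    "Aspects: vegetarian; pasta; kids will like\n"
    "Query: a spicy soup I can freeze\n"
    "Aspects: spicy; soup; can freeze\n"
    "Query: Can I have a meatball recipe that doesn't take too long?\n"
    "Aspects:"
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
aspects = [a.strip() for a in response.choices[0].message.content.split(";")]
      </preformat>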
      <sec id="sec-6-1">
        <title>5.2. Experimental Details</title>
        <p>
          For our query and review embeddings, we used TAS-B [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. For the listwise reranking experiments, we used the gpt-3.5-turbo-16k model. For the CE reranking experiments, the model used was ms-marco-MiniLM-L-12-v2 (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2).
        </p>
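        <p>A minimal sketch of the CE reranking step with this checkpoint, assuming fused reviews per item are already available (the function and variable names are illustrative):</p>
        <preformat>
from sentence_transformers import CrossEncoder

ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def ce_rerank(query, candidate_items, reviews_by_item):
    """Score each candidate by cross-encoding the query with the
    concatenation of that item's fused reviews (Section 4.4)."""
    pairs = [(query, " ".join(reviews_by_item[i])) for i in candidate_items]
    scores = ce.predict(pairs)
    order = sorted(range(len(candidate_items)),
                   key=lambda j: scores[j], reverse=True)
    return [candidate_items[j] for j in order]
        </preformat>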
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Experimental Results</title>
      <p>RQ1: Is Aspect Fusion helpful when item aspects are
discussed disjointly across reviews?</p>
      <p>Table 2 lists the mean average precision at 10 (MAP@10) and recall@10 (Re@10) of the stage 1 dense retrieval for various settings of $K_r$. The table is broken up according to whether the disjoint or overlapping reviews are used. Throughout this paper, we show results for $K = 10$. In our experiments we noticed that varying $K$ led to minor changes in the results. For completeness, we report results for $K = 5$ in the Appendix.</p>
      <p>We see that for the fully overlapping dataset, Aspect Fusion is approximately equivalent to the Monolithic LF approach, while for the fully disjoint dataset, the Aspect Fusion score aggregation approaches (arithmetic mean, harmonic mean, and geometric mean) offer a significant improvement in performance compared to the Monolithic LF approach. This pattern offers empirical evidence that Aspect Fusion is better suited to disjoint aspect distributions than Monolithic LF. More specifically, this suggests that Monolithic LF is not symmetrical across aspects, and fails to consider information from each of the aspects in a balanced way.</p>
      <p>Additionally, for the fully disjoint dataset, the performance of the aspect-based approach suffers for $K_r > 10$. This can be explained by the fact that when $K_r$ exceeds the number of disjoint reviews available for a given aspect (10 in this data), the aspect-based methods will score items based on reviews that are irrelevant to a given aspect. This could result in correct items receiving low scores for some aspects. We conclude that Aspect Fusion should use $K_r \leq K_{i,min}$, where $K_{i,min}$ is the smallest number of reviews an item $i$ has for an aspect, in order to avoid this performance drop.</p>
      <p>Furthermore, the fact that the score aggregation methods outperform the rank-based aggregation methods (R-R and Borda) offers evidence that the embedding similarity scores contain significant information about how well an item's reviews align with a given query aspect, above and beyond that item's rank relative to the other candidate items. Considering the simplicity and strong performance of AMean score aggregation, we focus on this Aspect Fusion method in the remaining results below.</p>
      <p>RQ2: How does review aspect frequency imbalance affect Monolithic LF and Aspect Fusion?</p>
      <p>Table 3 shows the performance of the stage 1 dense
retrieval for the balanced frequency (fully disjoint) dataset and
the two datasets with imbalance in the review aspect
frequency. These results are also presented visually in Figure 4.</p>
      <p>Note that this imbalance can only be analyzed for the case
where the reviews cover disjoint, rather than overlapping,
aspects.</p>
      <p>Based on our conclusion above, we focus on the results for $K_r = 1$ in this section, since for the datasets with imbalanced review aspect frequency, $K_{i,min} = 1$. We see that there is a significant decrease in performance for all methods when aspect frequency imbalance is introduced. This result suggests that balance in reviews across aspects is helpful for both Monolithic LF and Aspect Fusion.</p>
      <p>Furthermore, for $K_r = 1$, the performance of Monolithic LF decreases more when aspect frequency imbalance is introduced, compared to that of the Aspect Fusion methods. For example, the MAP@10 of Monolithic LF decreased from 0.41 to 0.36 on the one popular aspect dataset, representing a 12% drop, compared to a 7% drop for the Aspect Fusion approach. This suggests Aspect Fusion methods may be more robust to aspect frequency imbalance.</p>
      <p>Lastly, we note that the performance of Monolithic LF decreases as $K_r$ grows large, which occurs because any relevant item aspects that are infrequently reviewed (there is only 1 review for rare aspects in these datasets) will contribute less and less to the query-item score as $K_r$ increases.</p>
      <p>Table 5 summarizes the performance of the listwise and cross-encoder rerankers. (Approximately 1% of queries had only 9 items returned by the listwise reranker instead of 10; this was an error in generative retrieval.) We see there is a beneficial effect to increasing the number of reviews $K_r$ given to the language model for both CE and listwise reranking. Specifically, for reranking Monolithic LF on the fully disjoint dataset, listwise MAP@10 improves from 0.33 to 0.46, for $K_r = 1$ and $K_r = 30$, respectively. Similarly, CE MAP@10 improves from 0.35 at $K_r = 1$ to 0.47 at $K_r = 30$. We conjecture this large increase in MAP@10 with $K_r$ is due to the quadratic nature of cross-attention across input text.</p>
      <p>Since Aspect Fusion did best with low $K_r$ values, a possible reason that we did not observe any benefits of LLM reranking for Aspect Fusion is that $K_r$ was not high enough. Also, while some reranking settings showed 2nd stage MAP@10 increases over 1st stage values (such as $K_r = 30$ reranking of Monolithic LF for fully disjoint data), when too few reviews were given to the reranker, the second stage sometimes made performance worse, such as at $K_r = 1$.</p>
      <p>Figure 7 shows a heatmap of the ranks assigned to the correct items by the stage 1 retriever and stage 2 reranker. An effective reranker would consistently improve the ranks of the correct item, and this would result in the center of mass lying below the anti-diagonal. We see that this is indeed the case for a high value of $K_r$, but is not the case for a low value of $K_r$. The raw values underlying this figure are provided in the Appendix.</p>
    </sec>
    <sec id="sec-8">
      <title>7. Related Work</title>
      <sec id="sec-8-1">
        <title>7.1. Multi-level Retrieval</title>
        <p>
          The most relevant work to ours is that on RIR by Abdollah Pour et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which formulates the RIR problem and studies EF and LF approaches. In addition to LF with an off-the-shelf bi-encoder such as TAS-B, the authors also contrastively fine-tune an encoder for LF and show performance improvements over off-the-shelf LF. Extending their contrastive learning approach to MA-RIR Aspect Fusion is a natural direction for future work. As mentioned in Section 2.2.1, Zhang and Balog [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] have previously studied the Object Retrieval problem, which allows for more general two-level structures than RIR (in which a low-level document cannot describe more than one high-level object). However, they did not study neural techniques or multi-aspect retrieval, which are key to our work.
        </p>
      </sec>
      <sec id="sec-8-2">
        <title>7.2. Multi-aspect Retrieval</title>
        <p>
          In addition to releasing Recipe-MPR, which was used to generate review distributions in this work, Zhang et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] use the queries and items in Recipe-MPR in a multi-aspect question-answering setting, and find that FS GPT-3 listwise prompting achieves far superior accuracy to all other methods. However, it is computationally infeasible to use such listwise prompting methods for first stage retrieval. Kong et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] consider multiple aspects when calculating relevance scores in dense retrieval, but assume documents and queries contain a fixed number of aspects from known categories. Similarly, the label aggregation method of Kang et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] explicitly deals with multiple query aspects, but assumes a fixed number of known categories.
        </p>
        <p>
          Another method, called Multi-Aspect Dense Retrieval (MADRM) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], learns early fusion embeddings of documents and queries by extracting and then aggregating their aspects, and reports improvements over Monolithic LF baselines. DORIS-MAE [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] presents a dataset that deconstructs complex queries into hierarchies of aspects and sub-aspects. Unlike our aspect extraction approach, which extracts aspects from queries using few-shot prompting with an LLM, DORIS-MAE predefines these aspects and their corresponding topic hierarchy for both queries and document corpora.
        </p>
        <p>
          Finally, some recent works study multi-aspect LLM-driven conversational recommendation [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], including work on preference elicitation over multiple aspects [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and knowledge graph based topic-guided chatbots [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>8. Conclusions</title>
      <p>By extending reviewed-item retrieval (RIR) to a setting with multi-aspect queries and items, we were able to both theoretically and empirically demonstrate the failure modes of Monolithic Late Fusion (LF) when there is an imbalance in how aspects are distributed across reviews. Specifically, since Monolithic LF is aspect-agnostic, it is subject to a frequency bias in its review selection towards more popular aspects. Furthermore, the disjointedness of aspects across reviews can induce a selection bias towards certain aspects if monolithic multi-aspect query embeddings are closer to review embeddings for those aspects.</p>
      <p>To address these failure modes, we propose Aspect Fusion as a robust MA-RIR method for imbalanced review distributions. Using the recently released Recipe-MPR dataset, specifically designed to study multi-aspect retrieval, we design four generated datasets that allow us to empirically test the effects of review imbalances from aspect frequency and disjointness. Our experiments show that Aspect Fusion is much more robust to non-uniform review variations than Monolithic LF, outperforming the latter with a 44% MAP@10 increase on some distributions.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Appendix A</title>
      <sec id="sec-10-1">
        <title>A.1. LLM Prompts</title>
        <p>We provide the prompts userd for overlapping review
generation, disjoint review generation, query aspect extraction,
and listwise reranking in Figures 8, 9, 10, and 11 respectively.
A.2. Results for   = 5
In the main body we showed various results of experiments
where   was set to 10. We found that varying   within this
order of magnitude had a very small efect on the results,
and therefore did not include findings for any other settings
of   above. For completeness, in this section we duplicate
the preceding tables but use   = 5 instead of   = 10. See
Tables 6, 7, 8, and 9 for these results.</p>
      </sec>
      <sec id="sec-10-2">
        <title>A.3. Data for Figure 7</title>
        <p>In Figure 7, we show the number of queries for which the correct item was ranked in a certain position by the stage 1 retriever and stage 2 reranker. The underlying data for this figure is shown in Table 10.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>M. M. Abdollah Pour</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Farinneya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Toroghi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Korikov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Pesaranghader</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Sajed</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bharadwaj</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mavrin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Sanner</surname>
          </string-name>
          ,
          <article-title>Self-supervised contrastive BERT fine-tuning for fusion-based reviewed-item retrieval</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>17</lpage>
          . doi:
          <volume>10</volume>
          .1007/ 978- 3-
          <fpage>031</fpage>
          - 28244-
          <issue>7</issue>
          _
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dicarlantonio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanner</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented conversational recommendation with prompt-based semistructured natural language state tracking</article-title>
          ,
          <source>in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '24,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          . doi:
          <volume>10</volume>
          .1145/3626772. 3657670.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Farinneya</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Abdollah Pour</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bharadwaj</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Pesaranghader</surname>
            ,
            <given-names>X. Y.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Y. X.</given-names>
          </string-name>
          <string-name>
            <surname>Lok</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Jones</surname>
          </string-name>
          , S. Sanner,
          <article-title>RecipeMPR: A test collection for evaluating multi-aspect preference-based natural language retrieval</article-title>
          ,
          <source>in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '23,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>2744</fpage>
          -
          <lpage>2753</lpage>
          . doi:
          <volume>10</volume>
          .1145/3539618.3591880.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          , Sentence-BERT:
          <article-title>Sentence embeddings using Siamese BERT-networks</article-title>
          , in: K. Inui,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          Wan (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3982</fpage>
          -
          <lpage>3992</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D19</fpage>
          - 1410.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , M. Douze,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Billion-scale similarity search with GPUs</article-title>
          ,
          <source>IEEE Transactions on Big Data</source>
          <volume>7</volume>
          (
          <year>2021</year>
          )
          <fpage>535</fpage>
          -
          <lpage>547</lpage>
          . doi:
          <volume>10</volume>
          .1109/TBDATA.
          <year>2019</year>
          .
          <volume>2921572</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , K. Balog,
          <article-title>Design patterns for fusionbased object retrieval</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>684</fpage>
          -
          <lpage>690</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>319</fpage>
          -56608-5_
          <fpage>66</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ethayarajh</surname>
          </string-name>
          ,
          <article-title>How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings</article-title>
          , in: K. Inui,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          Wan (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>65</lpage>
          . URL: https://aclanthology.org/D19-1006. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D19</fpage>
          -1006.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradeep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Zero-shot listwise document reranking with a large language model</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2305</volume>
          .
          <fpage>02156</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <article-title>Eficiently teaching an efective dense retriever with balanced topic aware sampling</article-title>
          ,
          <source>in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khadanga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Xu,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Bendersky, Multi-aspect dense retrieval</article-title>
          ,
          <source>in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          , KDD '22,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>3178</fpage>
          -
          <lpage>3186</lpage>
          . doi:
          <volume>10</volume>
          .1145/3534678. 3539137.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tseng</surname>
          </string-name>
          ,
          <article-title>Learning to rank with multi-aspect relevance for vertical search</article-title>
          ,
          <source>in: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining</source>
          , WSDM '12,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2012</year>
          , p.
          <fpage>453</fpage>
          -
          <lpage>462</lpage>
          . doi:
          <volume>10</volume>
          .1145/2124295.2124350.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Naidu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bergen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Paturi</surname>
          </string-name>
          , DORIS-MAE:
          <article-title>Scientific document retrieval using multi-level aspect-based queries</article-title>
          ,
          <source>in: Proceedings of the 37th International Conference on Neural Information Processing Systems</source>
          , NIPS '23, Curran Associates Inc.,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA,
          <year>2024</year>
          . doi:
          <volume>10</volume>
          .5555/3666122. 3667790.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deldjoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. McAuley</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Korikov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Sanner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramisa</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Vidal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Sathiamoorthy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kasirzadeh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Milano</surname>
          </string-name>
          ,
          <article-title>A review of modern recommender systems using generative models (gen-recsys)</article-title>
          ,
          <source>in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '24)</source>
          ,
          <source>August 25-29</source>
          ,
          <year>2024</year>
          , Barcelona, Spain,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Austin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Toroghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanner</surname>
          </string-name>
          ,
          <article-title>Bayesian optimization with LLM-based acquisition functions for natural language preference elicitation</article-title>
          ,
          <source>in: Proceedings of the 18th ACM Conference on Recommender Systems (RecSys'24)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>Towards topic-guided conversational recommender system</article-title>
          , arXiv preprint arXiv:
          <year>2010</year>
          .
          <volume>04125</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>