Multi-Aspect Reviewed-Item Retrieval via LLM Query Decomposition and Aspect Fusion

Anton Korikov¹,∗,†, George Saad¹,†, Ethan Baron¹, Mustafa Khan¹, Manav Shah¹ and Scott Sanner¹
¹ University of Toronto, Toronto, Canada

SIGIR'24 Workshop on Information Retrieval's Role in RAG Systems, July 18, 2024, Washington D.C., USA.
∗ Corresponding author. † These authors contributed equally.
anton.korikov@mail.utoronto.ca (A. Korikov); g.saad@mail.utoronto.ca (G. Saad); mr.khan@mail.utoronto.ca (M. Khan)
ORCID: 0009-0003-4487-9504 (A. Korikov); 0009-0000-3549-9874 (G. Saad); 0009-0004-2461-5760 (E. Baron); 0009-0008-3622-7270 (M. Khan); 0009-0008-4728-0771 (M. Shah); 0000-0001-7984-8394 (S. Sanner)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
While user-generated product reviews often contain large quantities of information, their utility in addressing natural language product queries has been limited, with a key challenge being the need to aggregate information from multiple low-level sources (reviews) to a higher item level during retrieval. Existing methods for reviewed-item retrieval (RIR) typically take a late fusion (LF) approach which computes query-item scores by simply averaging the top-K query-review similarity scores for an item. However, we demonstrate that for multi-aspect queries and multi-aspect items, LF is highly sensitive to the distribution of aspects covered by reviews in terms of aspect frequency and the degree of aspect separation across reviews. To address these LF failures, we propose several novel aspect fusion (AF) strategies which include Large Language Model (LLM) query extraction and generative reranking. Our experiments show that for imbalanced review corpora, AF can improve over LF by a MAP@10 increase from 0.36 ± 0.04 to 0.52 ± 0.04, while achieving equivalent performance for balanced review corpora.

Keywords: Dense retrieval, query decomposition, multi-aspect retrieval, LLM reranking, late fusion

1. Introduction

User-generated reviews are an abundant and rich source of data that has the potential to be used to improve the retrieval of reviewed items such as products, services, or destinations. However, a challenge of using review data for retrieval is that information has to be aggregated across multiple (low-level) reviews to a (higher) item level during retrieval. Recent work [1], defining this Reviewed-Item Retrieval setting as RIR, showed that state-of-the-art results could be achieved by using a bi-encoder to aggregate review information to an item level in a process called late fusion (LF). As opposed to aggregating review information to an item level before query-scoring (early fusion), LF first computes query-review similarity to avoid losing information before scoring, and then averages the top-K query-review similarity scores to get a query-item similarity score. Recently, LF has been implemented by retrieval augmented generation (RAG) driven conversational recommendation (ConvRec) systems for generative recommendation, explanation, and interactive question answering [2].

In this paper, we extend RIR to a multi-aspect retrieval setting, formulating what we call multi-aspect RIR (MA-RIR). In this problem, our goal is to retrieve relevant items for a multi-aspect query by using the reviews of multi-aspect items. Specifically, for an item with multiple aspects, we assume that each review describes at least one, and up to all, of the item's aspects.

As our primary contributions:

• We formulate the MA-RIR problem and identify failure modes of LF under imbalanced review-aspect distributions, considering imbalances due to both aspect frequency and the degree of aspect separation across reviews.
• We propose several novel aspect fusion strategies, which include LLM query extraction and reranking, to address failures of LF review-score aggregation on imbalanced multi-aspect review distributions.
• We leverage a recently released multi-aspect retrieval dataset, Recipe-MPR [3], with ground-truth query- and item-aspect labels to generate four multi-aspect review distributions with various aspect balance properties, and numerically evaluate the effect of review-aspect balance on MA-RIR.
• Our simulations show that for imbalanced data, Aspect Fusion can improve over LF by a MAP@10 increase from 0.36 ± 0.04 to 0.52 ± 0.04, while achieving equivalent performance for balanced data.
• We show that LLM reranking in both cross-encoder and zero-shot (ZS) listwise reranking settings can provide some improvements when given a large enough number of reviews, but risks decreasing performance when not enough reviews are provided.

2. Background

2.1. Neural IR

Given a set of documents 𝒟 and a query q ∈ 𝒬, an IR task IR⟨𝒟, q⟩ is to assign a similarity score S_{q,d} ∈ ℝ between the query and each document d ∈ 𝒟 and return a ranked list of top-scoring documents. The standard first-stage neural IR method [4] for a large corpus is to first use a bi-encoder g(·): 𝒬 ∪ 𝒟 → ℝ^m to map a query q and document d to their respective embeddings g(q) = z_q and g(d) = z_d. A similarity function f(·,·): ℝ^m × ℝ^m → ℝ, such as the dot product, is then used to compute a query-document score S_{q,d} = f(z_q, z_d). For web-scale corpora, exact similarity search for the top query-document scores is typically impractical, so approximate similarity search algorithms [5] are used instead.
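The first-stage scoring pipeline just described can be sketched in a few lines. This is a minimal NumPy illustration with toy 2-D vectors standing in for the bi-encoder embeddings g(q) and g(d) (a real system would use a trained encoder such as TAS-B and, at scale, approximate nearest-neighbor search rather than exhaustive scoring):

```python
import numpy as np

def rank_documents(z_q: np.ndarray, Z_d: np.ndarray, k: int) -> list:
    """Return indices of the top-k documents by dot-product similarity.

    z_q : (m,) query embedding; Z_d : (n, m) document embeddings.
    """
    scores = Z_d @ z_q                       # S_{q,d} = f(z_q, z_d), dot product
    return np.argsort(-scores)[:k].tolist()  # indices sorted by descending score

# Toy embeddings standing in for bi-encoder outputs.
z_q = np.array([1.0, 0.0])
Z_d = np.array([[0.9, 0.1],   # doc 0: close to the query
                [0.0, 1.0],   # doc 1: orthogonal to the query
                [0.5, 0.5]])  # doc 2: in between
print(rank_documents(z_q, Z_d, k=2))  # [0, 2]
```

The same scoring function is reused below for query-review and aspect-review similarities; only the choice of what is embedded changes.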
2.2. Reviewed-Item Retrieval

2.2.1. Problem Formulation

Information retrieval across two-level data structures was previously studied by Zhang and Balog [6]. Specifically, Zhang and Balog define the Object Retrieval problem, where (high-level) objects are described by multiple (low-level) documents. Given a query, the task is to retrieve high-level objects by using information in the low-level documents.

To investigate a special case of object retrieval where the goal is retrieving items (e.g., products, destinations) based on their reviews, Abdollah Pour et al. [1] recently proposed the Reviewed-Item Retrieval (RIR) problem. In the RIR⟨ℐ, 𝒟, q⟩ problem, there is a set of items ℐ, where every item i is a high-level object. Each item is described by a set of reviews (i.e., "low-level documents") 𝒟_i ⊂ 𝒟, and the r'th review of item i is d_{i,r} ∈ 𝒟_i. The main difference between RIR and Object Retrieval is that in RIR a low-level document d_{i,r} cannot describe more than one high-level object i, while Object Retrieval allows for more general two-level structures. Given a query q ∈ 𝒬 and a score S_{q,i} between q and each item i, the goal of RIR is to retrieve a ranked list L_q of top-K_I scoring items:

  L_q = (i_1, …, i_{K_I})  s.t.  i_1 ∈ arg max_i {S_{q,i}},   S_{q,i_k} ≥ S_{q,i_{k+1}}  ∀ i_k ∈ L_q.

2.2.2. Fusion

To get a query-item score S_{q,i} using an item's review set 𝒟_i, review information needs to be aggregated to an item level: this process is called fusion. Two alternatives exist for fusion [6]: if low-level information is aggregated before a query is used for scoring, it is called Early Fusion (EF) — in contrast, if the aggregation occurs after query-scoring, it is called Late Fusion (LF). For EF in RIR, Abdollah Pour et al. [1] experiment with mean-pooling and contrastive learning methods to create an item embedding z_i ∈ ℝ^m from review embeddings {z_d}_{d ∈ 𝒟_i}. They then directly compute the similarity between z_i and a query embedding z_q as the query-item score S_{q,i} = f(z_q, z_i). For LF in RIR, these authors first compute query-review similarity scores S_{q,d_{i,r}} = f(z_q, z_{d_{i,r}}). They then aggregate these scores into a query-item score S_{q,i} by averaging the top K_R query-review scores for each item:

  S_{q,i} = (1/K_R) ∑_{r=1}^{K_R} S_{q,d_{i,r}}.    (1)

Numerical evaluations performed for EF and LF for RIR demonstrate that EF has significantly worse performance than LF [1], and Abdollah Pour et al. conjecture that EF performs worse because it loses fine-grained review information before query-scoring. In contrast, by delaying fusion, LF preserves review-level information during query-scoring. Due to these findings, we do not study EF for MA-RIR; rather, we focus on developing Aspect Fusion as an extension of LF, discussed next.

3. Multi-Aspect Reviewed-Item Retrieval

3.1. Multi-Aspect Queries

This paper focuses on retrieving relevant items using their reviews for a multi-aspect query, such as "Can I have a meatball recipe that doesn't take too long?". We define a query aspect to be a sub-span of a multi-aspect query that represents a distinct topic (or facet) in the query, for instance the sub-spans "meatball" and "doesn't take too long" in the previous sentence. While there is ambiguity in identifying which sub-spans, if any, in a query should be considered aspects, this sub-span based definition is a simple way to represent aspects and is conducive to overlap-based evaluations of aspect extraction such as intersection-over-union (IOU). Formally, we denote the set of aspects in query q as 𝒜_q^query, where the j'th query aspect is a_{q,j}^query ∈ 𝒜_q^query. In this work, multi-aspect queries are assumed to be logical AND queries over all aspects, though an aspect itself can represent other logical operators such as XOR (e.g., a query aspect may be "chicken or beef"). Finally, we assume all query aspects are equally important — a further discussion of weighted multi-aspect retrieval can be found in Section 7.

3.2. Multi-Aspect Reviewed-Items

In addition to considering multi-aspect queries, we also consider multi-aspect items described by reviews. For instance, a multi-aspect item that is relevant to the multi-aspect query example above might be a recipe titled "Beef meatballs cooked in canned soup, ready in 25 minutes". However, since our goal is to isolate the properties of review-based retrieval, we assume that no such natural language (NL) item-level description is available. Instead, we assume that the item's aspects are described in reviews. Obviously, item-level descriptions (e.g., titles) are often available in practice, so a prime direction for future work is fusion across multiple levels of NL data during reviewed-item retrieval.

Figure 1: Two extremes of item aspect distributions, showing reviews for an item with aspects "meatballs" and "ready in 25 minutes": a) Fully overlapping (top) — each review mentions all item aspects. b) Fully disjoint with imbalanced aspect frequency (bottom) — no review mentions more than one aspect, and some aspects are mentioned much more frequently than others.

Examples of reviews describing the item in the previous paragraph, which has aspects "meatballs" and "ready in 25 minutes", are shown in Figure 1. In this paper, we assume that a review d_{i,r} must mention at least one item aspect a_{i,j}^item ∈ 𝒜_i^item and could mention up to all item aspects. Formally, the distribution of item aspects across reviews can be defined with a bipartite aspect distribution graph 𝒢 = {𝒟, 𝒜^item, ℰ}, where an edge (d_{i,r}, a_{i,j}^item) ∈ ℰ exists if review d_{i,r} ∈ 𝒟_i mentions aspect a_{i,j}^item ∈ 𝒜_i^item. We also let 𝒜_i^{rel,q} ⊆ 𝒜_i^item represent the set of item aspects that are relevant to a query and should be considered during retrieval. We define the MA-RIR⟨𝒜, ℰ, 𝒟, q⟩ problem as the task of retrieving a ranked list of relevant multi-aspect items L_q for a multi-aspect query q, where 𝒜 = 𝒜^item ∪ 𝒜^query.

3.3. Multi-Aspect Review Distributions

As we will demonstrate with numerical simulations on LLM-generated review data, understanding review distributions in terms of aspect frequency and degree of aspect separation between reviews is key to designing successful MA-RIR techniques. Figure 1 shows two extremes of aspect distributions that are among the distributions we explore in our experiments.

3.3.1. Fully Overlapping Distributions

Figure 1a) shows a fully overlapping aspect distribution where each review mentions all aspects — in this case, the bipartite graph 𝒢 (see the RHS of Figure 1) is fully connected for item i_1. This is the most balanced review-aspect distribution possible for an item, and, because of this "perfect" aspect balance, we postulate that aspect-agnostic retrieval approaches such as standard LF will perform competitively on such distributions.

3.3.2. Degree of Separation and Aspect Frequency

In contrast to the case of perfect review-aspect balance, Figure 1b) shows an extreme case of aspect imbalance. Firstly, one aspect is mentioned much more frequently than another — this is an aspect frequency imbalance. Secondly, each review mentions only one aspect — this is a maximal degree of separation of aspects across reviews (fully disjoint). Mathematically, 𝒢 has |𝒜_{i_1}^item| (disjoint) star components where some stars have a significantly higher degree than others. In the next section, we discuss the negative effects of imbalanced review-aspect distributions on LF performance on MA-RIR, and propose aspect fusion as a method for mitigating these negative effects.

4. Aspect Fusion for MA-RIR

4.1. Desiderata of Aspect Fusion

Recall that LF computes a query-item similarity score by averaging the top K_R query-review similarity scores using Equation (1). For MA-RIR, we propose two desiderata for the aspect distribution in the top K_R reviews during fusion.

Desideratum 1: Since we assume multi-aspect queries are AND queries, if an item contains 𝒜_i^{rel,q} relevant aspects for query q, the K_R reviews used for LF should mention all 𝒜_i^{rel,q} of those relevant aspects.

Desideratum 2: As mentioned in Section 3.1, we also assume all query aspects are equally important, which implies that aspect frequency should be identical for all 𝒜_i^{rel,q} aspects in the top K_R retrieved reviews.

In a fully overlapping distribution (Figure 1a) where each review mentions each aspect, both Desiderata 1 and 2 are guaranteed to be satisfied by any subset of item reviews. We thus argue that standard LF should be sufficient when reviews fully overlap in aspects, and focus on developing Aspect Fusion methods that address the failures of LF for imbalanced review-aspect distributions.

4.2. Failures of LF under Review-Aspect Imbalance

Standard LF will fail to achieve Desiderata 1 and 2 for review-aspect distributions with at least some degree of disjointedness and aspect frequency imbalance under the following assumptions.

Aspect Popularity Bias: Aspects that are reviewed more frequently are more likely to be mentioned in the top K_R reviews.

Embedding Bias: The non-isotropic nature of the embedding space [7] biases retrieval towards one aspect. Consider two equally sized and fully disjoint review subsets 𝒟_i^j ⊂ 𝒟_i and 𝒟_i^k ⊂ 𝒟_i in which reviews mention only a single aspect a_{i,j}^rel ∈ 𝒜_i^{rel,q} or a_{i,k}^rel ∈ 𝒜_i^{rel,q}, respectively, for some item i. If query-review similarity scores tend to be higher when a review describes aspect a_{i,j}^rel as opposed to aspect a_{i,k}^rel, LF will be more likely to select reviews from review set 𝒟_i^j for the top K_R fused reviews. For example, in Figure 1b), the reviews describing cooking time might be more likely to score higher with the full query than reviews describing "meatballs".

4.3. Aspect Fusion

To address these failures of LF on imbalanced data, we introduce several methods for Aspect Fusion, which explicitly utilize the multi-aspect nature of reviews during fusion to address multi-aspect queries.

4.3.1. Aspect Extraction

To extract aspects from queries, we propose to use few-shot (FS) prompting with an LLM. Though the number of query aspects is typically not known a priori, since we study multi-aspect queries, our proposed prompt (Figure 10 in the Appendix) asks that at least two non-overlapping sub-spans of the query be extracted as aspects. We represent the set of extracted query aspects for query q as 𝒜_q^ext and let A_q^e = |𝒜_q^ext|.

4.3.2. Aspect-Item Scoring

The key to Aspect Fusion is directly computing aspect-review similarity scores S_{a,d_{i,r}}, as opposed to similarity scores between reviews and a monolithic query, since the latter can be negatively impacted by review-aspect distribution imbalance. Aspect similarity scores are computed by separately embedding each extracted aspect a ∈ 𝒜_q^ext as z_a = g(a) and calculating S_{a,d_{i,r}} = f(z_a, z_{d_{i,r}}). Then, aspect-item scores S_{a,i} ∈ ℝ are obtained by aggregating the top K_R aspect-review scores via Eq. (1) with aspect-review scores instead of query-review scores. For each extracted aspect a, the top-K_I scoring items are ordered into a list

  L_a = (i_1, …, i_{K_I})  s.t.  i_1 ∈ arg max_i {S_{a,i}},   S_{a,i_k} ≥ S_{a,i_{k+1}}  ∀ i_k ∈ L_a.

Figure 2: a) Top: In (Monolithic) LF, the full query is scored against all reviews, and the top K_R query-review scores are averaged for each item to produce a query-item score. b) Bottom: Aspect Fusion extracts aspects (i.e., query sub-spans) from a query, performs LF with each aspect, and aggregates the resulting top-K_I item lists (i.e., one list per extracted aspect) into a final list.

Figure 2b) demonstrates aspect-item scoring and how it can alleviate the biases of standard LF. In this figure, the red and green points are embeddings of the reviews of item i describing aspect a_{i,1}^item and a_{i,2}^item, respectively — both these aspects are assumed to be relevant to the query. Though the former aspect is more frequent, an equal number (K_R) of reviews for each aspect will be used during score fusion — as long as the aspect review embeddings are similar enough to the relevant query aspect embedding, and the total number of reviews for an aspect is at least K_R. In contrast, Figure 2a) shows how standard (monolithic) LF will take a biased review sample of the first aspect since it is more frequently mentioned by reviews and z_q happens to be closer to those review embeddings. To differentiate between LF for RIR as proposed by Abdollah Pour et al. and Aspect Fusion, we will refer to LF as Monolithic LF since it uses the full query.

4.3.3. Aspect-Item Score Fusion

After aspect-item scoring, we must aggregate the A_q^e top-K_I item lists for each aspect {L_a}_{a ∈ 𝒜_q^ext} into a single ranked list of top-K_I items for the query, L_q. We examine six aggregation strategies, which can be categorized as four score aggregation methods and two rank aggregation methods. The score-based variants convert the A_q^e aspect-item scores into a query-item score S_{q,i} using

1. AMean: Arithmetic mean
2. GMean: Geometric mean
3. HMean: Harmonic mean
4. Min: Minimum

to return the final ranked list L_q. The two rank-based list aggregation methods are:

1. Borda: Borda count
2. R-R: Round-robin (interleaved) merge.

In Borda count, the score for a given item i is calculated as ∑_{j=1}^{A_q^e} (K_I − rank_i^{L_{a_j}} + 1), where rank_i^{L_{a_j}} is the rank of item i in list L_{a_j}. In a round-robin merge of A_q^e lists, elements from each list are merged in a cyclic order, and when a conflict arises with a particular item, that item is skipped and the merge continues from the same list.

4.4. LLM Reranking

In addition to Aspect Fusion, we also introduce an LLM reranking step for MA-RIR — to the best of our knowledge, LLM reranking has not been previously studied in a reviewed-item setting. Our goal is to understand whether LLMs in cross-encoder (CE) or ZS listwise [8] settings can fuse reviews of multi-aspect items for effective reranking. After a list L_q of top-K_I items is returned from the first stage, K_R reviews for each item need to be given to the LLM for what we call fusion-during-reranking. For Monolithic LF, these K_R reviews are simply the K_R reviews used for LF. For Aspect Fusion, since K_R reviews were used for fusion with each aspect, we propose to perform a round-robin merge of the top-K_R review lists for each aspect in order to preserve a balanced distribution of reviews across aspects.

For a CE, reviews are simply concatenated and cross-encoded with the query. For listwise reranking, our prompt provides the LLM with the query, the initial ranked list of item IDs, reviews for each item, and instructions to order the items based on relevance to the query — the full listwise reranking prompt is in Figure 11 in the Appendix.

5. Experimental Method

We perform simulations on generated review data to study the effect of aspect balance across reviews and test our hypothesis that Aspect Fusion is more robust to aspect imbalances than Monolithic LF. While using synthetic data exposes our results to biases from the data generation process, we are able to generate synthetic review distributions with far greater control than would have been possible several years ago, before the advent of LLMs. We specifically design experiments to study the performance of Aspect Fusion vs. Monolithic LF under the presence of aspect imbalance, both in the form of disjointedness of aspects across reviews and imbalanced aspect frequencies.

In order to perform our experimentation, we need a dataset that has (a) multi-aspect queries and items, (b) GT aspect labels, and (c) item reviews. To the best of our knowledge, there is no existing dataset with all of these properties. However, the recently-released Recipe-MPR dataset [3] includes properties (a) and (b). We leverage this dataset and generate item reviews using GPT-4.
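The per-aspect scoring and aggregation strategies of Sections 4.3.2–4.3.3 can be sketched as follows. This is a minimal illustration under toy similarity values, not the authors' implementation; `borda_fuse` is a hypothetical helper showing the Borda count over per-aspect item lists:

```python
import numpy as np

def late_fusion_score(sims, k_r: int) -> float:
    """Eq. (1): average the top-K_R similarity scores for one item."""
    return float(np.sort(np.asarray(sims))[::-1][:k_r].mean())

def aspect_fusion_score(aspect_sims, k_r: int, agg: str = "amean") -> float:
    """Fuse per-aspect LF scores S_{a,i} into one query-item score S_{q,i}."""
    s = np.array([late_fusion_score(a, k_r) for a in aspect_sims])
    if agg == "amean":   # arithmetic mean
        return float(s.mean())
    if agg == "gmean":   # geometric mean (assumes positive scores)
        return float(np.exp(np.log(s).mean()))
    if agg == "hmean":   # harmonic mean
        return float(len(s) / (1.0 / s).sum())
    if agg == "min":     # minimum
        return float(s.min())
    raise ValueError(f"unknown aggregation: {agg}")

def borda_fuse(ranked_lists, k_i: int) -> list:
    """Borda count: item i accumulates (K_I - rank + 1) over the aspect lists."""
    scores = {}
    for lst in ranked_lists:
        for rank, item in enumerate(lst, start=1):
            scores[item] = scores.get(item, 0) + (k_i - rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:k_i]

# Toy aspect-review similarities for one item with two extracted aspects,
# e.g. "meatball" (four reviews) and "doesn't take too long" (one review).
sims = [np.array([0.9, 0.9, 0.9, 0.9]), np.array([0.8])]
s_qi = aspect_fusion_score(sims, k_r=1, agg="amean")          # ≈ 0.85
top = borda_fuse([["A", "B", "C"], ["B", "A", "C"]], k_i=3)
```

Note how, with K_R = 1, the rare aspect contributes equally to the fused score even though it has only one review — exactly Desideratum 2.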
Table 1: Distribution of ground truth (GT) and LLM-extracted aspects for Recipe-MPR queries and items.

  # of Aspects          1    2    3    4   5  6  7  8
  Items (GT)           76  282   72   29  10  1  2  1
  Queries (GT)          0  294  103   14   0  0  0  0
  Queries (Extracted)   0    2  342   55  12  0  0  0

Figure 3: Monolithic LF versus Aspect Fusion with AMean aggregation. Both methods perform similarly on the fully overlapping dataset, but Aspect Fusion performs significantly better than Monolithic LF for the fully disjoint dataset and K_R < 30. For the fully disjoint dataset, Aspect Fusion drops in performance for K_R > 10 because when K_R exceeds the number of reviews per aspect, scoring is based on reviews that are irrelevant to the given aspect. This decline in performance does not apply in the fully overlapping case.

5.1. Data Generation

We create four datasets for our experiments based on the Recipe-MPR dataset and our new LLM-generated reviews. Firstly, the fully overlapping dataset includes 20 reviews per item, which each mention all of the aspects of the item. Secondly, the fully disjoint dataset includes 10 reviews for each aspect of a given item. We also modify the fully disjoint dataset to create two datasets with imbalanced aspect frequencies. In the one rare aspect dataset, we remove all but one of the reviews for a randomly-selected aspect of each item. In the one popular aspect dataset, we keep all ten reviews for only one randomly-selected aspect of each item, and keep only one review for the other aspects.

In order to generate reviews, the GT aspects for each correct item in Recipe-MPR were used to prompt GPT-4. The total number of items for which there were GT aspects is 473. The distribution of the number of aspects per query and item is shown in Table 1. On average, each item has 2.2 aspects. The prompts we used to generate the reviews are included in the Appendix.

Recipe-MPR contains logical AND queries with ground truth (GT) labels for the query aspects. Refer to Subsection 3.1 for an example of a query q and its GT aspects 𝒜_q^query. Since the focus of this paper is on MA-RIR, we only included the 411 queries whose associated correct item had at least two aspects. For each of these queries, we used two-shot examples to have GPT-4 extract "at least two non-overlapping spans" representing the relevant aspects in the query.

5.2. Experimental Details

For our query and review embeddings, we used TAS-B [9]. For the listwise reranking experiments, we used the gpt-3.5-turbo-16k model. For the CE reranking experiments, the model used was ms-marco-MiniLM-L-12-v2¹.

¹ https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2

6. Experimental Results

RQ1: Is Aspect Fusion helpful when item aspects are discussed disjointly across reviews?

Table 2 lists the mean average precision at 10 (MAP@10) and recall@10 (Re@10) of the stage 1 dense retrieval for various settings of K_R. The table is broken up according to whether the disjoint or overlapping reviews are used. Throughout this paper, we show results for K_I = 10. In our experiments we noticed that varying K_I led to minor changes in the results. For completeness, we report results for K_I = 5 in the Appendix.

We see that for the fully overlapping dataset, Aspect Fusion is approximately equivalent to the Monolithic LF approach, while for the fully disjoint dataset, the Aspect Fusion score aggregation approaches (arithmetic mean, harmonic mean, and geometric mean) offer a significant improvement in performance compared to the Monolithic LF approach. This pattern offers empirical evidence that Aspect Fusion is better suited to disjoint aspect distributions than Monolithic LF. More specifically, this suggests that Monolithic LF is not symmetrical across aspects, and fails to consider information from each of the aspects in a balanced way.

Additionally, for the fully disjoint dataset, the performance of the aspect-based approach suffers for K_R > 10. This can be explained by the fact that when K_R exceeds the number of disjoint reviews available for a given aspect (10 in this data), the aspect-based methods will score items based on reviews that are irrelevant to a given aspect. This could result in correct items receiving low scores for some aspects. We conclude that Aspect Fusion should use K_R ≤ R_i^{a,min}, where R_i^{a,min} is the smallest number of reviews for an item i for an aspect, in order to avoid this performance drop.

Furthermore, the fact that the score aggregation methods outperform the rank-based aggregation methods (R-R and Borda) offers evidence that the embedding similarity scores contain significant information about how well an item's reviews align with a given query aspect, above and beyond that item's rank relative to the other candidate items. Considering the simplicity and strong performance of AMean score aggregation, we focus on this Aspect Fusion method in the remaining results below.

RQ2: How does review aspect frequency imbalance affect Monolithic LF and Aspect Fusion?

Table 3 shows the performance of the stage 1 dense retrieval for the balanced frequency (fully disjoint) dataset and the two datasets with imbalance in the review aspect frequency. These results are also presented visually in Figure 4.
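For reference, the MAP@10 and Re@10 values reported in these tables can be computed per query as follows. This is a standard sketch, not the authors' evaluation code; in Recipe-MPR each query typically has a single correct item, so the relevant set is usually a singleton:

```python
def average_precision_at_k(ranked: list, relevant: set, k: int = 10) -> float:
    """AP@k: precision at each rank holding a relevant item, averaged over
    min(|relevant|, k). MAP@k is the mean of AP@k over all queries."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k) if relevant else 0.0

def recall_at_k(ranked: list, relevant: set, k: int = 10) -> float:
    """Fraction of relevant items appearing in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0

# With one correct item ranked 2nd: AP@10 = 1/2 and Re@10 = 1.
print(average_precision_at_k(["x", "gold", "y"], {"gold"}))  # 0.5
print(recall_at_k(["x", "gold", "y"], {"gold"}))             # 1.0
```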
Table 2: LF versus Aspect Fusion with six aggregation functions for both the Fully Disjoint and Fully Overlapping datasets, with 95% error margins in parentheses. Each cell is MAP@10 / Re@10.

Fully disjoint
  K_R       1                   2                   5                   10                  15                  30
  Mono LF   .41(.04)/.67(.05)   .41(.04)/.69(.04)   .42(.04)/.69(.04)   .41(.04)/.70(.04)   .43(.04)/.70(.04)   .41(.04)/.68(.05)
  AMean     .56(.04)/.77(.04)   .56(.04)/.77(.04)   .56(.04)/.78(.04)   .56(.04)/.76(.04)   .51(.04)/.75(.04)   .16(.03)/.24(.04)
  Borda     .38(.04)/.56(.05)   .39(.04)/.59(.05)   .38(.04)/.58(.05)   .37(.04)/.57(.05)   .35(.04)/.53(.05)   .13(.03)/.21(.04)
  GMean     .57(.04)/.77(.04)   .56(.04)/.77(.04)   .56(.04)/.78(.04)   .56(.04)/.77(.04)   .51(.04)/.75(.04)   .16(.03)/.24(.04)
  HMean     .57(.04)/.77(.04)   .57(.04)/.77(.04)   .57(.04)/.78(.04)   .56(.04)/.77(.04)   .52(.04)/.75(.04)   .16(.03)/.24(.04)
  Min       .43(.04)/.62(.05)   .01(.01)/.03(.02)   .01(.01)/.03(.02)   .01(.01)/.03(.02)   .01(.01)/.03(.02)   .01(.01)/.03(.02)
  R-R       .21(.03)/.66(.05)   .21(.03)/.66(.05)   .21(.03)/.66(.05)   .21(.03)/.66(.05)   .19(.02)/.66(.05)   .06(.02)/.22(.04)

Fully overlapping
  K_R       1                   2                   5                   10                  15                  30
  Mono LF   .50(.04)/.73(.04)   .51(.04)/.75(.04)   .51(.04)/.74(.04)   .51(.04)/.75(.04)   .50(.04)/.75(.04)   .50(.04)/.75(.04)
  AMean     .52(.04)/.74(.04)   .52(.04)/.76(.04)   .52(.04)/.77(.04)   .52(.04)/.75(.04)   .52(.04)/.75(.04)   .53(.04)/.74(.04)
  Borda     .33(.04)/.51(.05)   .33(.04)/.52(.05)   .34(.04)/.52(.05)   .34(.04)/.53(.05)   .33(.04)/.52(.05)   .33(.04)/.51(.05)
  GMean     .52(.04)/.75(.04)   .52(.04)/.76(.04)   .52(.04)/.77(.04)   .52(.04)/.76(.04)   .52(.04)/.76(.04)   .53(.04)/.75(.04)
  HMean     .52(.04)/.75(.04)   .52(.04)/.76(.04)   .52(.04)/.76(.04)   .52(.04)/.76(.04)   .53(.04)/.76(.04)   .52(.04)/.76(.04)
  Min       .32(.04)/.52(.05)   .01(.01)/.03(.02)   .01(.01)/.03(.02)   .01(.01)/.03(.02)   .01(.01)/.03(.02)   .01(.01)/.03(.02)
  R-R       .17(.02)/.61(.05)   .17(.02)/.60(.05)   .17(.02)/.61(.05)   .18(.02)/.63(.05)   .18(.02)/.63(.05)   .17(.02)/.62(.05)

Table 3: Effect of aspect frequency imbalance on Monolithic LF and Aspect Fusion. "Balanced frequency" refers to the fully disjoint dataset where all item aspects have the same number of reviews. The values in parentheses indicate the 95% error margin. Each cell is MAP@10 / Re@10.

  K_R                           1                   2                   5                   10                  15                  30
  Balanced Frequency, AMean     .56(.04)/.77(.04)   .56(.04)/.77(.04)   .56(.04)/.78(.04)   .56(.04)/.76(.04)   .51(.04)/.75(.04)   .16(.03)/.24(.04)
  Balanced Frequency, Mono LF   .41(.04)/.67(.05)   .41(.04)/.69(.04)   .42(.04)/.69(.04)   .41(.04)/.70(.04)   .43(.04)/.70(.04)   .41(.04)/.68(.05)
  One Popular Aspect, AMean     .52(.04)/.73(.04)   .43(.04)/.68(.05)   .33(.04)/.58(.05)   .27(.04)/.52(.05)   .02(.01)/.03(.02)   .01(.01)/.03(.02)
  One Popular Aspect, Mono LF   .36(.04)/.62(.05)   .32(.04)/.60(.05)   .28(.04)/.53(.05)   .26(.04)/.49(.05)   .27(.04)/.51(.05)   .27(.04)/.52(.05)
  One Rare Aspect, AMean        .52(.04)/.74(.04)   .45(.04)/.67(.05)   .36(.04)/.58(.05)   .34(.04)/.54(.05)   .15(.03)/.23(.04)   .06(.02)/.09(.03)
  One Rare Aspect, Mono LF      .39(.04)/.65(.05)   .36(.04)/.64(.05)   .32(.04)/.57(.05)   .30(.04)/.53(.05)   .32(.04)/.55(.05)   .29(.04)/.51(.05)

Figure 4: Effect of aspect frequency. Aspect Fusion performs better than Monolithic LF for low values of K_R, but suffers for higher values of K_R. This pattern is explained in the discussion of RQ1.

Figure 5: Aspect Fusion with GT vs. extracted query aspects with fully disjoint reviews. Although GT query aspects perform better, Aspect Fusion still offers an improvement over Monolithic LF with extracted query aspects.

Note that this imbalance can only be analyzed for the case where the reviews cover disjoint, rather than overlapping, aspects. Based on our conclusion above, we focus on the results for K_R = 1 in this section, since for the datasets with imbalanced review aspect frequency, R_i^{a,min} = 1. We see that there is a significant decrease in performance for all methods when aspect frequency imbalance is introduced. This result suggests that balance in reviews across aspects is helpful for both Monolithic LF and Aspect Fusion.

Furthermore, for K_R = 1, the performance of Monolithic LF decreases more than that of the Aspect Fusion methods when aspect frequency imbalance is introduced. For example, the MAP@10 of Monolithic LF decreased from 0.41 to 0.36 on the one popular aspect dataset, representing a 12% drop, compared to a 7% drop for the Aspect Fusion approach. This suggests Aspect Fusion methods may be more robust to aspect frequency imbalance.

Lastly, we note that the performance of Monolithic LF decreases as K_R grows large, which occurs because any relevant item aspects that are infrequently reviewed (there is only 1 review for rare aspects in these datasets) will contribute less and less to the query-item score with an increase in K_R.
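This dilution effect follows directly from Eq. (1). In the toy example below (hypothetical similarity values), an item has one on-aspect review among nine off-aspect ones, mirroring the one rare aspect setting where R_i^{a,min} = 1; once K_R exceeds the number of on-aspect reviews, low off-aspect scores are averaged in:

```python
import numpy as np

def lf_score(sims, k_r):
    """Eq. (1): mean of the top-K_R similarity scores."""
    return float(np.sort(np.asarray(sims))[::-1][:k_r].mean())

# One high-scoring on-aspect review, nine low-scoring off-aspect reviews.
sims = [0.9] + [0.1] * 9
print(lf_score(sims, k_r=1))  # 0.9  -- only the on-aspect review is used
print(lf_score(sims, k_r=5))  # ~0.26 -- off-aspect reviews dilute the score
```

Keeping K_R ≤ R_i^{a,min}, as recommended in RQ1, avoids this dilution for the aspect-based methods.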
𝐾𝑅                                       1                    2                    5                    10                   15                   30
Dataset            Method   Reranker  MAP@10    Re@10      MAP@10    Re@10      MAP@10    Re@10      MAP@10    Re@10      MAP@10    Re@10      MAP@10    Re@10
Fully disjoint     AMean    CE        .36 (.04) .77 (.04)  .51 (.04) .77 (.04)  .53 (.04) .78 (.04)  .53 (.04) .76 (.04)  .53 (.04) .75 (.04)  .16 (.03) .24 (.04)
                            LW        .40 (.04) .77 (.04)  .45 (.04) .77 (.04)  .53 (.04) .78 (.04)  .55 (.04) .76 (.04)  .52 (.04) .75 (.04)  .16 (.03) .24 (.04)
                            No        .56 (.04) .77 (.04)  .56 (.04) .77 (.04)  .56 (.04) .78 (.04)  .56 (.04) .76 (.04)  .51 (.04) .75 (.04)  .16 (.03) .24 (.04)
                   Mono LF  CE        .35 (.04) .67 (.05)  .38 (.04) .69 (.04)  .40 (.04) .69 (.04)  .44 (.04) .70 (.04)  .47 (.04) .70 (.04)  .47 (.04) .68 (.05)
                            LW        .33 (.04) .67 (.05)  .34 (.04) .69 (.04)  .38 (.04) .69 (.04)  .40 (.04) .70 (.04)  .42 (.04) .70 (.04)  .46 (.04) .68 (.05)
                            No        .41 (.04) .67 (.05)  .41 (.04) .69 (.04)  .42 (.04) .69 (.04)  .41 (.04) .70 (.04)  .43 (.04) .70 (.04)  .41 (.04) .68 (.05)
Fully overlapping  AMean    CE        .51 (.04) .74 (.04)  .52 (.04) .76 (.04)  .52 (.04) .77 (.04)  .51 (.04) .75 (.04)  .48 (.04) .75 (.04)  .50 (.04) .74 (.04)
                            LW        .43 (.04) .74 (.04)  .48 (.04) .76 (.04)  .55 (.04) .77 (.04)  .53 (.04) .75 (.04)  .53 (.04) .75 (.04)  .52 (.04) .74 (.04)
                            No        .52 (.04) .74 (.04)  .52 (.04) .76 (.04)  .52 (.04) .77 (.04)  .52 (.04) .75 (.04)  .52 (.04) .75 (.04)  .53 (.04) .74 (.04)
                   Mono LF  CE        .50 (.04) .73 (.04)  .50 (.04) .75 (.04)  .50 (.04) .74 (.04)  .50 (.04) .75 (.04)  .48 (.04) .75 (.04)  .50 (.04) .75 (.04)
                            LW        .45 (.04) .73 (.04)  .47 (.04) .75 (.04)  .52 (.04) .74 (.04)  .53 (.04) .75 (.04)  .53 (.04) .75 (.04)  .52 (.04) .75 (.04)
                            No        .50 (.04) .73 (.04)  .51 (.04) .75 (.04)  .51 (.04) .74 (.04)  .51 (.04) .75 (.04)  .50 (.04) .75 (.04)  .50 (.04) .75 (.04)

Figure 6: Comparison of reranking methods. Performance generally increases as more reviews are included in the LLM input; using too few reviews can hurt performance.

Figure 7: Ranks of correct items after Stage 1 Monolithic LF (x axis) and Stage 2 Cross-Encoder reranking (y axis) on the Fully Disjoint dataset. Circle size is proportional to position frequency, and the center of mass is shown in red. For 𝐾𝑅 = 1, most of the mass lies above the diagonal line, meaning the reranker has worsened performance. On the other hand, for 𝐾𝑅 = 30, most of the mass lies below the diagonal line, meaning that the reranker has improved performance.

RQ3: How does the use of extracted query aspects instead of GT query aspects affect Aspect Fusion?

Table 4 shows the same results as Table 2 except for the case of the extracted query aspects. These results are also presented visually in Figure 5. At 𝐾𝑅 = 1, while the MAP@10 of Aspect Fusion drops from 0.56 with GT aspects to 0.46 with extracted aspects, it remains higher than the 0.41 MAP@10 of Monolithic LF. This result implies that Aspect Fusion is useful even when GT query aspects are unknown.

RQ4: Are LLMs effective MA-RIR rerankers?

Table 5 summarizes the performance of the listwise² and cross-encoder rerankers. We see there is a beneficial effect from increasing the number of reviews 𝐾𝑅 given to the language model for both CE and listwise reranking. Specifically, for reranking Monolithic LF on the fully disjoint dataset, listwise MAP@10 improves from 0.33 at 𝐾𝑅 = 1 to 0.46 at 𝐾𝑅 = 30. Similarly, CE MAP@10 improves from 0.35 at 𝐾𝑅 = 1 to 0.47 at 𝐾𝑅 = 30. We conjecture this large increase in MAP@10 with 𝐾𝑅 is due to the quadratic nature of cross-attention across input text.

² Approximately 1% of queries had only 9 items returned by the listwise reranker instead of 10; this was an error in generative retrieval.

Since Aspect Fusion did best with low 𝐾𝑅 values, a possible reason that we did not observe any benefits of LLM reranking for Aspect Fusion is that 𝐾𝑅 was not high enough. Also, while some reranking settings showed 2nd stage MAP@10 increases over 1st stage values (such as 𝐾𝑅 = 30 reranking of Monolithic LF for fully disjoint data), when too few reviews were given to the reranker, the second stage sometimes made performance worse, such as at 𝐾𝑅 = 1.

Figure 7 shows a heatmap of the ranks assigned to the correct items by the stage 1 retriever and stage 2 reranker. An effective reranker would consistently improve the ranks for the correct item, and this would result in the center of mass lying below the diagonal. We see that this is indeed the case for a high value of 𝐾𝑅, but is not the case for a low value of 𝐾𝑅. The raw values underlying this figure are provided in the Appendix.

7. Related Work

7.1. Multi-level Retrieval

The most relevant work to ours is that on RIR by Abdollah Pour et al. [1], which formulates the RIR problem and studies EF and LF approaches. In addition to LF with an off-the-shelf bi-encoder such as TAS-B, the authors also contrastively fine-tune an encoder for LF and show performance improvements over off-the-shelf LF. Extending their contrastive learning approach to MA-RIR Aspect Fusion is a natural direction for future work. As mentioned in Section 2.2.1, Zhang and Balog [6] have previously studied the Object Fusion problem, which allows for more general two-level structures than RIR (in which a low-level document cannot describe more than one high-level object). However, they did not study neural techniques or multi-aspect retrieval, which are key to our work.

7.2. Multi-aspect Retrieval

In addition to releasing Recipe-MPR, which was used to generate review distributions in this work, Zhang et al. [3] use the queries and items in Recipe-MPR in a multi-aspect question-answering setting, and find that FS GPT-3 listwise prompting achieves far superior accuracy to all other methods. However, it is computationally infeasible to use such listwise prompting methods for first-stage retrieval. Kong et al. [10] consider multiple aspects when calculating relevance scores in dense retrieval, but assume documents and queries contain a fixed number of aspects from known categories. Similarly, the label aggregation method of Kang et al. [11] explicitly deals with multiple query aspects, but also assumes a fixed number of known categories. Another method, Multi-Aspect Dense Retrieval (MADRM) [10], learns early fusion embeddings of documents and queries by extracting and then aggregating their aspects, and reports improvements over Monolithic LF baselines. DORIS-MAE [12] presents a dataset that deconstructs complex queries into hierarchies of aspects and sub-aspects. Unlike our aspect extraction approach, which extracts aspects from queries using few-shot prompting with an LLM, DORIS-MAE predefines these aspects and their corresponding topic hierarchy for both queries and document corpora. Finally, some recent works study multi-aspect LLM-driven conversational recommendation [13], including work on preference elicitation over multiple aspects [14] and knowledge-graph-based topic-guided chatbots [15].

8. Conclusions

By extending reviewed-item retrieval (RIR) to a setting with multi-aspect queries and items, we were able to both theoretically and empirically demonstrate the failure modes of Monolithic Late Fusion (LF) when there is an imbalance in how aspects are distributed across reviews. Specifically, since Monolithic LF is aspect-agnostic, it is subject to a frequency bias in its review selection towards more popular aspects. Furthermore, the disjointness of aspects across reviews can induce a selection bias towards certain aspects if monolithic multi-aspect query embeddings are closer to review embeddings for those aspects.

To address these failure modes, we propose Aspect Fusion as a robust MA-RIR method for imbalanced review distributions. Using the recently released Recipe-MPR dataset, specifically designed to study multi-aspect retrieval, we design four generated datasets that allow us to empirically test the effects of review imbalances from aspect frequency and disjointness. Our experiments show that Aspect Fusion is much more robust to non-uniform review variations than Monolithic LF, outperforming the latter with a 44% MAP@10 increase on some distributions.

References

[1] M. M. Abdollah Pour, P. Farinneya, A. Toroghi, A. Korikov, A. Pesaranghader, T. Sajed, M. Bharadwaj, B. Mavrin, S. Sanner, Self-supervised contrastive BERT fine-tuning for fusion-based reviewed-item retrieval, in: European Conference on Information Retrieval, Springer, 2023, pp. 3–17. doi:10.1007/978-3-031-28244-7_1.
[2] S. Kemper, J. Cui, K. Dicarlantonio, K. Lin, D. Tang, A. Korikov, S. Sanner, Retrieval-augmented conversational recommendation with prompt-based semi-structured natural language state tracking, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '24, Association for Computing Machinery, New York, NY, USA, 2024. doi:10.1145/3626772.3657670.
[3] H. Zhang, A. Korikov, P. Farinneya, M. M. Abdollah Pour, M. Bharadwaj, A. Pesaranghader, X. Y. Huang, Y. X. Lok, Z. Wang, N. Jones, S. Sanner, Recipe-MPR: A test collection for evaluating multi-aspect preference-based natural language retrieval, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 2744–2753. doi:10.1145/3539618.3591880.
[4] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3982–3992. doi:10.18653/v1/D19-1410.
[5] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7 (2021) 535–547. doi:10.1109/TBDATA.2019.2921572.
[6] S. Zhang, K. Balog, Design patterns for fusion-based object retrieval, in: European Conference on Information Retrieval, Springer, 2017, pp. 684–690. doi:10.1007/978-3-319-56608-5_66.
[7] K. Ethayarajh, How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 55–65. URL: https://aclanthology.org/D19-1006. doi:10.18653/v1/D19-1006.
[8] X. Ma, X. Zhang, R. Pradeep, J. Lin, Zero-shot listwise document reranking with a large language model, 2023. arXiv:2305.02156.
[9] S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, A. Hanbury, Efficiently teaching an effective dense retriever with balanced topic aware sampling, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 113–122.
[10] W. Kong, S. Khadanga, C. Li, S. K. Gupta, M. Zhang, W. Xu, M. Bendersky, Multi-aspect dense retrieval, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 3178–3186. doi:10.1145/3534678.3539137.
[11] C. Kang, X. Wang, Y. Chang, B. Tseng, Learning to rank with multi-aspect relevance for vertical search, in: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM '12, Association for Computing Machinery, New York, NY, USA, 2012, pp. 453–462. doi:10.1145/2124295.2124350.
[12] J. Wang, K. Wang, X. Wang, P. Naidu, L. Bergen, R. Paturi, DORIS-MAE: Scientific document retrieval using multi-level aspect-based queries, in: Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Curran Associates Inc., Red Hook, NY, USA, 2024. doi:10.5555/3666122.3667790.
[13] Y. Deldjoo, Z. He, J. McAuley, A. Korikov, S. Sanner, A. Ramisa, R. Vidal, M. Sathiamoorthy, A. Kasirzadeh, S. Milano, A review of modern recommender systems using generative models (Gen-RecSys), in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '24, Barcelona, Spain, 2024.
[14] D. E. Austin, A. Korikov, A. Toroghi, S. Sanner, Bayesian optimization with LLM-based acquisition functions for natural language preference elicitation, in: Proceedings of the 18th ACM Conference on Recommender Systems, RecSys '24, 2024.
[15] K. Zhou, Y. Zhou, W. X. Zhao, X. Wang, J.-R. Wen, Towards topic-guided conversational recommender system, arXiv preprint arXiv:2010.04125 (2020).

A. Appendix

A.1. LLM Prompts

We provide the prompts used for overlapping review generation, disjoint review generation, query aspect extraction, and listwise reranking in Figures 8, 9, 10, and 11, respectively.

Figure 8: Overlapping Review Generation Prompt Used with GPT-4

Figure 9: Disjoint Review Generation Prompt Used with GPT-4

Figure 10: Query Aspect Extraction Prompt Used with GPT-4

Figure 11: Generic Listwise Reranking Prompt Used with GPT-3.5

Table 6: Stage 1 retriever performance for various aggregation functions and settings of 𝐾𝑅, with 𝐾𝐼 = 5. All methods except Mono LF include Aspect Fusion. The values in parentheses indicate the 95% error margin.
𝐾𝑅                            1                    2                    5                    10                   15                   30
Dataset            Method   MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5
Fully disjoint     AMean    .55 (.04) .71 (.04)  .55 (.04) .70 (.04)  .55 (.04) .70 (.04)  .55 (.04) .69 (.04)  .50 (.04) .67 (.05)  .16 (.03) .21 (.04)
                   Borda    .37 (.04) .49 (.05)  .38 (.04) .49 (.05)  .37 (.04) .48 (.05)  .36 (.04) .45 (.05)  .34 (.04) .44 (.05)  .13 (.03) .18 (.04)
                   GMean    .56 (.04) .71 (.04)  .55 (.04) .71 (.04)  .55 (.04) .71 (.04)  .55 (.04) .69 (.04)  .50 (.04) .67 (.05)  .16 (.03) .21 (.04)
                   HMean    .56 (.04) .71 (.04)  .56 (.04) .71 (.04)  .56 (.04) .71 (.04)  .55 (.04) .69 (.04)  .51 (.04) .68 (.05)  .16 (.03) .21 (.04)
                   Min      .42 (.04) .54 (.05)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)
                   Mono LF  .39 (.04) .56 (.05)  .39 (.04) .57 (.05)  .40 (.04) .57 (.05)  .40 (.04) .58 (.05)  .42 (.04) .61 (.05)  .39 (.04) .58 (.05)
                   R-R      .25 (.03) .53 (.05)  .25 (.03) .53 (.05)  .25 (.03) .54 (.05)  .26 (.03) .54 (.05)  .22 (.03) .47 (.05)  .07 (.02) .15 (.03)
Fully overlapping  AMean    .51 (.04) .67 (.05)  .51 (.04) .68 (.05)  .51 (.04) .67 (.05)  .50 (.04) .66 (.05)  .51 (.04) .66 (.05)  .52 (.04) .66 (.05)
                   Borda    .32 (.04) .45 (.05)  .32 (.04) .45 (.05)  .33 (.04) .44 (.05)  .33 (.04) .44 (.05)  .32 (.04) .43 (.05)  .32 (.04) .44 (.05)
                   GMean    .51 (.04) .67 (.05)  .51 (.04) .68 (.05)  .51 (.04) .66 (.05)  .50 (.04) .65 (.05)  .51 (.04) .66 (.05)  .51 (.04) .66 (.05)
                   HMean    .51 (.04) .66 (.05)  .51 (.04) .68 (.05)  .51 (.04) .66 (.05)  .51 (.04) .66 (.05)  .52 (.04) .66 (.05)  .51 (.04) .66 (.05)
                   Min      .31 (.04) .44 (.05)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)  .01 (.01) .01 (.01)
                   Mono LF  .49 (.04) .66 (.05)  .50 (.04) .66 (.05)  .50 (.04) .66 (.05)  .50 (.04) .67 (.05)  .49 (.04) .65 (.05)  .48 (.04) .65 (.05)
                   R-R      .20 (.03) .50 (.05)  .22 (.03) .50 (.05)  .22 (.03) .50 (.05)  .21 (.03) .50 (.05)  .21 (.03) .48 (.05)  .22 (.03) .49 (.05)

Table 7: Stage 1 retriever performance by review aspect frequency and settings of 𝐾𝑅, with 𝐾𝐼 = 5. “Balanced” refers to the fully disjoint dataset where all item aspects have the same number of reviews. The values in parentheses indicate the 95% error margin.

𝐾𝑅                             1                    2                    5                    10                   15                   30
Dataset             Method   MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5
Balanced frequency  AMean    .55 (.04) .71 (.04)  .55 (.04) .70 (.04)  .55 (.04) .70 (.04)  .55 (.04) .69 (.04)  .50 (.04) .67 (.05)  .16 (.03) .21 (.04)
                    Mono LF  .39 (.04) .56 (.05)  .39 (.04) .57 (.05)  .40 (.04) .57 (.05)  .40 (.04) .58 (.05)  .42 (.04) .61 (.05)  .39 (.04) .58 (.05)
One popular aspect  AMean    .51 (.04) .65 (.05)  .42 (.04) .58 (.05)  .31 (.04) .45 (.05)  .25 (.04) .39 (.05)  .01 (.01) .02 (.01)  .01 (.01) .02 (.01)
                    Mono LF  .35 (.04) .50 (.05)  .30 (.04) .46 (.05)  .26 (.04) .40 (.05)  .24 (.04) .38 (.05)  .25 (.04) .38 (.05)  .25 (.04) .38 (.05)
One rare aspect     AMean    .51 (.04) .65 (.05)  .43 (.04) .59 (.05)  .35 (.04) .48 (.05)  .33 (.04) .45 (.05)  .15 (.03) .20 (.04)  .06 (.02) .08 (.03)
                    Mono LF  .37 (.04) .53 (.05)  .34 (.04) .51 (.05)  .30 (.04) .45 (.05)  .28 (.04) .43 (.05)  .30 (.04) .45 (.05)  .28 (.04) .40 (.05)

A.2. Results for 𝐾𝐼 = 5

In the main body we showed various results of experiments where 𝐾𝐼 was set to 10. We found that varying 𝐾𝐼 within this order of magnitude had a very small effect on the results, and therefore did not include findings for other settings of 𝐾𝐼 above. For completeness, in this section we duplicate the preceding tables but use 𝐾𝐼 = 5 instead of 𝐾𝐼 = 10. See Tables 6, 7, 8, and 9 for these results.

A.3. Data for Figure 7

In Figure 7, we show the number of queries for which the correct item was ranked in a certain position by the stage 1 retriever and stage 2 reranker. The underlying data for this figure is shown in Table 10.

Table 8: Stage 1 retriever performance by whether labelled GT or extracted query aspects are used, with 𝐾𝐼 = 5. The values in parentheses indicate the 95% error margin.

𝐾𝑅                                  1                    2                    5                    10                   15                   30
                         Method   MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5
Extracted query aspects  AMean    .45 (.04) .60 (.05)  .46 (.04) .63 (.05)  .46 (.04) .62 (.05)  .45 (.04) .61 (.05)  .44 (.04) .62 (.05)  .14 (.03) .20 (.04)
                         Mono LF  .39 (.04) .56 (.05)  .39 (.04) .57 (.05)  .40 (.04) .57 (.05)  .40 (.04) .58 (.05)  .42 (.04) .61 (.05)  .39 (.04) .58 (.05)
GT query aspects         AMean    .55 (.04) .71 (.04)  .55 (.04) .70 (.04)  .55 (.04) .70 (.04)  .55 (.04) .69 (.04)  .50 (.04) .67 (.05)  .16 (.03) .21 (.04)
                         Mono LF  .39 (.04) .56 (.05)  .39 (.04) .57 (.05)  .40 (.04) .57 (.05)  .40 (.04) .58 (.05)  .42 (.04) .61 (.05)  .39 (.04) .58 (.05)

Table 9: Stage 2 reranker performance by reranking method and setting of 𝐾𝑅, with 𝐾𝐼 = 5. “No” refers to the case where no reranking is applied, and is equivalent to the stage 1 results. The values in parentheses indicate the 95% error margin.

𝐾𝑅                                       1                    2                    5                    10                   15                   30
Dataset            Method   Reranker  MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5       MAP@5     Re@5
Fully disjoint     AMean    CE        .41 (.04) .71 (.04)  .52 (.04) .70 (.04)  .53 (.04) .70 (.04)  .53 (.04) .69 (.04)  .53 (.04) .67 (.05)  .17 (.03) .21 (.04)
                            LW        .44 (.04) .71 (.04)  .48 (.04) .70 (.04)  .53 (.04) .70 (.04)  .55 (.04) .69 (.04)  .53 (.04) .67 (.05)  .16 (.03) .21 (.04)
                            No        .55 (.04) .71 (.04)  .55 (.04) .70 (.04)  .55 (.04) .70 (.04)  .55 (.04) .69 (.04)  .50 (.04) .67 (.05)  .16 (.03) .21 (.04)
                   Mono LF  CE        .35 (.04) .56 (.05)  .37 (.04) .57 (.05)  .38 (.04) .57 (.05)  .41 (.04) .58 (.05)  .46 (.04) .61 (.05)  .46 (.04) .58 (.05)
                            LW        .33 (.04) .56 (.05)  .34 (.04) .57 (.05)  .37 (.04) .57 (.05)  .37 (.04) .58 (.05)  .44 (.04) .61 (.05)  .43 (.04) .58 (.05)
                            No        .39 (.04) .56 (.05)  .39 (.04) .57 (.05)  .40 (.04) .57 (.05)  .40 (.04) .58 (.05)  .42 (.04) .61 (.05)  .39 (.04) .58 (.05)
Fully overlapping  AMean    CE        .51 (.04) .67 (.05)  .51 (.04) .68 (.05)  .51 (.04) .67 (.05)  .50 (.04) .66 (.05)  .49 (.04) .66 (.05)  .51 (.04) .66 (.05)
                            LW        .48 (.04) .67 (.05)  .50 (.04) .68 (.05)  .52 (.04) .67 (.05)  .51 (.04) .66 (.05)  .53 (.04) .66 (.05)  .53 (.04) .66 (.05)
                            No        .51 (.04) .67 (.05)  .51 (.04) .68 (.05)  .51 (.04) .67 (.05)  .50 (.04) .66 (.05)  .51 (.04) .66 (.05)  .52 (.04) .66 (.05)
                   Mono LF  CE        .50 (.04) .66 (.05)  .50 (.04) .66 (.05)  .50 (.04) .66 (.05)  .50 (.04) .67 (.05)  .48 (.04) .65 (.05)  .49 (.04) .65 (.05)
                            LW        .47 (.04) .66 (.05)  .48 (.04) .66 (.05)  .52 (.04) .66 (.05)  .53 (.04) .67 (.05)  .51 (.04) .65 (.05)  .51 (.04) .65 (.05)
                            No        .49 (.04) .66 (.05)  .50 (.04) .66 (.05)  .50 (.04) .66 (.05)  .50 (.04) .67 (.05)  .49 (.04) .65 (.05)  .48 (.04) .65 (.05)

Table 10: Ranks assigned to the correct items for the stage 1 retriever and stage 2 CE reranker with AMean aggregation Aspect Fusion.

                       Stage 2 correct item rank
Stage 1 rank      1    2    3    4    5    6    7    8    9   10
𝐾𝑅 = 1
  1              68   19   12    9    2    2    4    2    0    0
  2              12   20    8    1    3    4    1    3    0    0
  3               5    3    8    5    2    2    2    1    1    1
  4               1    2    2    4    1    4    2    0    1    0
  5               1    1    1    2    3    0    2    2    1    1
  6               0    0    0    1    0    0    1    1    1    1
  7               3    3    2    2    1    3    1    1    3    0
  8               1    0    2    0    1    0    1    0    4    0
  9               0    0    0    0    1    0    1    0    2    0
  10              0    1    0    1    1    1    2    1    0    1
𝐾𝑅 = 30
  1              83   21    5    5    0    2    1    0    0    0
  2              29   12    2    1    2    2    3    1    1    0
  3              12    5    3    3    2    0    1    2    0    0
  4               7    2    1    1    1    3    1    0    0    0
  5               7    2    7    5    0    0    1    1    0    0
  6               5    1    3    1    0    2    2    0    0    0
  7               3    1    1    1    0    1    0    0    0    1
  8               4    0    0    1    0    2    1    0    0    0
  9               0    2    1    0    1    1    0    1    0    0
  10              1    2    1    0    0    0    1    1    0    0
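For concreteness, the score-based aggregation strategies compared in Tables 6 and 9 (AMean, GMean, HMean, Min) can be sketched in pure Python. This is a minimal illustration under our own naming, not the authors' implementation: the cosine-similarity helper, function names, and toy vectors below are ours, and the rank-based aggregators (Borda, R-R) are omitted. Each query aspect is first scored against an item's reviews by late fusion (averaging the top-𝐾𝑅 similarities), and the per-aspect scores are then combined into a single query-item score:

```python
from math import sqrt, prod

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def aspect_score(aspect_emb, review_embs, k_r):
    """Late fusion for one query aspect: mean of the top-K_R aspect-review similarities."""
    sims = sorted((cosine(aspect_emb, r) for r in review_embs), reverse=True)
    return sum(sims[:k_r]) / min(k_r, len(sims))

# Score-based aggregators from Tables 6 and 9; GMean and HMean assume
# positive scores, and the rank-based Borda / R-R variants are omitted.
AGGREGATORS = {
    "AMean": lambda s: sum(s) / len(s),                   # arithmetic mean
    "GMean": lambda s: prod(s) ** (1.0 / len(s)),         # geometric mean
    "HMean": lambda s: len(s) / sum(1.0 / x for x in s),  # harmonic mean
    "Min":   min,                                         # worst-matched aspect
}

def aspect_fusion_score(aspect_embs, review_embs, k_r, agg="AMean"):
    """Fuse per-aspect late-fusion scores into a single query-item score."""
    per_aspect = [aspect_score(a, review_embs, k_r) for a in aspect_embs]
    return AGGREGATORS[agg](per_aspect)
```

Under this structure, a rare aspect covered by a single review keeps full weight in the per-aspect average, but raising 𝐾𝑅 dilutes its late-fusion score with weaker similarities; the collapse of Min in Table 6 for 𝐾𝑅 ≥ 2 is likewise consistent with a single weakly matched aspect determining the whole score.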