<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Joint Information Retrieval and Recommendation: a Reproducibility Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simone Merlo</string-name>
          <email>simone.merlo@phd.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guglielmo Faggioli</string-name>
          <email>guglielmo.faggioli@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padua</institution>
          ,
          <addr-line>Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Information Retrieval (IR) and Recommender Systems (RS) represent the core components of the information access scenario. These two categories of systems have traditionally been developed in isolation, with very limited interaction. However, significant connections between IR and RS have been recognized since the nineties, and systems performing retrieval and recommendation jointly have recently been developed, showing that joint IR and RS systems can improve the performance of both tasks. The current state of the art in the joint IR and RS field is represented by the Unified Information Access (UIA) framework. Driven by the importance of reproducibility, in this work we discuss the reproducibility, replicability, and generalizability of UIA. First, we analyse the degree to which UIA can be reproduced. Then, we focus on its replicability by studying its behaviour on a public dataset. Finally, we explore its generalizability by altering the data processing and training algorithms. The obtained results show that the performance of UIA and, more generally, of joint IR and RS systems may strongly depend on the dataset used for training and evaluation, and that its stability may vary depending on the task.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Recommender Systems</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Information Retrieval (IR) systems and Recommender Systems (RS) are traditionally thought of as
independent systems, even if their results are frequently merged to provide users with a comprehensive
answer to their needs. Nonetheless, Belkin and Croft [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] consider IR and RS as “two sides of the same
coin”. Indeed, both systems are mainly concerned with retrieving the most suitable piece of information
(documents for IR, items for RS) in a collection according to a request. However, some differences
remain, mainly concerning the input, which is represented by a textual query in IR and
by an item or a set of the user’s historical interactions in RS. In recent times, systems performing retrieval and
recommendation jointly have started to be developed. Indeed, Si et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Zamani and Croft [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] showed
that combining IR and RS models allows for improved performance, exploiting the knowledge held
by one model to enhance the other. Two major efforts in this direction are SRJGraph, proposed by Zhao et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
and the Unified Information Access (UIA) framework, developed by Zeng et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which represents
the state of the art. Nevertheless, the joint modeling of IR and RS is still in its early stages.
      </p>
      <p>
        Reproducibility is a fundamental aspect of both IR and RS, and it poses many challenges [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ]. For
this reason, and given the recent interest in joint IR and RS, we analyse the UIA framework [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] along
three axes: reproducibility (i.e., different team, same experimental setup), replicability (i.e., different
team, different experimental setup), and generalizability (i.e., different team, different experimental
setup, different task). In particular, the main goal is articulated in three research questions:
• RQ1 - Reproducibility: is the performance of UIA, reported in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], reproducible?
• RQ2 - Replicability: is the performance of UIA replicable on a publicly available dataset?
• RQ3 - Generalizability: is the performance of UIA stable when using alternative approaches
which are less computationally demanding and/or involve different data processing methods?
UIA was chosen since, as the state of the art in the joint IR and RS field, its architecture
may become a base for developing new systems. Moreover, UIA was originally trained and evaluated using
both a private (Lowe’s) and a publicly available (Amazon ESCI) dataset, but the bulk of the experiments
was conducted only on the private one, which also enabled more functionalities.
      </p>
      <p>
        In this paper we discuss the results of our empirical evaluation, which showed that UIA can be
reproduced and that the robustness and effectiveness of UIA depend on different factors, including the
training data and the training process. This work is an extended abstract of a previous ECIR
submission; the full version is available at [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The remainder of this work is organized as follows: in Section 2 we provide an overview of UIA; in
Section 3 we explain how we reproduced, replicated and generalized the framework; in Section 4 we
report and discuss the results obtained.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Highlights of the Reproduced Approach</title>
      <p>In this section we provide an overview of the UIA framework and of the Amazon ESCI dataset, which
is used for the reproducibility, replicability and generalizability study.</p>
      <sec id="sec-2-1">
        <title>2.1. The UIA Framework</title>
        <p>
          Zeng et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] summarize an interaction between the user and the UIA framework (Figure 1) with
three elements: an information access request ℛ, a task label (access functionality in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]) ℱ, and
a candidate information item ℐ. UIA supports three hybrid RS-IR tasks (functionalities in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]) ℱ: i)
Keyword Search (KS), where a short textual query is used to retrieve the most relevant items; ii) Query By
Example (QBE), where an input item is used to retrieve other similar items; and iii) Complementary Item
Recommendation (CIR), where an input item is used to retrieve items that “can be used together” (i.e.,
complementary). The information access request ℛ is task-dependent and corresponds to a keyword
query (KS) or an item (QBE or CIR). Finally, the candidate information item ℐ is a textual representation
of the item (i.e., its title, in the Amazon ESCI dataset) for which the system must estimate the relevance
to ℛ. Thus, given a task ℱ and a request ℛ, the goal of UIA, parametrized by 𝜃, is ranking all the items
ℐ in the catalogue based on a relevance score 𝑠 = 𝑓(ℛ, ℱ, ℐ; 𝜃). To accomplish this, UIA relies on
a bi-encoder architecture. Specifically, it employs a request encoder Eℛ and an item encoder Eℐ to
embed a request ℛ (jointly with the task label ℱ) and an item ℐ within a latent space. More in detail,
ℛ is encoded as R⃗ = Eℛ([CLS] ℛ [SEP] ℱ [SEP]), where ℱ is the label of the task associated with
the request, while [CLS] and [SEP] are the “class” and “separator” tokens, respectively. Similarly, ℐ is
encoded as ⃗I = Eℐ([CLS] ℐ [SEP]). The final representation is the embedding of the [CLS] token,
as typical in this context [
          <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
          ]. Both Eℛ and Eℐ employ the BERT [14] model to encode their
input. Finally, the score of ℐ with respect to ℛ is computed as 𝑠 = R⃗ · ⃗I. UIA is trained by minimizing
a cross-entropy loss function. Each training instance is a tuple (ℛ, ℱ, ℐ+, ℐ−), where ℐ+ and ℐ−
represent a positive and a negative example, respectively. The negative examples, not available in the
original dataset, are obtained with a two-phase negative sampling procedure. The first phase (Phase 1)
samples the negatives among the items retrieved by BM25 [15] in response to each request. The second
phase (Phase 2) employs the model trained on the Phase 1 data to embed the items in the space
and samples the negatives among the nearest neighbours of each item. The training procedure also
involves the usage of in-batch negatives and mini-batches.
        </p>
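        <p>To make the scoring concrete, the following minimal sketch mimics the bi-encoder: the hash-seeded embed function is only a stand-in for the BERT encoders, and all function names are illustrative, not taken from the UIA code base.</p>
        <preformat>
```python
import zlib
import numpy as np

def embed(text: str, dim: int = 16) -> np.ndarray:
    # Hash-seeded stand-in for a BERT [CLS] embedding, so the sketch runs
    # without model weights; UIA uses BERT here.
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def encode_request(request: str, task: str) -> np.ndarray:
    # Request encoder E_R: the task label F is appended after a [SEP] token.
    return embed(f"[CLS] {request} [SEP] {task} [SEP]")

def encode_item(item: str) -> np.ndarray:
    # Item encoder E_I: items carry no task label.
    return embed(f"[CLS] {item} [SEP]")

def rank_items(request: str, task: str, catalogue: list) -> list:
    # Score s = R . I for every item, then sort the catalogue by the score.
    r = encode_request(request, task)
    scored = [(item, float(r @ encode_item(item))) for item in catalogue]
    return sorted(scored, key=lambda p: p[1], reverse=True)

ranking = rank_items("cordless drill", "KS", ["drill bit set", "garden hose", "power drill"])
```
        </preformat>
        <p>Since the item encoder never sees the task label, the item embeddings can be computed once and reused across KS, QBE, and CIR requests.</p>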
        <p>
          Zeng et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] consider also a second training pipeline to handle users’ data and personalize the
output. Such pipeline requires accessing user’s personal data (i.e., previous interactions with the system
and preferences). However, we focus exclusively on non-personalized data (i.e., the Amazon ESCI
Dataset), thus we describe only the non-personalized part of the pipeline.
        </p>
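          <p>The two-phase negative sampling described above can be sketched as follows; lexical_score is a toy stand-in for BM25 and embed for the Phase-1-trained item encoder, so this illustrates only the selection logic, not the original implementation.</p>
          <preformat>
```python
import zlib
import numpy as np

def embed(text, dim=16):
    # Hash-seeded stand-in for the Phase-1-trained item encoder.
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def lexical_score(request, doc):
    # Toy stand-in for BM25: fraction of request terms appearing in the doc.
    terms = set(request.lower().split())
    return len(terms & set(doc.lower().split())) / max(len(terms), 1)

def phase1_negatives(request, positives, catalogue, k=2):
    # Phase 1: negatives come from the items a lexical retriever returns
    # for the request (BM25 in UIA), excluding the positives.
    ranked = sorted(catalogue, key=lambda d: lexical_score(request, d), reverse=True)
    return [d for d in ranked if d not in positives][:k]

def phase2_negatives(anchor_item, positives, catalogue, k=2):
    # Phase 2: negatives come from the nearest neighbours of the item in the
    # embedding space produced by the Phase-1-trained model.
    a = embed(anchor_item)
    ranked = sorted(catalogue, key=lambda d: float(a @ embed(d)), reverse=True)
    return [d for d in ranked if d not in positives and d != anchor_item][:k]

catalogue = ["power drill", "drill press", "garden hose", "hose reel"]
neg1 = phase1_negatives("power drill", ["power drill"], catalogue)
neg2 = phase2_negatives("power drill", ["power drill"], catalogue)
```
          </preformat>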
        <p>Code available at: https://anonymous.4open.science/r/UIAReproRepliGen-5CEE</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. The Amazon ESCI Dataset</title>
        <p>
          Zeng et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] trained and evaluated UIA on two datasets: the Lowe’s dataset and the Amazon ESCI
dataset [16]. The former is private and contains user data to enable personalization; the latter is public
but does not contain user data and thus does not allow training/testing the personalization module.
Due to its public availability, and given the lack of public joint IR and RS datasets, we focus exclusively
on the Amazon ESCI dataset. The Amazon ESCI dataset [16] was released in the context of the KDD
Cup 2022 Amazon ESCI challenge and is a large, multilingual dataset of difficult Amazon search
queries and results. In line with [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we consider the product catalogue and the training data used for
Task 2 of the challenge. The training data contains triplets (query, item, label), where the label is one
among: “Exact”, i.e., the item is an exact match for the query; “Substitute”, i.e., the item is related to the
query but not a match; “Complement”, i.e., the item is not relevant to the query but can complement
a relevant item; and “Irrelevant”. The ESCI dataset contains only textual queries and is thus unsuitable,
as is, for QBE and CIR; therefore, in line with [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we split the full dataset into three separate datasets, one
for each task. Specifically, we call 𝒬 the set of all the requests (queries), and ℰ(𝑞), 𝒮(𝑞), and 𝒞(𝑞)
the sets of items labelled “Exact”, “Substitute”, and “Complement” for query 𝑞, respectively. The
three task-specific datasets are defined as follows: (1) KS: {(𝑞, 𝑖) : ∀𝑞 ∈ 𝒬 ∧ 𝑖 ∈ ℰ(𝑞)}; (2) QBE:
{(𝑖1, 𝑖2) : ∀𝑞 ∈ 𝒬 ∧ 𝑖1 ∈ ℰ(𝑞) ∧ 𝑖2 ∈ 𝒮(𝑞)}; and (3) CIR: {(𝑖1, 𝑖2) : ∀𝑞 ∈ 𝒬 ∧ 𝑖1 ∈ ℰ(𝑞) ∧ 𝑖2 ∈ 𝒞(𝑞)}.
Following [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we further split each dataset into training (80%), validation (10%), and test (10%) sets.
The three datasets are used jointly during the training phase while, for evaluation, the performance is
measured separately on each test set.
        </p>
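        <p>The construction of the three task-specific datasets from the labelled triplets can be sketched as follows, on toy data; the set definitions follow the reconstruction above, and all names are illustrative.</p>
        <preformat>
```python
from collections import defaultdict

# Toy (query, item, label) triplets in the Amazon ESCI style.
triplets = [
    ("drill", "power drill A", "Exact"),
    ("drill", "power drill B", "Substitute"),
    ("drill", "drill bit set", "Complement"),
    ("drill", "lawn chair", "Irrelevant"),
    ("hose", "garden hose", "Exact"),
    ("hose", "hose reel", "Complement"),
]

def build_task_datasets(triplets):
    # Bucket the items of each query by label, then materialize the three
    # task-specific pair datasets defined above.
    E, S, C = defaultdict(list), defaultdict(list), defaultdict(list)
    buckets = {"Exact": E, "Substitute": S, "Complement": C}
    for query, item, label in triplets:
        if label in buckets:                  # "Irrelevant" items are dropped
            buckets[label][query].append(item)
    ks = [(q, i) for q in E for i in E[q]]
    qbe = [(i1, i2) for q in E for i1 in E[q] for i2 in S[q]]
    cir = [(i1, i2) for q in E for i1 in E[q] for i2 in C[q]]
    return ks, qbe, cir

ks, qbe, cir = build_task_datasets(triplets)
```
        </preformat>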
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Reproduction and Experimental Methodologies</title>
      <p>In this section, we detail the experiments to assess the reproducibility of UIA (RQ1); we then introduce
the analyses done to determine its replicability (RQ2), and conclude with the tests carried out to gauge
UIA’s generalizability (RQ3).</p>
      <sec id="sec-3-1">
        <title>3.1. RQ1: Reproducibility</title>
        <p>To reproduce UIA, we employed only publicly available datasets and the original code. We operated
independently of whether the original developers were available to share their knowledge with us,
to ensure unbiased results and to put ourselves in the most challenging reproducibility conditions. We
report here the challenges we identified in reproducing UIA and the solutions we used to address them.</p>
        <sec id="sec-3-1-1">
          <title>KDD Cup 2022: https://amazonkddcup.github.io/; original UIA code: https://github.com/HansiZeng/UIA</title>
          <p>Second sampling of the relevant items. While inspecting the available code base, we observed that,
after the dataset splitting (Section 2.2), a second sampling is executed. Specifically, for QBE and CIR, for
every unique query item (i.e., 𝑖1 in Section 2.2), 5 random relevant items are sampled (i.e., 5 instances
are added to the dataset). Similarly, for KS, each query is associated with only 10 relevant items. We
ascribe this difference between the original paper and the code to efficiency reasons. Moreover, this
second sampling prevents excessively large datasets and avoids giving too much weight to overly popular
queries or generic items. We maintain this implementation choice to ensure reproducibility.
Negative sampling procedure. The available code base shows some differences from the procedure
described in Section 2.1 concerning the Phase 1 negative sampling for the QBE task. Indeed, for QBE
the negatives are randomly sampled among all the items. We modified the code to sample the negatives
from the items retrieved by BM25 also for QBE.</p>
          <p>Another difference concerns the KS dataset. In detail, when sampling the negative examples for the
KS task during Phase 1, for each request-item pair in the dataset, the negative is sampled from the
items similar (i.e., labelled “Substitute”) to the one considered as positive, if present; otherwise from the items
complementary to the one considered as positive, if present; otherwise the negative is sampled using the
request and BM25. We preserved this aspect of the code.</p>
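          <p>The fallback chain used to pick a Phase 1 negative for a KS pair can be sketched as follows; the argument names are illustrative, and bm25_ranked stands in for the items BM25 retrieves for the request.</p>
          <preformat>
```python
import random

def ks_phase1_negative(substitutes, complements, bm25_ranked, rng=None):
    # Fallback chain observed in the code base:
    # Substitute items -> Complement items -> items retrieved by BM25.
    rng = rng or random.Random(0)
    if substitutes:
        return rng.choice(substitutes)
    if complements:
        return rng.choice(complements)
    return rng.choice(bm25_ranked)

neg_a = ks_phase1_negative(["sub item"], ["comp item"], ["bm25 item"])
neg_b = ks_phase1_negative([], [], ["bm25 item"])
```
          </preformat>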
          <p>
            Computational Resources. Due to limited computational resources, especially concerning GPU
memory, we reduced the batch size from 384 (used in [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]) to 48 (-86%) and we set the number of
epochs to 24 instead of 48. Other hyperparameters, such as the learning rate (7e-6) and the number of
warmup iterations (4,000), were left unchanged compared to the original paper.
3.1.1. Phase 1 Only
The double-phase training is computationally expensive, doubling the training time and cost (including
the environmental impact). Thus, we evaluate UIA after a single training phase. While we reasonably
expect a decrease in performance, we are interested in assessing whether this represents an
acceptable trade-off between effectiveness and efficiency. If it does, UIA could be employed in
resource-constrained environments, with limited cost and environmental impact.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. RQ2: Replicability</title>
        <p>
          Zeng et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] focus mostly on the Lowe’s private dataset, and only some of the analyses are carried out
on the Amazon ESCI dataset. Thus, concerning replicability, we are interested in extending the analysis
of UIA to the Amazon ESCI dataset by replicating on it the experiments done by Zeng et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] only on
the Lowe’s dataset.
3.2.1. No Task Label (w/o ℱ)
We seek to examine the role of the task label ℱ in UIA, to understand whether the framework is able to
recognize that there are three different tasks or whether it only learns from the huge amount of training data.
To do this, we modify UIA by removing ℱ (this version of the framework is referred to as “w/o ℱ”). This
makes it possible to treat the training data related to the different tasks as a unique training set.
        </p>
        <p>In detail, given its relatively large training data, the interchangeability of its input and output (i.e.,
both are items for QBE and CIR), and the similar nature of the tasks, we are interested in determining how
important ℱ is in correctly matching items to items. Furthermore, while KS uses queries as requests,
QBE and CIR use items. We are thus interested in verifying whether this aspect alone is sufficient to diversify
the two classes of tasks. Removing the task label ℱ, in practical terms, corresponds to modifying
Eℛ into E′ℛ s.t. R⃗′ = E′ℛ([CLS] ℛ [SEP]). Thus, for the candidate item ℐ, the new score is
computed as 𝑠′ = R⃗′ · ⃗I. In Figure 2 we highlight the portions that are removed (Figure 2a) and we
show the new architecture without the task label (Figure 2b).</p>
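        <p>In code, the “w/o ℱ” ablation amounts to dropping the task label from the request encoder input, as in this sketch, where embed is a hash-seeded stand-in for BERT and the names are illustrative.</p>
        <preformat>
```python
import zlib
import numpy as np

def embed(text, dim=16):
    # Hash-seeded stand-in for a BERT [CLS] embedding (no model weights).
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def encode_request(request, task=None):
    # Full UIA appends the task label F after a [SEP]; the "w/o F" ablation
    # drops it, so all three tasks share one request representation.
    if task is None:
        return embed(f"[CLS] {request} [SEP]")
    return embed(f"[CLS] {request} [SEP] {task} [SEP]")

with_ks = encode_request("cordless drill", "KS")
with_cir = encode_request("cordless drill", "CIR")
without = encode_request("cordless drill")
```
        </preformat>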
        <p>
          Figure 2: (a) Removing the task label from the UIA architecture; (b) the UIA architecture without the task label.
3.2.2. Isolated Tasks
Training and evaluating the framework on the tasks in isolation is equivalent to optimizing and evaluating three
separate instances of UIA, one for each task. Zeng et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] showed that, when the Lowe’s dataset
is used, UIA benefits from the joint training. The employment of the Amazon ESCI dataset, though,
implies deep changes in the architecture of the framework (i.e., the personalization part is removed).
Thus, we want to understand whether UIA still benefits from joint training when the Amazon ESCI dataset
is used and, therefore, when the personalization components are removed. To do this, we train the
framework on the tasks in isolation. For efficiency reasons, and in light of the results achieved by the
Phase 1 Only experiment (Section 3.1), we consider a single-phase training. Therefore, the results should
be compared with those obtained for the experiment Phase 1 Only.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. RQ3: Generalizability</title>
        <p>
          We describe here the experimental methodology we adopt to test the generalizability of UIA, i.e., its
resilience to major changes to its training procedure and training data.
3.3.1. Half QBE
Inspecting the generated datasets reveals that (after the sampling of Section 3.1) the training set for QBE
(1.07M tuples) is more than twice the KS one (452k tuples) and six times larger than the CIR one (184k
tuples). While not explicitly mentioned in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], this characteristic was also observed by Zeng et al.:
in the provided repository, some portions of code use only half of the QBE dataset, although the corresponding
results were not reported in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. To assess the generalizability of the approach, we test
the hypothesis that reducing (i.e., halving) the amount of data used for the QBE task does not
severely impact the final performance.
3.3.2. Early Split
The UIA task can be considered an example of “Knowledge Graph Completion”. The idea underlying
this task consists of predicting whether, given a relation 𝑟 and two entities ℎ and 𝑡, the head entity ℎ is in
relation 𝑟 with the tail entity 𝑡. For UIA, the head entity ℛ is either a query or an item, the relation ℱ
is one among KS, QBE, or CIR, and the tail entity ℐ is an item. The procedure to split the collection
into training, validation, and test sets adopted by Zeng et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] consists of randomly partitioning all
the possible triplets (ℛ, ℱ, ℐ) into the three sets. While this is frequent in the “Knowledge Graph
Completion” domain [17, 18, 19, 20, 21, 22], it has also been criticized [23, 24]. In particular, Akrami et al.
[23] criticize the so-called “Cartesian product relations”: relations such that, given a set of
subjects and objects, the relation is valid for all the Cartesian pairs between subjects and objects. If part
of these pairs ends up in the training set and part in the test set, this inflates the performance of the
knowledge graph completion algorithm. This occurs in the dataset used for UIA. Indeed, given a query
𝑞 of the Amazon ESCI dataset, its “Exact” items are related to all the corresponding “Substitute” and
“Complementary” items. We propose to modify this splitting procedure by dividing 𝒬 (the set of the
queries) into training, validation, and test sets and generating the triples only afterwards, using the
procedure proposed by Zeng et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (described in Section 2.2). This ensures that all the information
regarding a certain query is contained within the same partition. This also appears natural from a “temporal”
standpoint. Indeed, when a user issues a query it is possible to collect the training data up to that
instant, and the system is unaware of the user’s next query (i.e., the test). Thus, employing information
associated with training queries to test the model would correspond to predicting the past. Given this new
version of the datasets, we retrain the model and test its performance. For efficiency reasons, and in light
of the results achieved by the Phase 1 Only experiment (Section 3.1), we consider a single-phase training.
Therefore, the results should be compared with those obtained for the experiment Phase 1 Only.
        </p>
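        <p>The difference between the two splitting protocols can be sketched as follows; this is an illustrative reconstruction of the logic, not the code of either pipeline.</p>
        <preformat>
```python
import random

def tuple_level_split(tuples, test_frac=0.1, seed=0):
    # Original protocol: randomly partition the (R, F, I) tuples themselves,
    # so tuples about the same query can land in both training and test.
    rng = random.Random(seed)
    shuffled = list(tuples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def early_split(queries, test_frac=0.1, seed=0):
    # Proposed protocol: partition the query set Q first; the task tuples are
    # generated afterwards, separately inside each partition, so everything
    # about a query stays on one side of the split.
    rng = random.Random(seed)
    shuffled = sorted(queries)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return set(shuffled[:cut]), set(shuffled[cut:])

train_q, test_q = early_split([f"q{i}" for i in range(20)])
```
        </preformat>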
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <p>
        Here we discuss the outcomes of the reproducibility, replicability and generalizability experiments
introduced in Section 3. We evaluate our results considering MRR@10, nDCG@10 and Recall@50, in
line with [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Table 1 reports the original performance of UIA (first row) along with those of our
experiments.
      </p>
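      <p>As a reference for the discussion of the results, one of the reported measures, MRR@10, can be computed as in this sketch, which assumes a simple (ranked list, relevant set) layout per request; this is not the evaluation code of the paper.</p>
      <preformat>
```python
def mrr_at_10(runs):
    # runs: list of (ranked_item_ids, relevant_id_set) pairs, one per request.
    total = 0.0
    for ranked, relevant in runs:
        rr = 0.0
        for pos, item in enumerate(ranked[:10], start=1):
            if item in relevant:
                rr = 1.0 / pos   # reciprocal rank of the first relevant item
                break
        total += rr
    return total / len(runs)

score = mrr_at_10([(["a", "b", "c"], {"b"}), (["x", "y"], {"z"})])  # (1/2 + 0) / 2 = 0.25
```
      </preformat>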
      <sec id="sec-4-1">
        <title>4.1. RQ1: Reproducibility Results</title>
        <p>
The reproducibility results are reported in the second row of Table 1. The obtained performance is
very close to the original one for KS (-0.041 MRR, -0.033 nDCG and -0.049 Recall) and CIR (-0.027 MRR,
-0.034 nDCG and -0.035 Recall). These results appear satisfactory, considering that we were forced to
reduce the batch size and the number of epochs due to limited computing capabilities. In this regard, UIA achieves
satisfactory performance even under stronger resource constraints. In contrast, for QBE our results are
significantly higher, by 24% to 88% (+0.191 MRR, +0.175 nDCG and +0.130 Recall), than those reported
in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. We attribute this phenomenon to the changes in the Phase 1 negative sampling (from random to
BM25, see Section 3.1). We hypothesise that the results reported in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] represent a lower bound of
the actual performance UIA can achieve on the QBE task.
4.1.1. Phase 1 Only
Considering the model trained with a single phase (third row of Table 1), the performance drops. The
magnitude of the drop depends on the task. For KS it is minor (-0.014 MRR, -0.014 nDCG, -0.023 Recall),
suggesting that, for this task, the second training phase has a limited impact. For CIR the drop is larger
(-0.102 MRR, -0.099 nDCG, -0.073 Recall). QBE, instead, is the most harmed task (-0.148 MRR, -0.147
nDCG, -0.142 Recall). This suggests the importance of the hard negatives and of the additional training time for
the two most RS-oriented tasks. The drop in performance is not negligible, but neither are the computational
resources saved: from 240 hours of computation to 120.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. RQ2: Replicability Results</title>
        <p>
          4.2.1. No Task Label (w/o ℱ)
The results achieved by removing the task label are reported in the “w/o ℱ” row of Table 1. UIA behaves
consistently on the Amazon ESCI dataset w.r.t. the ablation study on the Lowe’s dataset reported in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
Again, the KS task appears to be the most stable (-0.011 MRR, -0.016 nDCG, and +0.006 Recall). The most
impacted task, instead, is CIR, with a loss of 37% in performance for MRR and nDCG (-0.180 MRR, -0.167
nDCG, and -0.112 Recall). This indicates that, when there is no distinction between the tasks, the model
is still able to operate on KS, while being less effective for QBE and CIR. A possible explanation
lies in the differences in term distribution between queries and items, used as input for the KS task and the QBE
and CIR tasks, respectively. Moreover, the difference in the training set sizes of QBE and CIR may explain
their different performance loss. Indeed, the QBE dataset is much larger than the CIR one (by 580%); thus,
during the training phase, it is “less harmful” for the model to optimize for the QBE task: this reflects
on the test performance, where the QBE task is handled better.
        </p>
        <p>The “w/o ℱ (Phase1Only)” row of Table 1 reports the results achieved by removing the task information
and training the framework only according to Phase 1. These must be compared with the ones of the
“Phase1Only” experiment. By looking at the results we can conclude that, for this experiment, the
framework behaves in the same way also when performing one training phase.
4.2.2. Isolated Tasks
The results achieved by the three instances of UIA, each optimized on a single task, are grouped in the
“IsolatedTasks” row of Table 1. The obtained performance shows that UIA performs better when
trained on single tasks than when jointly optimized, if the Amazon ESCI dataset is exploited. The KS
task shows the smallest improvement (+0.029 MRR, +0.027 nDCG, and +0.032 Recall compared to Phase1Only).
The QBE task has the second-biggest increase (+0.030 MRR, +0.030 nDCG, and +0.030 Recall). Finally,
CIR is the task that gains the greatest advantage (+0.053 MRR, +0.059 nDCG, and +0.019 Recall).</p>
        <p>
          This behaviour does not align with the results on the Lowe’s dataset discussed in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and in the other
studies on joint IR and RS [
          <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
          ]. Indeed, the Amazon ESCI dataset, unlike the Lowe’s one,
does not contain user data, leading to major differences also in UIA’s architecture. In detail, when
Amazon ESCI is employed, the personalization components must be removed from the framework.
Thus, these outcomes reveal how the benefits derived from joint training may depend on the
dataset, the data processing pipeline, and the architecture of the framework employed.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. RQ3: Generalizability Results</title>
        <p>For some generalizability experiments, as previously stated, we adopt a more ethical approach towards
IR research [25, 26, 27], training UIA with a single phase.
4.3.1. Half QBE
Halving the QBE training data highlights three patterns in the behaviour of UIA (“HalfQBE” row of
Table 1): i) for KS the performance increases (+0.019 MRR, +0.014 nDCG, and +0.020 Recall compared
to the reproduced UIA), suggesting that aligning the sizes of the datasets allows UIA to grasp more
knowledge from the KS instances; ii) for QBE the performance decreases (-0.126 MRR, -0.124 nDCG,
-0.112 Recall) due to the reduction of its training data; iii) for CIR the performance shows minor changes
(-0.008 MRR, -0.007 nDCG, +0.005 Recall), highlighting that the CIR training phase is not affected by the
training data used for QBE. This behaviour stems from the semantics of the tasks. Indeed, the
learning of KS and QBE is strongly correlated, since they both aim to retrieve items “similar” to the input
(query or item). Thus, the excessive amount of QBE data may overshadow KS. CIR, instead, expects as
output an item that is explicitly not similar; thus its training is likely disentangled from the other tasks.</p>
        <p>
          The “HalfQBE (Phase1Only)” row of Table 1 reports the results achieved using half of the data for QBE
and training UIA only according to Phase 1. These must be compared with the ones of the “Phase1Only”
experiment. By looking at the results we can notice that, for KS, nothing changes; for QBE the gap in
performance grows; while for CIR the slight decrease turns into a small increase in performance. However,
the same considerations made when performing both training phases apply.
4.3.2. Early Split
When we split the Amazon ESCI queries into training and test sets before constructing the datasets used
to train and test the framework, we observe negligible differences in performance (row “Early Split” of
Table 1), compared to “Phase1Only”, for KS. This stems from the fact that the KS dataset is obtained
from Amazon ESCI by selecting the appropriate entries, without further processing (unlike QBE and
CIR), since Amazon ESCI is an IR dataset based on real-world data. Thus, for KS, the independence
from the preprocessing pipeline leads to a more stable performance. In contrast, when the proposed
processing pipeline is used, the performance for QBE and CIR drastically drops, highlighting that there
exist scenarios in which UIA achieves unsatisfactory results. For example, if the model lacks prior
knowledge of an item and this item is used to query the system, then UIA is bound to fail. Moreover,
this provides valuable insights on how to train and test this class of models. Indeed, it would be more
informative to report results when the train-test split occurs both at the tuple level (as done in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]) and at
the query level (as proposed here), to obtain complementary information on what would happen if
the model were not able to learn from highly similar items (if not the item itself), or from the item in
relation to different ones. Finally, this should encourage the IR and RS research community, inspired by
previous work on “knowledge graph completion” evaluation, to investigate, develop and adopt adequate
evaluation protocols that effectively address corner cases and the various sides of the task.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>In this work we described the architecture of UIA and how the publicly available Amazon ESCI dataset
is processed and employed to optimize it. Moreover, our experiments allowed us to show that it is possible
to reproduce the performance of UIA on the Amazon ESCI dataset. In contrast, UIA’s behaviour is not
fully replicable, primarily because, when the Amazon ESCI dataset is used, the framework does not
benefit from joint training. By generalizing the framework, we also discovered that the dataset used,
and the way in which it is manipulated, have non-negligible consequences on the performance of UIA.
Finally, our empirical results show that, when the Amazon ESCI dataset is used, the KS task appears to
be the most robust, while the recommendation tasks are more vulnerable.</p>
      <p>Future work will focus on studying, employing, and enhancing the benefits derived from joint
training. This includes understanding how the novelties introduced with UIA can be reused to develop
innovative and robust systems. Moreover, we will continue our work on the analysis and
generalization of this framework. To this end, we will try to find (or create) and exploit publicly available
datasets that include user data and can be adapted to the joint retrieval and recommendation setting.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
        <p>This work has received support from CAMEO, PRIN 2022 n. 2022ZLL7MW.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Grammarly for grammar and spelling checking and for paraphrasing and rewording. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-8">
      <title>References (continued)</title>
      <p>[13] H. Choi, J. Kim, S. Joe, Y. Gwon, Evaluation of BERT and ALBERT sentence embedding performance on downstream NLP tasks, in: 25th International Conference on Pattern Recognition, ICPR 2020, Virtual Event / Milan, Italy, January 10-15, 2021, IEEE, 2020, pp. 5482–5487. URL: https://doi.org/10.1109/ICPR48806.2021.9412102. doi:10.1109/ICPR48806.2021.9412102.</p>
      <p>[14] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186. URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/V1/N19-1423.</p>
      <p>[15] S. E. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Found. Trends Inf. Retr. 3 (2009) 333–389. URL: https://doi.org/10.1561/1500000019. doi:10.1561/1500000019.</p>
      <p>[16] C. K. Reddy, L. Màrquez, F. Valero, N. Rao, H. Zaragoza, S. Bandyopadhyay, A. Biswas, A. Xing, K. Subbian, Shopping queries dataset: A large-scale ESCI benchmark for improving product search, CoRR abs/2206.06588 (2022). URL: https://doi.org/10.48550/arXiv.2206.06588. doi:10.48550/ARXIV.2206.06588. arXiv:2206.06588.</p>
      <p>[17] A. Bordes, X. Glorot, J. Weston, Y. Bengio, A semantic matching energy function for learning with multi-relational data - application to word-sense disambiguation, Mach. Learn. 94 (2014) 233–259. URL: https://doi.org/10.1007/s10994-013-5363-6. doi:10.1007/S10994-013-5363-6.</p>
      <p>[18] A. Bordes, J. Weston, R. Collobert, Y. Bengio, Learning structured embeddings of knowledge bases, in: W. Burgard, D. Roth (Eds.), Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2011, San Francisco, California, USA, August 7-11, 2011, AAAI Press, 2011, pp. 301–306. URL: https://doi.org/10.1609/aaai.v25i1.7917. doi:10.1609/AAAI.V25I1.7917.</p>
      <p>[19] D. Ayala, A. Borrego, I. Hernández, C. R. Rivero, D. Ruiz, AYNEC: all you need for evaluating completion techniques in knowledge graphs, in: P. Hitzler, M. Fernández, K. Janowicz, A. Zaveri, A. J. G. Gray, V. López, A. Haller, K. Hammar (Eds.), The Semantic Web - 16th International Conference, ESWC 2019, Portorož, Slovenia, June 2-6, 2019, Proceedings, volume 11503 of Lecture Notes in Computer Science, Springer, 2019, pp. 397–411. URL: https://doi.org/10.1007/978-3-030-21348-0_26. doi:10.1007/978-3-030-21348-0_26.</p>
      <p>[20] R. Socher, D. Chen, C. D. Manning, A. Y. Ng, Reasoning with neural tensor networks for knowledge base completion, in: C. J. C. Burges, L. Bottou, Z. Ghahramani, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013, pp. 926–934. URL: https://proceedings.neurips.cc/paper/2013/hash/b337e84de8752b27eda3a12363109e80-Abstract.html.</p>
      <p>[21] S. Mazumder, B. Liu, Context-aware path ranking for knowledge base completion, in: C. Sierra (Ed.), Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, ijcai.org, 2017, pp. 1195–1201. URL: https://doi.org/10.24963/ijcai.2017/166. doi:10.24963/IJCAI.2017/166.</p>
      <p>[22] Z. Sun, S. Vashishth, S. Sanyal, P. P. Talukdar, Y. Yang, A re-evaluation of knowledge graph completion methods, in: D. Jurafsky, J. Chai, N. Schluter, J. R. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Association for Computational Linguistics, 2020, pp. 5516–5522. URL: https://doi.org/10.18653/v1/2020.acl-main.489. doi:10.18653/V1/2020.ACL-MAIN.489.</p>
      <p>[23] F. Akrami, M. S. Saeef, Q. Zhang, W. Hu, C. Li, Realistic re-evaluation of knowledge graph completion methods: An experimental study, in: D. Maier, R. Pottinger, A. Doan, W. Tan, A. Alawini, H. Q. Ngo (Eds.), Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, ACM, 2020, pp. 1995–2010. URL: https://doi.org/10.1145/3318464.3380599. doi:10.1145/3318464.3380599.</p>
      <p>[24] M. Gardner, T. M. Mitchell, Efficient and expressive knowledge base completion using subgraph feature extraction, in: L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, Y. Marton (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, The Association for Computational Linguistics, 2015, pp. 1488–1498. URL: https://doi.org/10.18653/v1/d15-1173. doi:10.18653/V1/D15-1173.</p>
      <p>[25] H. Scells, S. Zhuang, G. Zuccon, Reduce, reuse, recycle: Green information retrieval research, in: E. Amigó, P. Castells, J. Gonzalo, B. Carterette, J. S. Culpepper, G. Kazai (Eds.), SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11-15, 2022, ACM, 2022, pp. 2825–2837. URL: https://doi.org/10.1145/3477495.3531766. doi:10.1145/3477495.3531766.</p>
      <p>[26] G. Spillo, A. D. Filippo, C. Musto, M. Milano, G. Semeraro, Towards sustainability-aware recommender systems: Analyzing the trade-off between algorithms performance and carbon footprint, in: J. Zhang, L. Chen, S. Berkovsky, M. Zhang, T. D. Noia, J. Basilico, L. Pizzato, Y. Song (Eds.), Proceedings of the 17th ACM Conference on Recommender Systems, RecSys 2023, Singapore, Singapore, September 18-22, 2023, ACM, 2023, pp. 856–862. URL: https://doi.org/10.1145/3604915.3608840. doi:10.1145/3604915.3608840.</p>
      <p>[27] G. Chowdhury, An agenda for green information retrieval research, Inf. Process. Manag. 48 (2012) 1067–1077. URL: https://doi.org/10.1016/j.ipm.2012.02.003. doi:10.1016/J.IPM.2012.02.003.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] S. Merlo, G. Faggioli, N. Ferro, A reproducibility study for joint information retrieval and recommendation in product search, in: Advances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6-10, 2025, Proceedings, Part IV, Springer-Verlag, Berlin, Heidelberg, 2025, pp. 130–145. URL: https://doi.org/10.1007/978-3-031-88717-8_10. doi:10.1007/978-3-031-88717-8_10.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] N. J. Belkin, W. B. Croft, Information filtering and information retrieval: Two sides of the same coin?, Commun. ACM 35 (1992) 29–38. URL: https://doi.org/10.1145/138859.138861. doi:10.1145/138859.138861.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Z. Si, Z. Sun, X. Zhang, J. Xu, X. Zang, Y. Song, K. Gai, J. Wen, When search meets recommendation: Learning disentangled search representation for recommendation, in: H. Chen, W. E. Duh, H. Huang, M. P. Kato, J. Mothe, B. Poblete (Eds.), Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23-27, 2023, ACM, 2023, pp. 1313–1323. URL: https://doi.org/10.1145/3539618.3591786. doi:10.1145/3539618.3591786.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] H. Zamani, W. B. Croft, Joint modeling and optimization of search and recommendation, in: O. Alonso, G. Silvello (Eds.), Proceedings of the First Biennial Conference on Design of Experimental Search &amp; Information Retrieval Systems, Bertinoro, Italy, August 28-31, 2018, volume 2167 of CEUR Workshop Proceedings, CEUR-WS.org, 2018, pp. 36–41. URL: https://ceur-ws.org/Vol-2167/paper2.pdf.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] H. Zamani, W. B. Croft, Learning a joint search and recommendation model from user-item interactions, in: J. Caverlee, X. B. Hu, M. Lalmas, W. Wang (Eds.), WSDM ’20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, ACM, 2020, pp. 717–725. URL: https://doi.org/10.1145/3336191.3371818. doi:10.1145/3336191.3371818.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] K. Zhao, Y. Zheng, T. Zhuang, X. Li, X. Zeng, Joint learning of e-commerce search and recommendation with a unified graph neural network, in: K. S. Candan, H. Liu, L. Akoglu, X. L. Dong, J. Tang (Eds.), WSDM ’22: The Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event / Tempe, AZ, USA, February 21-25, 2022, ACM, 2022, pp. 1461–1469. URL: https://doi.org/10.1145/3488560.3498414. doi:10.1145/3488560.3498414.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] H. Zeng, S. Kallumadi, Z. Alibadi, R. F. Nogueira, H. Zamani, A personalized dense retrieval framework for unified information access, in: H. Chen, W. E. Duh, H. Huang, M. P. Kato, J. Mothe, B. Poblete (Eds.), Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23-27, 2023, ACM, 2023, pp. 121–130. URL: https://doi.org/10.1145/3539618.3591626. doi:10.1145/3539618.3591626.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] N. Ferro, Reproducibility challenges in information retrieval evaluation, ACM J. Data Inf. Qual. 8 (2017) 8:1–8:4. URL: https://doi.org/10.1145/3020206. doi:10.1145/3020206.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] N. Fuhr, Some common mistakes in IR evaluation, and how they can be avoided, SIGIR Forum 51 (2017) 32–41. URL: https://doi.org/10.1145/3190580.3190586. doi:10.1145/3190580.3190586.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] M. F. Dacrema, S. Boglio, P. Cremonesi, D. Jannach, A troubling analysis of reproducibility and progress in recommender systems research, ACM Trans. Inf. Syst. 39 (2021) 20:1–20:49. URL: https://doi.org/10.1145/3434185. doi:10.1145/3434185.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] V. Karpukhin, B. Oguz, S. Min, P. S. H. Lewis, L. Wu, S. Edunov, D. Chen, W. Yih, Dense passage retrieval for open-domain question answering, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, Association for Computational Linguistics, 2020, pp. 6769–6781. URL: https://doi.org/10.18653/v1/2020.emnlp-main.550. doi:10.18653/V1/2020.EMNLP-MAIN.550.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] C. Sun, X. Qiu, Y. Xu, X. Huang, How to fine-tune BERT for text classification?, in: M. Sun, X. Huang, H. Ji, Z. Liu, Y. Liu (Eds.), Chinese Computational Linguistics - 18th China National Conference, CCL 2019, Kunming, China, October 18-20, 2019, Proceedings, volume 11856 of Lecture Notes in Computer Science, Springer, 2019, pp. 194–206. URL: https://doi.org/10.1007/978-3-030-32381-3_16. doi:10.1007/978-3-030-32381-3_16.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>