<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum (CLEF)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Elsevier at SimpleText: Passage Retrieval by Fine-tuning GPL on Scientific Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Artemis Capari</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hosein Azarbonyad</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgios Tsatsaronis</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zubair Afzal</string-name>
        </contrib>
        <aff>Elsevier, Amsterdam</aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>CLEF SimpleText Lab is centered around finding relevant passages from a large collection of scientific documents in response to a lay query, detecting and explaining difficult terminology within those passages, and finally simplifying the passages. The first task is similar to ad-hoc retrieval: given a topic/query, the goal is to retrieve relevant passages, but in addition to relevance, ranking models should assess documents based on their readability/complexity as well. This paper describes our approach to building a ranking model for the first task. We first evaluate the performance of several models on a proprietary test collection constructed from scientific documents across multiple science domains. Then, we fine-tune the best-performing model on a large collection of unlabelled documents using the Generative Pseudo Labeling (GPL) approach. Our key finding is that a bi-encoder model trained on the MS-Marco dataset and further fine-tuned on a large collection of unlabelled scientific passages achieves the highest performance on the proprietary dataset, which is specifically designed for the scientific passage retrieval task. Finally, fine-tuning a model in the same fashion, but using only the Computer Science queries from the test collection, proved successful for SimpleText Task 1.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Scientific Documents</kwd>
        <kwd>Domain Adaptation</kwd>
        <kwd>Scholarly Document Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Scientists and researchers employ specialized language and concepts to effectively communicate
information. Consequently, there is a substantial and growing volume of scientific concepts
and information within any given scientific field, which contributes to the challenges
scientists face in keeping pace with the expanding scope of technical concepts and novel content.
Understanding scientific documents is even more challenging for the general public. It has
been shown that the readability of scientific documents is decreasing over time [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This poses
challenges and opportunities for both researchers and publishers to think about ways to
increase the readability of complex scientific documents for a public audience.
      </p>
      <p>
        SimpleText Lab [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is specifically focused on addressing these challenges. The aim of
this lab is to first find passages relevant to users’ queries, spot and explain difficult terminology
within the relevant passages, and finally simplify each passage by re-writing it in a more readable
way. The first task in this lab’s series of tasks is a passage retrieval
task, namely “What is in (out)”, where the goal is, given a query/topic, to retrieve all passages
relevant to the query/topic that can be used to create a simplified summary around the topic.
In addition to relevance, ranking models should also consider the complexity of passages
when ranking them and prioritize less complex passages.
      </p>
      <p>
        The state-of-the-art ranking models are semantic matching models using either a
cross-encoder or bi-encoder architecture (or a combination of the two) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These models are trained on
publicly available datasets such as MS-Marco [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which do not contain scientific documents.
The retrieval task of the SimpleText lab and its underlying training/evaluation sets are
centered around scientific documents. Therefore, existing ranking models might not perform
very well in this setting, as the language of scientific documents is usually more complex and
such documents often contain highly specialized scientific terminology.
      </p>
      <p>In this paper we build our model on top of existing state-of-the-art ranking models.
To address the domain difference challenge, we use a domain adaptation technique, namely
Generative Pseudo Labeling (GPL), to fine-tune the pre-trained models on a set of unlabelled
scientific documents. To evaluate and fine-tune ranking models, we build a proprietary
test collection containing 5000 query-document pairs annotated with relevance labels. Our results
on this dataset show that a bi-encoder model fine-tuned on a large collection of unlabelled
scientific documents achieves stronger performance than its zero-shot counterpart. We
use this model to re-rank documents ranked by the Elasticsearch system. Our results show
that some of the fine-tuned models achieve better performance than the zero-shot models
on the SimpleText dataset as well. In the remainder of the paper, we briefly review related
work in Section 2, describe the technical details of the designed system in Section 3,
empirically evaluate the models in Sections 4 and 5, and conclude in Section 6 by outlining
some limitations of the current technical solution and providing pointers to future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Dense retrieval models are a type of information retrieval (IR) model that uses fixed-length
dense vector representations for both queries and documents, allowing for efficient
and accurate retrieval of relevant information from a large corpus of text by computing a
similarity score between query and document vectors. These models have been shown to
outperform traditional sparse retrieval models, such as BM25 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], in a variety of tasks, including
open-domain question answering and document ranking.
      </p>
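      <p>To make this concrete, the following Python sketch scores a toy corpus with a bi-encoder
from the sentence-transformers library; the model name, query, and passages are illustrative
placeholders rather than our experimental setup.</p>
      <preformat>
# Minimal dense-retrieval sketch: encode query and passages into fixed-length
# vectors and rank passages by cosine similarity (toy data, illustrative model).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("msmarco-distilbert-base-v4")

query = "domain adaptation for dense retrieval"
passages = [
    "Generative pseudo labeling adapts dense retrievers to new domains.",
    "BM25 is a classic sparse lexical ranking function.",
]

# Encode query and passages into dense vectors.
q_emb = model.encode(query, convert_to_tensor=True)
p_emb = model.encode(passages, convert_to_tensor=True)

# Rank passages by the cosine similarity of their vectors to the query vector.
scores = util.cos_sim(q_emb, p_emb)[0].tolist()
for passage, score in sorted(zip(passages, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {passage}")
      </preformat>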
      <p>Two popular types of such dense retrieval models are bi-encoders and cross-encoders. Both
have the same objective, i.e. capturing the semantic meaning of queries and
documents in dense vector representations, but they differ in the architecture of the neural network
used to learn those representations.</p>
      <p>
        Bi-encoders use two separate encoders to independently encode the query and the document
into dense vectors, which are then compared using a similarity function to produce a relevance
score. One of the most popular bi-encoders is the Dense Passage Retrieval (DPR) model [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. DPR
uses a two-stage retrieval process, in which a large set of candidate passages is first retrieved using sparse
techniques; a dense vector representation of each candidate passage is then computed
using a pre-trained language model such as BERT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The query is represented by a similar
dense vector. The passages are then ranked by the cosine similarity
between the query and passage vectors.
      </p>
      <p>
        Cross-encoders, however, use a single encoder to encode the query and document jointly into a
shared representation. Documents are then ranked based on a relevance score computed directly
from this joint representation. Cross-encoders can capture more complex interactions between
query and document. However, they are computationally more expensive, as they require a
forward pass for each query-document pair, while bi-encoders encode queries and documents
separately and can therefore reuse a single set of document embeddings for all queries [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Therefore, they are often only used as re-rankers
[
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref9">9, 10, 11, 12, 13, 14</xref>
        ].
      </p>
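      <p>As a sketch of this re-ranking pattern, the snippet below scores each query-passage pair
jointly with a publicly available cross-encoder; the model, query, and candidate passages are
illustrative placeholders.</p>
      <preformat>
# Cross-encoder re-ranking sketch: the model runs once per (query, passage)
# pair, which is accurate but costly, so it is applied only to a candidate set.
from sentence_transformers.cross_encoder import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do dense retrievers work"
candidates = [
    "Dense retrievers embed queries and documents into one vector space.",
    "The 2008 financial crisis reshaped banking regulation.",
]

# Score every pair jointly and sort candidates by descending relevance score.
scores = reranker.predict([(query, c) for c in candidates])
for passage, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {passage}")
      </preformat>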
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>To train and fine-tune our models, we first build a test collection using a set of scientific
documents. Then, we fine-tune existing ranking models using this dataset as well as a large
collection of scientific documents to make these models more suitable for retrieving scientific
passages.</p>
      <sec id="sec-3-1">
        <title>3.1. Test Collection</title>
        <p>
          To build a test collection, we select 100 queries spread across 20 different scientific
domains (Genetics and Molecular Biology, Computer Science, Economics, Agricultural and Biological
Sciences, Biochemistry, Econometrics and Finance, Toxicology and Pharmaceutical Science, Chemical
Engineering, Veterinary Science and Veterinary Medicine, Chemistry, Materials Science, Earth and
Planetary Sciences, Engineering, Food Science, Immunology and Microbiology, Mathematics, Nursing
and Health Professions, Medicine and Dentistry, Neuroscience, Pharmacology, Psychology, Physics
and Astronomy, Social Science). We select each query to be a known scientific concept for which
we can collect credible and relevant documents/passages. Once the queries are selected, we use
the well-known pooling mechanism to retrieve candidate documents to be annotated per query. We
select five different models (two lexical matching, two bi-encoders, and one cross-encoder) to
build the pool. These models are selected based on their performance on a small set and to ensure
the diversity of models (and hence the diversity of documents within the pool). We select 50
documents per query using the pooling approach. These documents are then labeled by domain
experts as “relevant”, “partially relevant”, or “non-relevant”. We use this dataset to evaluate
the performance of different ranking models.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. GPL</title>
        <p>
          Generative Pseudo Labeling (GPL) is an unsupervised domain adaptation method first introduced
in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The framework leverages a pre-trained generative model to
generate pseudo labels for target-domain data, which are then used to train a retrieval model
in a supervised manner. GPL outperforms existing unsupervised domain adaptation methods on
several benchmark datasets and achieves state-of-the-art performance in unsupervised domain
adaptation of dense retrieval. Considering that we intend to use and experiment with dense-retrieval
models, and that the importance of large amounts of data has often been highlighted in previous
work on dense retrieval methods [
          <xref ref-type="bibr" rid="ref3 ref6 ref7">7, 6, 3</xref>
          ], our manually annotated dataset might not suffice,
as it only consists of 5000 snippets for a set of 100 queries. However, many more
snippets and possible queries can be extracted from a large collection of unlabeled scientific
documents (research articles), which can be labeled for relevance by GPL in order to
fine-tune and adapt the existing ranking models to the scientific document retrieval task.
        </p>
        <p>We adapt GPL to our use-case by first removing the query generation part. Instead, we
select a set of known scientific concepts per domain and then, per concept, find all passages
mentioning the concept.</p>
        <p>An exact mention of a scientific concept in a document can be a strong indicator of the
document’s relevance to that concept. Therefore, per concept, each document mentioning it
is regarded as positive, and a bi-encoder is used to find negative documents per query.</p>
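        <p>The sketch below illustrates this mining step under toy assumptions: exact mentions of a
concept are taken as positives, and a bi-encoder supplies semantically close non-mentions as hard
negatives. The corpus, concepts, and selection logic are our illustration, not the exact
production code.</p>
        <preformat>
# Positive/negative mining sketch for the adapted GPL pipeline (toy data).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("msmarco-distilbert-base-v4")

corpus = [
    "Gradient boosting builds an ensemble of weak decision trees.",
    "Random forests average many decorrelated decision trees.",
    "CRISPR enables targeted genome editing.",
]
concepts = ["gradient boosting"]

corpus_emb = model.encode(corpus, convert_to_tensor=True)

triplets = []
for concept in concepts:
    # Passages that literally mention the concept are treated as positives.
    positives = [p for p in corpus if concept.lower() in p.lower()]
    # The bi-encoder ranks the corpus; close non-mentions become hard negatives.
    hits = util.semantic_search(model.encode(concept, convert_to_tensor=True),
                                corpus_emb, top_k=len(corpus))[0]
    negatives = [corpus[h["corpus_id"]] for h in hits
                 if concept.lower() not in corpus[h["corpus_id"]].lower()]
    for pos in positives:
        triplets.append((concept, pos, negatives[0]))
        </preformat>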
        <p>
          The GPL framework uses a cross-encoder as a teacher model on the collected positive and
negative documents to fine-tune the underlying bi-encoder model, thereby adapting
the bi-encoder to our scientific document ranking setting. For our use-case, we have
fine-tuned two different bi-encoders, msmarco-distilbert-base-v4 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] (MS-DB-v4) and
msmarco-distilbert-base-tas-b [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] (MS-DB-tas-b), using our whole test collection, spanning 20 different
scientific domains with 5 queries each. We found that msmarco-distilbert-base-tas-b
was most suitable for tasks that require understanding of a wide range of domains.
        </p>
        <p>However, as the SimpleText task aims at finding references in Computer Science, we have
also fine-tuned the aforementioned models on queries and articles from just the Computer
Science and Mathematics domains. Naturally, these models were fine-tuned on far less data
(see Table 1).</p>
        <p>Each of the models was fitted on pseudo labels created with ms-marco-MiniLM-L-6-v2, using
the Adam optimiser [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] with a learning rate of 2e-5 and 1000 warm-up steps.</p>
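        <p>The following sketch illustrates this distillation step with the sentence-transformers
library on a toy triplet. The teacher model, learning rate, and warm-up steps follow the setup
above; the student model and data are placeholders, and the library’s default AdamW optimizer
stands in for Adam.</p>
        <preformat>
# GPL-style distillation sketch: a cross-encoder teacher scores (query, pos)
# and (query, neg) pairs, and the bi-encoder student is trained with MarginMSE
# on the score margins (pseudo labels).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.cross_encoder import CrossEncoder

teacher = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
student = SentenceTransformer("msmarco-distilbert-base-tas-b")

triplets = [("graph neural networks",
             "Graph neural networks operate on graph-structured data.",
             "Enzymes catalyse biochemical reactions.")]

examples = []
for query, pos, neg in triplets:
    # Pseudo label: the teacher's score margin between positive and negative.
    margin = float(teacher.predict([(query, pos)])[0]
                   - teacher.predict([(query, neg)])[0])
    examples.append(InputExample(texts=[query, pos, neg], label=margin))

loader = DataLoader(examples, batch_size=32, shuffle=True)
loss = losses.MarginMSELoss(student)
student.fit(train_objectives=[(loader, loss)],
            epochs=1, warmup_steps=1000,
            optimizer_params={"lr": 2e-5})
        </preformat>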
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We applied our models in several settings before selecting the final 10 submitted runs.
Different variations of the best-performing models (on the proprietary test collection) were
selected to make the final submissions. As shown in Table 2, the rankings for runs
1-7 were retrieved by taking the top-k documents found for each of the 29 queries from
Simpletext_2023_task1_train.qrels by the Elasticsearch API. These were then
re-ranked using our fine-tuned models. The rankings for the first 4 runs were obtained with
the model that was fine-tuned only on Computer Science and Mathematics data, while we
used the model fine-tuned on all ScienceDirect domains for runs 5-7. For run 8, the top-500
documents were retrieved by searching for “query, topic”, and then re-ranked using our CS
fine-tuned model, again using “query, topic” as the query input. For run 9, we used the model
that performed best on our own test collection to search the entire corpus for each query, rather
than pre-filtering with Elasticsearch. Finally, for run 10, we used our best CS-trained model
once again, but searched per topic instead of per query.</p>
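      <p>As a sketch, the retrieve-then-re-rank pipeline used for these runs can be expressed as
follows; the Elasticsearch host, index name, field names, and the stand-in bi-encoder are
hypothetical placeholders for our setup.</p>
      <preformat>
# Run pipeline sketch: fetch top-k candidates from Elasticsearch, then re-rank
# them with a (fine-tuned) bi-encoder. Index/field names are placeholders.
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer, util

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("msmarco-distilbert-base-v4")  # stand-in model

def retrieve_and_rerank(query, k=1000):
    # Stage 1: lexical candidate retrieval with Elasticsearch.
    resp = es.search(index="simpletext_passages",
                     query={"match": {"text": query}}, size=k)
    passages = [hit["_source"]["text"] for hit in resp["hits"]["hits"]]
    # Stage 2: semantic re-ranking with the bi-encoder.
    q_emb = model.encode(query, convert_to_tensor=True)
    p_emb = model.encode(passages, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, p_emb)[0].tolist()
    return sorted(zip(passages, scores), key=lambda x: -x[1])
      </preformat>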
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>
        We selected our runs based on our own evaluation, which uses the qrels provided to us.
However, to our knowledge, these qrels are biased towards passages retrieved by Elasticsearch,
which is a lexical search method. Naturally, the recall of our semantic search models may
therefore be limited. As the test qrels used for the official evaluation are based
on pooling the submissions of 2023 participants [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], these qrels include passages from various
types of neural rankers as well as lexical matching models. Hence, the results from our own
evaluation differ from the official results. Nonetheless, they are included as they still provide
insight into our training process and the decisions behind selecting certain runs.
      </p>
      <sec id="sec-5-1">
        <title>5.1. Selecting Best Runs</title>
        <p>In this section, we describe the results of fine-tuning different ranking models on a large
collection of unlabeled documents using the GPL method.</p>
        <p>While ms-marco-distilbert-base-tas-b proved most suitable for fine-tuning on our use-case,
Table 3 shows that it underperforms its zero-shot equivalent on the train set. A possible
explanation could be the pooling bias or the shallow depth of the training set. To explain
this result and draw solid conclusions, we would need to evaluate the
performance of these models on an unseen test set. On the other hand, the fine-tuned
ms-marco-distilbert-base-v4 model outperforms its zero-shot version, which shows the
effectiveness of fine-tuning on the performance of this model.</p>
        <p>Furthermore, Figures 2, 3, and 4 show the performance of the GPL-based fine-tuned model
at different training steps for different configurations. As can be seen, distilbert-base-v4
fine-tuned on CS data and evaluated on the top-100 ES documents achieves significant
improvements with more training steps in the early stages of training, but the model
converges after 2 training steps.</p>
        <p>The converged model has significantly higher performance than the zero-shot version in
terms of most evaluation metrics. While distilbert-base-v4 improves with more training
steps, the same behavior is not observed for the distilbert-base-tas-b model. In fact, this model’s
performance steadily drops with more training steps. A more detailed analysis on a larger test
collection (with more queries and greater pooling depth) is required to explain this behavior of
the model.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Submitted Runs</title>
        <p>Table 5 shows the performance of the submitted runs on the training set using topics.
The performance of the models based on topics is similar to their query-based performance.
However, the model used to re-rank the top 5000 documents of the Elasticsearch system achieves
the highest performance in the topic-based evaluation.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Official Results</title>
        <p>As per Table 6, where the results are sorted by the primary measure, nDCG@10, we see that
our submitted runs (labeled “Elsevier”) dominate the top of the scoreboard.</p>
        <p>In particular, the highest performing result, run 8, was obtained by re-ranking the top-500
passages retrieved by Elasticsearch when searching for “query, topic” with
MS-DB-v4-GPL-CS, again searching with “query, topic”. The selection of configurations (see Table 2) for our
submissions was based on our own evaluation on the set of qrels provided to us, which indicated
that searching only for the query with MS-DB-v4-GPL-CS outperformed our best model for
the KAPR task: MS-DB-tas-b-GPL-all. However, this set might not have been representative of
SimpleText’s official evaluation set, as most of the other high-ranking results were obtained
with MS-DB-tas-b-GPL-all. For instance, run 7 can be directly compared with run 3, as they use
the same type of query input and the same type of corpus (i.e. top-1000 Elasticsearch results).
This also applies to run 5 versus run 2 and run 6 versus run 1. In each of these settings, the
tas-b model fine-tuned on our entire benchmark set outperformed the v4 model fine-tuned on
only the Computer Science portion of our test collection.</p>
        <p>This indicates that even for the SimpleText task, MS-DB-tas-b-GPL-all performs better than
MS-DB-v4-GPL-CS, and that the success of run 8 could thus be partly attributed to the fact that
it was the only run that used “query, topic” as its query input. Using MS-DB-tas-b-GPL-all with
“query, topic” might thus have outperformed our winning run. Nonetheless, these results show
that the model fine-tuned for our specific scientific passage retrieval task still generalizes well
to other datasets.</p>
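        <p>For reference, the primary measure, nDCG@10, can be computed over a run in standard TREC
format with pytrec_eval, as in the sketch below; the qrels and run shown are toy values, not our
data.</p>
        <preformat>
# nDCG@10 evaluation sketch with pytrec_eval (toy qrels and run).
import pytrec_eval

qrels = {"q1": {"d1": 2, "d2": 0, "d3": 1}}      # graded relevance labels
run = {"q1": {"d1": 1.2, "d3": 0.7, "d2": 0.1}}  # system scores

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
print(evaluator.evaluate(run))  # run ranks ideally here: {'q1': {'ndcg_cut_10': 1.0}}
        </preformat>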
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we designed several ranking models to address the document retrieval task of
the SimpleText lab. To this end, we first built a test collection containing 5000 query-document
pairs annotated with relevance labels. The documents in this test collection are extracted from
scientific documents, which makes it suitable for evaluating the performance of ranking models on
the scientific document retrieval task. We then evaluated the performance of existing ranking
models on this test collection and selected a few models based on their performance to build our
ranking models (used to create our SimpleText submissions). Since these models are trained on
generic datasets created for the ad-hoc document retrieval task, they might not perform strongly
on the specific task of scientific document retrieval. To address this issue, we used
a domain adaptation technique, namely Generative Pseudo Labeling (GPL), to fine-tune the
selected ranking models for the scientific document retrieval task by means of a large collection
of unlabeled scientific documents. Our results on the SimpleText training dataset show the
effectiveness of fine-tuning on the performance of our best ranking model. The distilbert-base-v4
model, fine-tuned using GPL on a large collection of documents in the Computer Science domain
and used to re-rank the top-500 documents retrieved by an Elasticsearch system using “query,
topic” as the query input, has the highest performance compared to the other fine-tuned models.
Using the relevance labels from Computer Science-related domains to fine-tune state-of-the-art
ranking models proved successful. However, as only a small portion of our test collection
consisted of Computer Science queries, future work could explore labeling a larger set of queries
in Computer Science-related domains to fine-tune a model in the same fashion.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Plavén-Sigray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Matheson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Schifler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <article-title>The readability of scientific texts is decreasing over time</article-title>
          ,
          <source>Elife</source>
          <volume>6</volume>
          (
          <year>2017</year>
          )
          <article-title>e27725</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>SanJuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Augereau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Overview of SimpleText - CLEF-2023 track on automatic simplification of scientific texts</article-title>
          , in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction.
          <source>Proceedings of the Fourteenth International Conference of the CLEF Association</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rücklé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models</article-title>
          ,
          <source>arXiv preprint arXiv:2104.08663</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , L. Deng,
          <article-title>Ms marco: A human generated machine reading comprehension dataset</article-title>
          , choice
          <volume>2640</volume>
          (
          <year>2016</year>
          )
          <fpage>660</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <article-title>Improving zero-shot retrieval using dense external expansion</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>59</volume>
          (
          <year>2022</year>
          )
          <fpage>103026</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oğuz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          , W.-t. Yih,
          <article-title>Dense passage retrieval for open-domain question answering</article-title>
          ,
          <source>arXiv preprint arXiv:2004.04906</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Passage re-ranking with bert</article-title>
          ,
          <source>arXiv preprint arXiv:1901.04085</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Document expansion by query prediction</article-title>
          ,
          <source>arXiv preprint arXiv:1904.08375</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Document ranking with a pretrained sequence-to-sequence model</article-title>
          ,
          <source>arXiv preprint arXiv:2003.06713</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <article-title>Cedr: Contextualized embeddings for document ranking</article-title>
          ,
          <source>in: Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1101</fpage>
          -
          <lpage>1104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Frieder</surname>
          </string-name>
          ,
          <article-title>Efficient document re-ranking for transformers by precomputing term representations</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] C. Li, A. Yates, S. MacAvaney, B. He, Y. Sun,
          <article-title>Parade: Passage representation aggregation for document reranking</article-title>
          ,
          <source>arXiv preprint arXiv:2008.09093</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] K. Wang, N. Thakur, N. Reimers, I. Gurevych,
          <article-title>Gpl: Generative pseudo labeling for unsupervised domain adaptation of dense retrieval</article-title>
          ,
          <source>arXiv preprint arXiv:2112.07577</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, A. Hanbury,
          <article-title>Efficiently teaching an effective dense retriever with balanced topic aware sampling</article-title>
          ,
          <source>in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] D. P. Kingma, J. Ba,
          <article-title>Adam: A method for stochastic optimization</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>