<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BioASQ 11B: Integrating Domain Specific Vocabulary to BERT-based Model for Biomedical Document Ranking.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maël Lesavourey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gilles Hubert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IRIT lab</institution>
          ,
          <addr-line>118 Route de Narbonne, F-31062 TOULOUSE CEDEX 9</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>In this paper we present the results obtained during BioASQ Task 11B Phase A on document ranking. We proposed a two-stage retrieval approach composed of a retriever and a reranker. The former is based on BM25 and developed with Pyserini. The latter is an implementation of a BERT cross-encoder named CEDR. It tackles the issue of input length limitation and takes advantage of the word embeddings in output of the model to compute a relevance score. We investigate a strategy to integrate biomedical thesaurus (MeSH) knowledge into this model.</p>
      </abstract>
      <kwd-group>
        <kwd>biomedical document ranking</kwd>
        <kwd>information retrieval</kwd>
        <kwd>thesaurus-based knowledge</kwd>
        <kwd>BERT cross-encoder</kwd>
        <kwd>multi-stage retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        compounds and biomedical concepts supported by literature” [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The links are built using article metadata, combining a thesaurus and different ontologies of the biomedical domain.
In addition, it is possible to retrieve the list of articles that support each link.
      </p>
      <p>We intend to extend FORUM by taking advantage of the textual content of the articles and providing suggestions of priority articles for each chemical compound – biomedical concept pair. The whole system will be a multi-stage retrieval approach in which the knowledge graph plays the role of a first-stage retriever. The main difference with BioASQ Task B is that the first-stage retrieval is already completed and the articles to prioritize are necessarily within the retrieved list. Despite this difference, we can still benefit from BioASQ benchmarks to evaluate document ranking approaches.</p>
      <p>
        In this paper we present the system used for Task 11B Phase A on document ranking, based on a retrieve-and-rerank strategy. We applied a first-stage retrieval based on BM25 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which is a straightforward yet effective approach. In order to rerank the documents, we used CEDR-KNRM [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a BERT-based model for full-length documents that combines the [CLS] token representation and the contextual embeddings of the query and document terms. We propose to integrate thesaurus knowledge into this model via a marking strategy.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and Method</title>
      <p>
        2.1. System description
Our method for addressing this task is composed of two main modules: a retriever and a reranker. This is a common approach, which consists in decomposing document ranking into several parts [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The retriever aims at creating a smaller candidate list (hundreds of documents) from the whole corpus (generally more than ten million documents). It is usually based on bag-of-words (BoW) representations [7], which are less effective than deep contextualized representations but drastically reduce computational cost. The candidate list is then reranked using a high-performance model. State-of-the-art models for this task are transformer-based methods like BERT [8], especially when using a cross-encoder architecture where the query and candidate document are passed together as input. Computing a relevance score is usually done by adding a single linear layer on top of the model output.
      </p>
      <p>As mentioned in the introduction, our goal is to develop a robust and effective reranker. For our retriever, we decided to take advantage of a well-known system rather than proposing a new approach. We used Pyserini [9], a Python library for reproducible information retrieval research. We created our own indices with PubMed articles and used its BM25 implementation to retrieve the 500 articles most likely to answer the query.</p>
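      <p>To make the retriever's scoring function concrete, here is a minimal pure-Python sketch of BM25. It only illustrates the ranking formula from [4]; our runs used Pyserini's Lucene-based implementation, and the k1 and b values below are illustrative defaults rather than our exact configuration.
```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each document in docs against the query with BM25 (sketch)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # document frequency of each term
    df = Counter()
    for d in tokenized:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # unseen term contributes nothing
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            num = tf[term] * (k1 + 1)
            den = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores
```
In the real pipeline the scores are computed over an inverted index rather than by scanning every document, and the top 500 documents are kept as the candidate list.</p>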
      <p>Classical BERT approaches for text ranking take the query and candidate document as input in the following sequence: [CLS] query [SEP] document [SEP]. The system is trained to predict whether a document is relevant by using the [CLS] token embedding. Documents are ranked by their probability of being relevant to a given query. We identify several challenges of common approaches regarding Task 11B Phase A. BERT produces contextual embeddings for each input word along with the [CLS] token representation; taking advantage of these embeddings to compute the probability of a document being relevant could extend the classical approach. Another limitation is the maximum number of tokens that BERT can handle at once, i.e., 512. This limit is easily exceeded with scientific publications, even if we consider only their title and abstract. Finally, we would like to investigate whether a BERT-based model can benefit from the knowledge of biomedical structured vocabularies. We illustrate the pipeline used for our method in Figure 1.</p>
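      <p>The splitting of an overly long document can be sketched as follows; this is a simplified, word-level illustration with a hypothetical helper name (a real implementation works on subword token ids):
```python
def chunk_pairs(query_tokens, doc_tokens, max_len=512):
    """Split doc_tokens so each [CLS] query [SEP] chunk [SEP] sequence
    fits in BERT's max_len window (illustrative sketch)."""
    # token budget left for the document after the query and special tokens
    budget = max_len - len(query_tokens) - 3
    chunks = []
    for start in range(0, len(doc_tokens), budget):
        chunk = doc_tokens[start:start + budget]
        chunks.append(["[CLS]"] + query_tokens + ["[SEP]"] + chunk + ["[SEP]"])
    return chunks
```
Each resulting sequence can then be passed to the cross-encoder independently.</p>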
      <p>
        We decided to base our method on the Contextualized Embeddings for Document Ranking (CEDR) model, as it addresses two of these challenges: input length limitation and the use of term embeddings. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed to keep the query as a whole and to divide overly long documents into chunks of the same length, then apply a cross-encoder to each query–chunk pair. The final [CLS] representation is obtained by average pooling the [CLS] vectors of these pairs. Moreover, they created a similarity matrix between the query and document by concatenating the outputs of the cross-encoders. They passed the similarity matrix along with the classification token as input to three neural ranking models. The best performance was obtained with K-NRM [10], a kernel-based ranking model that takes advantage of word embeddings and soft-match features to produce a ranking score.
      </p>
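      <p>The kernel pooling at the heart of K-NRM can be sketched as follows: each Gaussian kernel counts, softly, how many query–document term pairs have a similarity close to its mean. This is an illustrative pure-Python version with arbitrary kernel means and width; the real model learns a scoring layer on top of such features.
```python
import math

def knrm_features(sim_matrix, mus=(-0.5, 0.0, 0.5, 1.0), sigma=0.1):
    """Kernel pooling over a query x document similarity matrix (sketch)."""
    features = []
    for mu in mus:
        pooled = 0.0
        for row in sim_matrix:  # one row of similarities per query term
            # soft-TF: Gaussian kernel centred on mu counts nearby matches
            k = sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) for s in row)
            pooled += math.log(max(k, 1e-10))  # guard against log(0)
        features.append(pooled)
    return features
```
The kernel centred on 1.0 acts as an exact-match counter, while the others capture progressively softer matches.</p>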
      <p>We implemented a version of the CEDR model using the K-NRM algorithm to create a baseline against which to evaluate our work. We propose to extend this method by taking advantage of the Medical Subject Headings (MeSH, https://www.nlm.nih.gov/mesh/meshhome.html) thesaurus, a controlled vocabulary produced by the National Library of Medicine (NLM, https://www.nlm.nih.gov/) to index every citation appearing in MEDLINE/PubMed. Some works [11] highlighted the significant effect of input sequence organisation on the performance of BERT-based systems. [12] proposed to exploit exact term matches between the query and the document by marking the input of a BERT cross-encoder. We propose to integrate MeSH thesaurus knowledge in the same way. Given a query and a document, we implemented a marking strategy for biomedical terms, assuming that a word referenced in the MeSH thesaurus carries more important information. Differing from [12], we limited the marking to the main headings and entry terms referenced in MeSH. We used the same tag for a main heading and its entry terms in order to highlight not only exact matches but also synonyms. An example of the marking strategy is given in Figure 2.</p>
      <sec id="sec-2-1">
        <title>2.2. Data</title>
        <p>We worked with the PubMed Annual Baseline for 2023 (https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/), from which we removed all citations that do not contain an abstract. We also worked with the dataset provided by BioASQ (http://participants-area.bioasq.org/datasets/), which contains all the questions of the previous editions along with their gold standards (relevant articles).</p>
        <p>We built a training set composed of relevant and irrelevant articles for each query. The relevant articles are the ones given by BioASQ. Irrelevant articles were chosen from the top 100 articles retrieved by BM25 for each query. We selected twice as many irrelevant articles as relevant ones. Indeed, [13] showed that randomly choosing negative samples leads to worse results than selecting hard negatives, so we selected negative samples close to the positive ones in terms of BoW representations.</p>
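        <p>The negative sampling described above can be sketched as follows, assuming a simple token-overlap measure of BoW closeness; the helper name and the overlap measure are illustrative choices, not our exact training code.
```python
from collections import Counter

def select_hard_negatives(relevant, candidates, ratio=2):
    """Keep the non-relevant candidates whose bag of words is closest
    to the relevant articles, at ratio negatives per positive (sketch)."""
    pos_bow = Counter()
    for doc in relevant:
        pos_bow.update(doc.lower().split())

    def overlap(doc):
        bow = Counter(doc.lower().split())
        shared = sum(min(bow[t], pos_bow[t]) for t in bow)
        return shared / max(sum(bow.values()), 1)

    pool = [d for d in candidates if d not in relevant]
    pool.sort(key=overlap, reverse=True)  # most positive-like first
    return pool[: ratio * len(relevant)]
```
</p>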
        <p>In order to mark the queries and candidate documents, we downloaded the latest release of the MeSH thesaurus and kept its main headings and entry terms.</p>
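        <p>A word-level sketch of this marking step is shown below; it assumes a prebuilt index mapping each main heading and entry term to its main-heading identifier, so that synonyms share the same tag. Real MeSH terms are often multi-word, and the marker format and identifiers here are illustrative, not the exact ones used in our runs.
```python
def mark_mesh_terms(text, mesh_index):
    """Wrap every token found in the MeSH vocabulary with marker tokens;
    mesh_index maps a term (heading or entry term) to its heading id."""
    out = []
    for token in text.lower().split():
        if token in mesh_index:
            tag = mesh_index[token]
            out.append(f"[{tag}] {token} [/{tag}]")
        else:
            out.append(token)
    return " ".join(out)
```
Because an entry term is mapped to the identifier of its main heading, a heading and one of its synonym entry terms receive the same tag, making the synonymy visible to the cross-encoder.</p>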
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>Task 11B Phase A was organized in 4 different batches. Each participating system had to return a ranked list of at most 10 relevant articles and/or text snippets. The evaluation metric used for the official scores is the Mean Average Precision (MAP), due to its capability to take into account the order of the submitted items.</p>
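      <p>For reference, average precision over a list truncated at 10 returned items can be computed as below. This is our sketch of the metric; the official BioASQ evaluation code remains authoritative, in particular for the exact normalisation used.
```python
def average_precision(ranked, relevant, cutoff=10):
    """Average precision of one ranked list, truncated at cutoff (sketch)."""
    hits, total = 0, 0.0
    for i, doc_id in enumerate(ranked[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / i  # precision at this rank
    denom = min(len(relevant), cutoff)  # assumed normalisation
    return total / denom if denom else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```
</p>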
      <p>
        In Table 1 we present the results of our runs for batches 2, 3, and 4 of Task 11B Phase A on document ranking. We submitted predictions with two systems: CEDR and Mark-CEDR. The former is an implementation of the CEDR-KNRM model proposed by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The latter is the same model trained with marked queries and documents, as described in Section 2.1.
      </p>
      <p>The best result for our proposed method was obtained during batch 2 of the challenge, where we achieved a 0.2405 MAP score. During batches 3 and 4, Mark-CEDR performed the same way, while the base model (CEDR) gained more than 0.05 points.</p>
      <p>In order to understand these results, we conducted further investigations regarding question type (Table 2) and query length (Table 3). From Table 2 we learn that both models always perform better on yes/no questions. There is a large drop (-0.092) between batches 3 and 4 for list questions treated by Mark-CEDR, while the base model maintains its performance. That partially explains why CEDR performed better during the last batch.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion/Discussion</title>
      <p>The first version of our approach, which consists in integrating external structured knowledge, provides average performance on all batches and, unfortunately, inconsistent results. The scores obtained in batch 3 were promising, because the marked version performed better by 0.03 points. However, it was quite the opposite during the last batch, which reveals that the performance of this first implementation is unstable. This observation is in line with the questioning of [11] about the effectiveness of exact matching against strong baselines.</p>
      <p>However, we identified several points that can be addressed to improve our model. The first one would be to use a stronger retriever: we observed that for many questions some of the expected articles were not in the top 500 retrieved, and in such cases the reranker cannot perform properly. In order to improve the reranker, we could explore new types of marking strategies to investigate the instability issue. Moreover, we could take the MeSH structure into account: we only used MeSH as a controlled vocabulary and did not take advantage of its hierarchical tree structure. Finally, it would be wise to exploit a BERT model pre-trained on a biomedical corpus, like BioLinkBERT [14] or PubMedBERT [15], which achieved state-of-the-art results on the BLURB benchmark [16].</p>
      <p>[7] J. Guo, Y. Cai, Y. Fan, F. Sun, R. Zhang, X. Cheng, Semantic models for the first-stage retrieval: A comprehensive review, ACM Trans. Inf. Syst. 40 (2022). doi:10.1145/3486250.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[9] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, R. Nogueira, Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 2356-2362. doi:10.1145/3404835.3463238.
[10] C. Xiong, Z. Dai, J. Callan, Z. Liu, R. Power, End-to-end neural ad-hoc ranking with kernel pooling, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 55-64. doi:10.1145/3077136.3080809.
[11] J. Lin, R. F. Nogueira, A. Yates, Pretrained transformers for text ranking: BERT and beyond, CoRR abs/2010.06467 (2020). URL: https://arxiv.org/abs/2010.06467.
[12] L. Boualili, J. G. Moreno, M. Boughanem, MarkedBERT: Integrating traditional IR cues in pre-trained language models for passage retrieval, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1977-1980. doi:10.1145/3397271.3401194.
[13] J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, S. Ma, Optimizing dense retrieval model training with hard negatives, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1503-1512.
[14] M. Yasunaga, J. Leskovec, P. Liang, LinkBERT: Pretraining language models with document links, 2022. arXiv:2203.15827.
[15] R. Tinn, H. Cheng, Y. Gu, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Fine-tuning large neural language models for biomedical natural language processing, CoRR abs/2112.07869 (2021). URL: https://arxiv.org/abs/2112.07869.
[16] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Domain-specific language model pretraining for biomedical natural language processing, CoRR abs/2007.15779 (2020). URL: https://arxiv.org/abs/2007.15779.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Nentidis, A. Krithara, G. Paliouras, E. Farre-Maduell, S. Lima-Lopez, M. Krallinger, BioASQ at CLEF2023: The eleventh edition of the large-scale biomedical semantic indexing and question answering challenge, in: Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2-6, 2023, Proceedings, Part III, Springer-Verlag, Berlin, Heidelberg, 2023, pp. 577-584. doi:10.1007/978-3-031-28241-6_66.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. Krithara, A. Nentidis, K. Bougiatiotis, G. Paliouras, BioASQ-QA: A manually curated corpus for Biomedical Question Answering, Scientific Data 10 (2023) 170.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Delmas, O. Filangi, N. Paulhe, F. Vinson, C. Duperier, W. Garrier, P.-E. Saunier, Y. Pitarch, F. Jourdan, F. Giacomoni, C. Frainay, FORUM: building a Knowledge Graph from public databases and scientific literature to extract associations between chemicals and diseases, Bioinformatics 37 (2021) 3896-3904. doi:10.1093/bioinformatics/btab627.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333-389.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] S. MacAvaney, A. Yates, A. Cohan, N. Goharian, CEDR: Contextualized embeddings for document ranking, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1101-1104. doi:10.1145/3331184.3331317.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] R. F. Nogueira, W. Yang, K. Cho, J. Lin, Multi-stage document ranking with BERT, CoRR abs/1910.14424 (2019). URL: http://arxiv.org/abs/1910.14424.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>