Enhancing Biomedical Document Ranking with Domain Knowledge Incorporation in a Multi-Stage Retrieval Approach
Notebook for the BioASQ Lab at CLEF 2024

Maël Lesavourey, Gilles Hubert
IRIT lab, 118 Route de Narbonne, F-31062 TOULOUSE CEDEX 9
mael.lesavourey@irit.fr (M. Lesavourey); gilles.hubert@irit.fr (G. Hubert)
https://www.irit.fr/~Gilles.Hubert/ (G. Hubert)

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
This article presents the results we obtained during BioASQ Task 12B Phase A on document ranking. Our strategy is based on a two-stage retrieval approach composed of a retriever and a reranker. The retriever is based on BM25 scoring and RM3 query expansion. The reranker is a BERT cross-encoder pre-trained on a biomedical corpus (BioLinkBERT). We study the impact of incorporating domain knowledge (MeSH) into this pre-trained language model and build a voting system to combine the insights from multiple models. Independently of the challenge, we also investigate a way to reduce the number of input tokens in order to bypass the 512-token limit that BERT imposes on the input sequence.

Keywords
biomedical document ranking, information retrieval, thesaurus-based knowledge, BERT cross-encoder, multi-stage retrieval

1. Introduction

Passage and document ranking are important tasks in information retrieval (IR) systems, as they facilitate users' navigation through different sources. With the recent breakthrough of generative artificial intelligence (AI), they are also used in Retrieval Augmented Generation (RAG) [1] workflows to help generative systems produce more precise answers to a given query. In the academic field, the rise of open science and online information access multiplies the amount of knowledge available. The drawback is that it becomes harder to find a specific and precise piece of information. For this reason, the BioASQ1 initiative [2] organises an annual evaluation campaign addressing several biomedical IR tasks. Specifically, Task B Phase A [3] of the challenge focuses on document retrieval and text snippet extraction.

In recent years, pre-trained language models (PLMs) [4, 5] have achieved state-of-the-art results on various natural language processing (NLP) tasks thanks to their ability to capture the semantics of texts. However, the performance of such models tends to drop when they are applied to corpora from specific domains like biomedicine. Indeed, biomedical literature has specific features that exacerbate the semantic gap between general information and biomedical knowledge, for example the synonymy of biomedical terms (tumor, neoplasm, and cancer referring to the same concept) and complex lexical structures (abbreviations, formulas, proper names, etc.). One way to address this problem is to use language models (LMs) pre-trained on biomedical corpora [6, 7, 8], but several works show that this alone is not enough to capture the semantic relationships between terms in a document.

Following our work during last year's evaluation campaign [9], we investigated the impact of incorporating biomedical knowledge into PLMs trained on domain-specific texts. More precisely, we propose to modify the input sequence of a BERT-based model [4] by tagging its biomedical terms in order to direct the attention mechanism towards them. We combine this approach with a voting system to take advantage of the outputs of several models at the same time.

1 http://www.bioasq.org/
2. Method

2.1. Task Description

We participated in Task B Phase A on document retrieval. The campaign was divided into 4 batches of 85 queries each. Participating systems were expected to return, for each query, a ranked list of at most 10 relevant articles. These articles had to be retrieved from the PubMed Annual Baseline for 2024 (37 million citations). The evaluation metric used to build the leaderboard is the Mean Average Precision (MAP), due to its ability to take into account the order of the submitted items while being less restrictive than the traditional Precision score.

2.2. Systems Description

The strategy we used to address this task is a multi-stage retrieval approach [10], which divides document ranking into several phases. We built a two-stage pipeline with a retriever and a reranker. Traditionally, a retriever is based on bag-of-words (BoW) representations [11] and aims at creating a candidate list of hundreds of documents from the whole corpus (several million documents). It is often less effective but also has a lower computational cost. The reranker, in turn, builds the final list of a dozen documents from the candidate list generated by the retriever. State-of-the-art models for this stage are PLMs based on the transformer architecture [12], which greatly improve results but also increase the computational cost. Finally, we explore a way of incorporating biomedical knowledge into PLMs using the Medical Subject Headings2 (MeSH) thesaurus. This workflow is illustrated in Figure 1.

Figure 1: Pipeline of our retrieve-then-rerank approach.

In order to generate the candidate lists, we created our own index of PubMed articles and queried it using BM25 and RM3 [13]. BM25 is a term-weighting model for evaluating document relevance, while RM3 uses pseudo-relevance feedback to improve search performance by incorporating additional information from relevant documents (query expansion). We used Pyserini [14], a Python toolkit for reproducible IR research, to create the index and the retriever; a short sketch of this first stage is given below.

Our base model for the reranking part is a BERT cross-encoder pre-trained on biomedical publications. Following the results of the BLURB3 [15] leaderboard and due to computational limitations, we worked with the base version of BioLinkBERT [7]. We built a cross-encoder architecture trained with a pairwise approach. It takes as input the query and a candidate document as the following sequence:

$[\mathrm{CLS}]\ q_1, \ldots, q_n\ [\mathrm{SEP}]\ d_1, \ldots, d_m\ [\mathrm{SEP}]$

where $(q_i)_{i \in [1,n]}$ and $(d_i)_{i \in [1,m]}$ are respectively the terms of the query and of the document. The fine-tuning is done by computing a relevance score between the query and 4 documents (2 positive and 2 negative ones). The objective is to predict whether each one is relevant or not using the [CLS] token embedding, which is the pooled output of the model and represents the whole input. This embedding is passed through a single linear layer that produces the classification scores (logits). Applying the SoftMax function to these scores yields a probability distribution over our two classes (relevant and non-relevant).

2 https://www.nlm.nih.gov/mesh/meshhome.html
3 https://microsoft.github.io/BLURB/
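To make the first stage concrete, the retrieval step described above can be sketched with Pyserini as follows. This is a minimal illustration rather than our exact script: the index path and the query are placeholders, and the BM25/RM3 values are those reported in Section 3.2.

```python
from pyserini.search.lucene import LuceneSearcher

# Open a locally built index of PubMed titles and abstracts (placeholder path).
searcher = LuceneSearcher('indexes/pubmed-2024-baseline')
searcher.set_bm25(k1=0.6, b=0.6)
searcher.set_rm3(fb_terms=16, fb_docs=2, original_query_weight=0.9)

# Retrieve a candidate list of 500 documents for one (placeholder) query.
hits = searcher.search('Is dexamethasone effective for COVID-19 treatment?', k=500)
candidates = [(hit.docid, hit.score) for hit in hits]
```

The candidates produced by this stage are the documents passed to the cross-encoder described above.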
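In the same spirit, the scoring of a single query-document pair by the fine-tuned cross-encoder can be sketched as below. The checkpoint path is a placeholder for the fine-tuned BioLinkBERT-base model, and class index 1 is assumed to be the "relevant" class.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path to the BioLinkBERT-base cross-encoder fine-tuned as described above.
tokenizer = AutoTokenizer.from_pretrained('models/biolinkbert-crossencoder')
model = AutoModelForSequenceClassification.from_pretrained('models/biolinkbert-crossencoder')
model.eval()

def relevance_probability(query: str, document: str) -> float:
    # Builds [CLS] q_1 ... q_n [SEP] d_1 ... d_m [SEP], truncating the document
    # so that the whole sequence stays within the 512-token limit.
    inputs = tokenizer(query, document, truncation='only_second',
                       max_length=512, return_tensors='pt')
    with torch.no_grad():
        logits = model(**inputs).logits      # classification scores from the [CLS] representation
    probs = torch.softmax(logits, dim=-1)    # distribution over (non-relevant, relevant)
    return probs[0, 1].item()
```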
During inference, the documents of a candidate list are ranked by their probability of being relevant to a given query.

To enhance the learning of biomedical entities and their semantic relations in the text, we propose to incorporate biomedical knowledge into the cross-encoder by modifying its input sequence. Our intuition is that a term referenced in a knowledge base carries more information than other words in the text. We extracted this knowledge from the MeSH thesaurus, a controlled vocabulary maintained by the National Library of Medicine4 (NLM) and used to index every citation in MEDLINE/PubMed.

To incorporate MeSH terms into our PLM, we propose to tag the input sequence with a unique special token "#". We built a vocabulary with all Main Headings, Qualifiers and Supplementary Records of the MeSH thesaurus, together with their corresponding Entry Terms. We detected these terms in the query by exact matching of the 1-, 2- and 3-grams of the query against the vocabulary. Then, we added the special token before and after each detected term in the query, and before and after its exact matches and synonyms in the document. The idea is to guide the attention mechanism towards biomedical terms through this soft matching of biomedical items. An example of the input sequence tagging is given in Figure 2.

Figure 2: Example of the marking strategy on a query and the first sentences of one of its most relevant documents.

Finally, we learnt from last year's experiments that system performance is inconsistent across batches. To even this out, we built a voting system that combines the results of all our models over the batches. For a given query, we assigned a score to each of the 10 articles returned by each model. The voting system returns a new list of 10 items sorted according to the scores obtained.

2.3. Additional Study

One of the main limitations of BERT-based models is the length of the input sequence [16]. Indeed, such systems take as input at most 512 tokens, which represents even fewer words. This limit is easily exceeded when processing scientific publications, even when considering only their title and abstract. In the systems described in Section 2.2, we chose to simply truncate the input sequence, keeping the first tokens until the limit is reached. This method still provides competitive results in various tasks but leads to a loss of information, as it deliberately ignores part of the text. To overcome this problem, we fine-tuned another BioLinkBERT cross-encoder to compute the similarity between a sentence and a query. This allows us to rank the sentences of a document by order of relevance to a given query. Thus, we reduced the input length by selecting the most relevant sentences without exceeding 512 tokens.

4 https://www.nlm.nih.gov/

3. Experimental Settings

3.1. Data

To build our index, we used the PubMed Annual Baseline5 for 2024, from which we removed all articles without an available abstract. We concatenated the title and abstract of each citation to obtain a document. We also used the datasets released by the BioASQ6 team [17], which contain all queries and their gold standards (relevant documents) from past editions.

We created a training set for the BERT cross-encoder. For each query, we retrieved 4000 articles using BM25+RM3. The first 20 retrieved articles that were not in the gold standard were chosen as hard-negative samples, while the gold-standard documents were chosen as positive ones. We selected negative samples close to the query in terms of BoW representations, as this improves the effectiveness of BERT-based models [18].

In last year's experiments, we only extracted Main Headings and their corresponding Entry Terms from the MeSH thesaurus. This year, we also added Qualifiers and Supplementary Records in order to detect more biomedical and chemical concepts in the text. We used the 2024 release of MeSH7 to apply the tagging strategy during both training and inference. For inference, we generated a candidate list of 500 articles to be reranked by 4 PLMs, as described in Section 3.2.
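As an illustration of the tagging strategy described in Section 2.2, the following simplified sketch marks exact 1-, 2- and 3-gram matches against a flattened MeSH vocabulary with the special token "#". The miniature vocabulary is purely illustrative; in the full pipeline, the same marking is applied to the candidate document, restricted to the terms detected in the query and their Entry Term synonyms.

```python
def tag_mesh_terms(text: str, mesh_terms: set[str], max_n: int = 3) -> str:
    """Surround every exact 1-, 2- or 3-gram match of a MeSH entry with the special token '#'."""
    tokens = text.split()
    tagged, i = [], 0
    while i < len(tokens):
        match_len = 0
        # Prefer the longest n-gram match starting at position i.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            ngram = ' '.join(t.lower().strip('.,;:?!') for t in tokens[i:i + n])
            if ngram in mesh_terms:
                match_len = n
                break
        if match_len:
            tagged += ['#'] + tokens[i:i + match_len] + ['#']
            i += match_len
        else:
            tagged.append(tokens[i])
            i += 1
    return ' '.join(tagged)

# Illustrative vocabulary; in practice it contains all Main Headings, Qualifiers,
# Supplementary Records and Entry Terms of the 2024 MeSH release.
mesh_vocab = {'neoplasms', 'breast neoplasms', 'tamoxifen'}
print(tag_mesh_terms('Is tamoxifen effective against breast neoplasms?', mesh_vocab))
# -> Is # tamoxifen # effective against # breast neoplasms? #
```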
3.2. Systems Settings

We used the same retriever, with the same parameters, for every submission (k1 = 0.6, b = 0.6, fb_terms = 16, fb_docs = 2, original_query_weight = 0.9) [14]. In Table 1, we report the settings of the 4 PLMs we trained and their corresponding submission names. BioASQ10 and BioASQ11 are two datasets provided by the BioASQ team, containing all queries and their gold standards from the first edition up to the 10th and 11th challenges, respectively. Note that, for each batch, we selected two positive and two negative publications as described in Section 3.1. Fine-tuning was done using BertForSequenceClassification8 with 2 classes (relevant or irrelevant), with the default loss function of the HuggingFace implementation (cross-entropy).

Table 1: Reranking systems parameters

| Submission name | Model description | Tags | Training Corpus | Training parameters |
|-----------------|-------------------|-----------|-----------------|----------------------------------------|
| IRIS1 | BioLinkBERT-base | N/A | BioASQ10 | LR = e-4, Epochs = 5, Batch = 128 × 16 |
| IRIS2 | BioLinkBERT-base | N/A | BioASQ11 | LR = e-4, Epochs = 5, Batch = 128 × 16 |
| IRIS3 | Voting System | # or N/A | N/A | N/A |
| IRIS4 | BioLinkBERT-base | # | BioASQ10 | LR = e-4, Epochs = 5, Batch = 128 × 16 |
| IRIS5 | BioLinkBERT-base | N/A | BioASQ10 | LR = e-5, Epochs = 5, Batch = 128 × 16 |

5 https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
6 http://participants-area.bioasq.org/datasets/
7 https://www.nlm.nih.gov/databases/download/mesh.html
8 https://huggingface.co/docs/transformers/v4.42.0/en/model_doc/bert#transformers.BertForSequenceClassification/

The voting system is inspired by slate voting, as each voter (here, a reranking system) ranks a list of candidates. We assigned a score to every document in the four lists generated for each query. The scores assigned to the top 10 articles, sorted by decreasing order of relevance, are: [25, 19, 15, 12, 10, 8, 6, 5, 4, 4]. A document is assigned a score of 0 for each list it does not appear in. Then, for every publication, we summed its 4 scores and used this sum to sort the documents and create a final list of the 10 most relevant papers. Note that a publication always in the bottom 4 of the ranked lists obtains a lower final score than a document ranked 1st by one system and absent from the other slates. This rewards a system that notices a strong similarity between the query and a document that would be left aside by the other systems (a sketch of this fusion is given at the end of this section).

The PLM used to reduce the input sequence length has the same settings as "IRIS1", except for the data. We used as positive samples the "ideal answers" of each query (handwritten answers provided by BioASQ) and as negative samples sentences from the first 50 documents (excluding the gold standard) retrieved by BM25+RM3. We used this model in two different ways (a sketch of the reduction step follows the list below):

• to reduce the input length during inference, before applying "IRIS1". We will refer to this model as "IRIS1-R";
• during training and inference with the same settings as "IRIS1", i.e., we train the same model with documents having their input modified. We will refer to this model as "IRIS-R".
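As announced above, here is a rough sketch of the reduction step used by "IRIS1-R" and "IRIS-R". It keeps the sentences judged most relevant by the auxiliary cross-encoder until the 512-token budget is reached; the naive sentence splitting and the budget accounting are simplifying assumptions, and score_fn stands for the sentence-level scorer described in this section.

```python
def reduce_document(query: str, document: str, score_fn, tokenizer, max_tokens: int = 512) -> str:
    """Keep the sentences most relevant to the query, within the 512-token budget."""
    # Naive sentence splitting; the real pipeline may rely on a proper sentence segmenter.
    sentences = [s.strip() + '.' for s in document.split('.') if s.strip()]
    # Rank sentences with the auxiliary cross-encoder (higher score = more relevant).
    ranked = sorted(sentences, key=lambda s: score_fn(query, s), reverse=True)
    budget = max_tokens - len(tokenizer.tokenize(query)) - 3   # room for [CLS] and the two [SEP]
    kept, used = set(), 0
    for sentence in ranked:
        length = len(tokenizer.tokenize(sentence))
        if used + length > budget:
            break
        kept.add(sentence)
        used += length
    # Reassemble the selected sentences in their original order.
    return ' '.join(s for s in sentences if s in kept)
```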
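Finally, the slate-vote fusion behind "IRIS3" can be sketched as follows; the document identifiers are placeholders, and each run stands for the top-10 list produced by one reranker for a single query.

```python
# Scores assigned to ranks 1 to 10 of each system's list.
SLATE_SCORES = [25, 19, 15, 12, 10, 8, 6, 5, 4, 4]

def fuse(runs: list[list[str]], top_k: int = 10) -> list[str]:
    """Sum the slate scores of every document over all runs and keep the top_k documents."""
    totals: dict[str, int] = {}
    for run in runs:                                   # one ranked list of document ids per reranker
        for rank, doc_id in enumerate(run[:len(SLATE_SCORES)]):
            totals[doc_id] = totals.get(doc_id, 0) + SLATE_SCORES[rank]
    # Documents absent from a list implicitly receive 0 from that list.
    return sorted(totals, key=totals.get, reverse=True)[:top_k]

# Hypothetical runs from the four rerankers for one query.
final_list = fuse([['d1', 'd2', 'd3'], ['d2', 'd4', 'd1'], ['d5', 'd2', 'd3'], ['d2', 'd1', 'd6']])
```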
4. Results

In this section, we present the results obtained during batches 1, 2, 3 and 4 of Task 12B Phase A on document ranking. We officially participated in batches 1, 2 and 3 of this year's evaluation campaign with five models, whose results are reported in Table 2, in which we also include unofficial results on batch 4. Additional results for the systems including input length reduction are reported in Table 3.

Table 2: Results for Task 12B Phase A (* marks the best Recall of each batch)

| Batch | Model | MAP | Recall | System Rank | Team Rank |
|-------|-------|--------|----------|-------------|-----------|
| 1 | IRIS1 | 0.1357 | 0.2339 | 13/40 | 3 |
| 1 | IRIS2 | 0.1318 | *0.2413* | 14/40 | 3 |
| 1 | IRIS3 | 0.1471 | 0.2404 | 11/40 | 3 |
| 1 | IRIS4 | 0.1467 | 0.2389 | 12/40 | 3 |
| 1 | IRIS5 | 0.1132 | 0.2365 | 18/40 | 4 |
| 2 | IRIS1 | 0.1429 | 0.2564 | 21/51 | 6 |
| 2 | IRIS2 | 0.1151 | 0.2144 | 26/51 | 6 |
| 2 | IRIS3 | 0.1487 | *0.2682* | 19/51 | 6 |
| 2 | IRIS4 | 0.1097 | 0.2099 | 27/51 | 6 |
| 2 | IRIS5 | 0.1452 | 0.2562 | 20/51 | 6 |
| 3 | IRIS1 | 0.1682 | *0.3155* | 16/57 | 4 |
| 3 | IRIS2 | 0.1217 | 0.2245 | 31/57 | 5 |
| 3 | IRIS3 | 0.1890 | 0.3059 | 12/57 | 4 |
| 3 | IRIS4 | 0.1777 | 0.2962 | 14/57 | 4 |
| 3 | IRIS5 | 0.1734 | 0.2780 | 15/57 | 4 |
| 4 | IRIS1 | 0.2166 | 0.3655 | N/A | N/A |
| 4 | IRIS2 | 0.1710 | 0.3027 | N/A | N/A |
| 4 | IRIS3 | 0.2378 | 0.3716 | N/A | N/A |
| 4 | IRIS4 | 0.2246 | *0.3745* | N/A | N/A |
| 4 | IRIS5 | 0.1982 | 0.3404 | N/A | N/A |

We observe that, in all 4 batches, the best MAP scores are obtained by the voting system, confirming the value of combining the insights of several models at the same time. However, this system obtains the best Recall in only one batch, indicating that it finds fewer relevant documents but that the ones retrieved are better ranked within the top 10. It is interesting to note that biomedical knowledge incorporation has a positive effect on the MAP results: except during batch 2, "IRIS4" performs better than the other submitted PLMs.

We also investigated whether reducing the input length by ranking sentences by order of relevance would be beneficial to a PLM. The results tend to confirm this hypothesis, as in the majority of cases MAP scores increase whether we apply the reduction during training or only during inference. Moreover, applying the reduction during training leads (in most cases) to better results in terms of MAP, while applying it only at inference seems to mainly enhance the Recall score.

Table 3: Results of systems with input reduction

| Batch | Model | MAP | Recall |
|-------|---------|--------|--------|
| 1 | IRIS1-R | 0.1344 | 0.2566 |
| 1 | IRIS-R | 0.1480 | 0.2358 |
| 2 | IRIS1-R | 0.1481 | 0.2672 |
| 2 | IRIS-R | 0.122 | 0.2596 |
| 3 | IRIS1-R | 0.1746 | 0.3084 |
| 3 | IRIS-R | 0.1812 | 0.2757 |
| 4 | IRIS1-R | 0.2232 | 0.3732 |
| 4 | IRIS-R | 0.2373 | 0.3633 |

To better understand our results, we conducted additional analyses. Table 4 presents the scores of 3 representative systems for each type of question. There is a large drop in performance for "IRIS4" on the factoid questions of Batch 2, which partially explains why this system is globally less effective during this batch. We also observe that, locally, "IRIS3" (the voting system) does not always perform better than the other systems but still manages to obtain the highest scores overall. This shows its ability to leverage and balance the scores of different models.

Table 4: MAP scores per question type

| Batch | Question type | IRIS1 | IRIS3 | IRIS4 |
|-------|---------------|-------|-------|-------|
| 1 | Yes/No | 0.162 | 0.167 | 0.155 |
| 1 | List | 0.071 | 0.106 | 0.123 |
| 1 | Summary | 0.216 | 0.204 | 0.2 |
| 1 | Factoid | 0.1 | 0.116 | 0.114 |
| 2 | Yes/No | 0.215 | 0.187 | 0.125 |
| 2 | List | 0.149 | 0.177 | 0.135 |
| 2 | Summary | 0.091 | 0.127 | 0.115 |
| 2 | Factoid | 0.097 | 0.094 | 0.058 |
| 3 | Yes/No | 0.229 | 0.232 | 0.188 |
| 3 | List | 0.243 | 0.332 | 0.325 |
| 3 | Summary | 0.092 | 0.093 | 0.094 |
| 3 | Factoid | 0.105 | 0.104 | 0.113 |
| 4 | Yes/No | 0.305 | 0.295 | 0.262 |
| 4 | List | 0.175 | 0.2 | 0.22 |
| 4 | Summary | 0.094 | 0.145 | 0.129 |
| 4 | Factoid | 0.25 | 0.283 | 0.268 |

Table 5 shows that the average query length increases across batches, as does the mean number of MeSH terms detected in the queries.
Our models' performances seem to follow this trend, as the overall results also increase across batches. Except for Batch 2, where "IRIS4" obtains its worst scores, it seems that detecting more biomedical terms enhances the performance of our tagging model.

Table 5: Query lengths and MeSH terms detected per query

| Batch | Mean number of tokens | Mean number of MeSH terms |
|-------|-----------------------|---------------------------|
| 1 | 10.06 | 1.56 |
| 2 | 10.61 | 1.92 |
| 3 | 10.72 | 2.05 |
| 4 | 12.04 | 2.31 |

5. Conclusion and Perspectives

The strategy implemented to incorporate biomedical knowledge into BERT cross-encoders is promising, as it obtains better results than the base systems most of the time. This emphasises the importance of helping such models understand the semantic relations between domain-specific terms in a document. Moreover, we show that taking advantage of several model outputs has a strong and positive effect on the results. Indeed, the voting system provides our best scores, and these scores were consistent, following the general trend of participating systems improving batch after batch.

Finally, we investigated a way to improve publication processing by bypassing the input length limitation. We trained a simple model to rank the sentences that are most useful to answer the query. It constrains the BERT cross-encoder to focus on the relevant parts of each document and to ignore sentences that carry less information. This strategy yields better MAP scores, especially when used during both training and inference.

Although these results are promising, our models do not perform as well as the strongest systems for this task. Thus, there are several axes of improvement we would like to explore. First of all, we showed that modifying the input sequence can have a positive effect, and we plan to apply a more elaborate tagging strategy. One may think of adding new special tokens to tag other biomedical concepts from the Unified Medical Language System metathesaurus9, triplets [19, 20], or other important words in the text [21]. In addition, we only used MeSH as a controlled vocabulary. One way to improve the knowledge we infuse into PLMs would be to take into account its tree structure. An idea would be to apply knowledge graph embedding algorithms and combine those representations with the textual embeddings of PLMs [22, 23, 24]. Moreover, it would be wise to replace the voting system with a method that enables a deeper interaction between the models, such as a multi-layer perceptron or multi-head attention.

9 https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html

References

[1] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.
[2] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[3] A. Nentidis, G. Katsimpras, A. Krithara, G. Paliouras, Overview of BioASQ Tasks 12b and Synergy12 in CLEF2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[5] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 2227–2237. URL: https://aclanthology.org/N18-1202. doi:10.18653/v1/N18-1202.
[6] R. Tinn, H. Cheng, Y. Gu, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Fine-tuning large neural language models for biomedical natural language processing, CoRR abs/2112.07869 (2021). URL: https://arxiv.org/abs/2112.07869. arXiv:2112.07869.
[7] M. Yasunaga, J. Leskovec, P. Liang, LinkBERT: Pretraining language models with document links, 2022. arXiv:2203.15827.
[8] K. R. Kanakarajan, B. Kundumani, M. Sankarasubbu, BioELECTRA: Pretrained biomedical text encoder using discriminators, in: D. Demner-Fushman, K. B. Cohen, S. Ananiadou, J. Tsujii (Eds.), Proceedings of the 20th Workshop on Biomedical Language Processing, Association for Computational Linguistics, Online, 2021, pp. 143–154. URL: https://aclanthology.org/2021.bionlp-1.16. doi:10.18653/v1/2021.bionlp-1.16.
[9] M. Lesavourey, G. Hubert, BioASQ 11B: Integrating domain specific vocabulary to BERT-based model for biomedical document ranking, in: CLEF (Working Notes), 2023, pp. 145–151.
[10] R. F. Nogueira, W. Yang, K. Cho, J. Lin, Multi-stage document ranking with BERT, CoRR abs/1910.14424 (2019). URL: http://arxiv.org/abs/1910.14424. arXiv:1910.14424.
[11] J. Guo, Y. Cai, Y. Fan, F. Sun, R. Zhang, X. Cheng, Semantic models for the first-stage retrieval: A comprehensive review, ACM Trans. Inf. Syst. 40 (2022). URL: https://doi.org/10.1145/3486250. doi:10.1145/3486250.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[13] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333–389.
[14] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, R. Nogueira, Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 2356–2362. URL: https://doi.org/10.1145/3404835.3463238. doi:10.1145/3404835.3463238.
[15] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Domain-specific language model pretraining for biomedical natural language processing, CoRR abs/2007.15779 (2020). URL: https://arxiv.org/abs/2007.15779. arXiv:2007.15779.
[16] J. Lin, R. F. Nogueira, A. Yates, Pretrained transformers for text ranking: BERT and beyond, CoRR abs/2010.06467 (2020). URL: https://arxiv.org/abs/2010.06467. arXiv:2010.06467.
[17] A. Krithara, A. Nentidis, K. Bougiatiotis, G. Paliouras, BioASQ-QA: A manually curated corpus for Biomedical Question Answering, Scientific Data 10 (2023) 170.
[18] J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, S. Ma, Optimizing dense retrieval model training with hard negatives, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1503–1512.
[19] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, Q. Liu, ERNIE: Enhanced language representation with informative entities, arXiv preprint arXiv:1905.07129 (2019).
[20] H. Kilicoglu, D. Shin, M. Fiszman, G. Rosemblat, T. C. Rindflesch, SemMedDB: A PubMed-scale repository of biomedical semantic predications, Bioinformatics 28 (2012) 3158–3160.
[21] L. Boualili, J. G. Moreno, M. Boughanem, Highlighting exact matching via marking strategies for ad hoc document ranking with pretrained contextualized language models, Information Retrieval Journal 25 (2022) 414–460.
[22] Q. Dong, Y. Liu, S. Cheng, S. Wang, Z. Cheng, S. Niu, D. Yin, Incorporating explicit knowledge in pre-trained language models for passage re-ranking, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 1490–1501.
[23] Q. Xie, P. Tiwari, S. Ananiadou, Knowledge-enhanced graph topic transformer for explainable biomedical text summarization, IEEE Journal of Biomedical and Health Informatics 28 (2024) 1836–1847. doi:10.1109/JBHI.2023.3308064.
[24] J. Tan, J. Hu, S. Dong, Incorporating entity-level knowledge in pretrained language model for biomedical dense retrieval, Computers in Biology and Medicine 166 (2023) 107535.