<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dimitra Panou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Reczko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Fundamental Biomedical Science, Biomedical Sciences Research Center “Alexander Fleming”</institution>,
          <addr-line>34 Fleming Street</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <addr-line>16672 Vari</addr-line>,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>The recently introduced semi-supervised method GANBERT for finetuning large language models [1] has been applied to document relevance prediction in biomedical question answering. The additional use of unlabeled texts during training enhances the robustness of the prediction, and the method outperforms our previous transformer ELECTROLBERT [2]. The initial document selection phase used for both ELECTROLBERT and GANBERT has been improved using BM25 combined with RM3 query expansion with optimized parameters. Both systems were continuously improved during the BioASQ11 [3] competition, and in the last batch, GANBERT ranked as the 3rd team for document prediction. The previous version of ELECTROLBERT took the 1st place for the “yes/no” type questions in this year’s SYNERGY [4] predictions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>GANBERT [1] enables fine-tuning of BERT [5] with unlabeled data using the GAN framework [6],
where a generator is trained to produce fake representations and a discriminator is trained
to distinguish samples of the generator from the real instances. By generating only the
internal representation of text, GANBERT avoids the difficult generation of realistic discrete
text and can be applied directly to text classification. Two GANBERT variants were later
successfully used for predicting the checkworthiness of potential fake news in tweets [7].
In [8], the noise generation in GANBERT was optimized for the task of discriminating correct
paraphrases of Spanish texts. In the following we describe optimized document selection and
the application of GANBERT for document relevance prediction in biomedical question answering
in the BioASQ11 competition [9]. We also provide details of the additional predictions with
our ELECTROLBERT algorithm [2] in the same competition.</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>2. BM25 and RM3 hyperparameter optimization</title>
      <p>To identify documents relevant for a question, we replace the TF/IDF method with the widely
used BM25 [10]. BM25 has two parameters, k1 and b. k1 is intuitively related to the rate
of increase in a document’s score from matching an additional occurrence of a term, where a
smaller k1 provides a faster increase. The parameter b controls the extent of document-length
normalisation. The search is combined with RM3 [11], a classic pseudo-relevance-feedback-based
query expansion model, to find related concepts. RM3 has three parameters: the number of query
expansion terms, the number of top-ranked documents from which the expansion terms are
obtained, and the weight of the original query. The efficient Python implementation in the
package Pyserini is used [12]. A grid search on these parameters to optimize the mean average
precision (MAP) of the top 10 returned documents for the BioASQ11 training set provided the
values that were used in all four batches of BioASQ11. A random search optimizing the average
MAP of the top 10 returned documents for the 240 questions in the first three batches of
BioASQ11 indicates potential improvements. The optimized parameters shown in table 1 clearly
outperform the default settings.</p>
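      <p>As a rough illustration of the two BM25 parameters, the following is a minimal sketch of the Okapi BM25 term weight in one common variant; Pyserini’s Lucene implementation differs in details such as the exact IDF formulation:</p>

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Contribution of one query term to a document's BM25 score.

    k1 governs how additional occurrences of a term change the score
    (with a smaller k1 the score approaches its maximum sooner), and b
    controls document-length normalisation (b=0 disables it entirely).
    """
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)
```

      <p>With Pyserini, the corresponding knobs are set through <monospace>LuceneSearcher.set_bm25(k1, b)</monospace> and, for the RM3 expansion, <monospace>set_rm3(fb_terms, fb_docs, original_query_weight)</monospace> before issuing the search.</p>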
    </sec>
    <sec id="sec-3">
      <title>3. Training, validation and test data</title>
      <p>For finetuning GANBERT, all pairs of a question and its correct documents provided in the
training set for BioASQ11 are used for the ’relevant’ class. As introduced in the ELECTROLBERT
training [2], the negative examples for the ’non-relevant’ class are generated using a range of
false positives from the initial document selection phase to better discriminate the relevant
documents obtained. All questions of the relevance training set were processed with BM25 and
RM3 using the settings marked with B3+4:EB0-4 in table 1 to select 1000 relevant documents for
each question. The documents were ranked according to their score and all documents between
rank 100 and 150 were used as negative examples, excluding potential positive examples in
these ranks. The values of the start and end rank positions for the negative set were optimized
by retraining and maximizing the mean average precision measured on all batches of BioASQ10.
For the unlabeled set, all pairs of a question and its ideal answer and all related snippets from
the BioASQ10 training set were used. As a validation set, the top 100 documents scored with
BM25 and RM3 (settings again as in B3+4:EB0-4) for the 240 questions in the first three batches
of BioASQ11 were used. A final independent test was made on the 90 questions of batch 4 of
BioASQ11.</p>
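      <p>The negative-example construction described above can be sketched as follows; the function name and signature are ours for illustration — the text only specifies the rank window 100–150 and the exclusion of known positives:</p>

```python
def build_negatives(ranked_doc_ids, positive_ids, start_rank=100, end_rank=150):
    """Select hard negative examples from an initial BM25+RM3 ranking:
    all documents ranked between start_rank and end_rank (1-based,
    inclusive), excluding any document known to be relevant for the
    question."""
    positives = set(positive_ids)
    window = ranked_doc_ids[start_rank - 1:end_rank]
    return [doc for doc in window if doc not in positives]
```

      <p>Applied to the 1000 documents retrieved per question, this yields up to 51 negatives per question, fewer when gold-standard documents fall inside the window.</p>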
      <p>[Table 1 caption, partially recovered: k1 and b are parameters of BM25; the number of expansion terms, the number of feedback documents, and the original query weight are parameters of RM3. One column specifies the number of questions (total 240) with at least one correct document; another specifies the number of correctly identified documents (max. 647). In the column “used for”, Bx denotes the BioASQ11 test batch x, and EBy denotes the system ELECTROLBERTy.]</p>
      <p>[Table 1: grid of BM25 (k1, b) and RM3 (expansion terms, feedback documents, original query weight) settings with the resulting MAP values; the numeric columns were flattened during extraction and are omitted here.]</p>
    </sec>
    <sec id="sec-3b">
      <title>4. GANBERT finetuning and hyperparameter optimization</title>
      <p>The adaptation of the GANBERT architecture introduced in [1] for document relevance
classification is shown in figure 1. Using the labeled and unlabeled data described in the
previous section for finetuning, and employing the large pretrained BERT model provided with
the GANBERT implementation in the path for the real data (provided by the authors of GANBERT
at https://github.com/crux82/ganbert), all relevant hyperparameters for GANBERT were
optimized by multiple finetunings while monitoring the performance on the first three batches
of BioASQ11, as shown in table 2. All GANBERT models perform substantially better than the
standard BERT model, and the performance of GANBERT is quite stable across the different
hyperparameter settings, including the variations in the noise generation part suggested
in [8].</p>
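      <p>The semi-supervised objective behind this setup can be sketched numerically as follows; this is a numpy illustration only — the linear head <monospace>W</monospace>, the array shapes, and the function names are our assumptions, not the GANBERT code, which appends one extra “fake” class to the real classes of the discriminator:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

K = 2                                      # real classes: relevant / non-relevant
W = rng.normal(size=(768, K + 1)) * 0.01   # hypothetical linear head; last column = "fake"

def discriminator_losses(real_repr, real_labels, fake_repr, unlab_repr):
    # supervised cross-entropy on labeled question-document pairs
    p_real = softmax(real_repr @ W)
    sup = -np.log(p_real[np.arange(len(real_labels)), real_labels] + 1e-9).mean()
    # unlabeled representations should NOT land in the extra "fake" class ...
    p_unlab = softmax(unlab_repr @ W)
    unsup_real = -np.log(1.0 - p_unlab[:, -1] + 1e-9).mean()
    # ... while generated representations should
    p_fake = softmax(fake_repr @ W)
    unsup_fake = -np.log(p_fake[:, -1] + 1e-9).mean()
    return sup, unsup_real, unsup_fake

real = rng.normal(size=(8, 768))
labels = rng.integers(0, K, size=8)
fake = rng.normal(size=(8, 768))
unlab = rng.normal(size=(8, 768))
sup, unsup_real, unsup_fake = discriminator_losses(real, labels, fake, unlab)
```

      <p>The unlabeled term is what lets the question/ideal-answer pairs and snippets contribute to training without relevance labels: they only need to be recognized as real.</p>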
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>In table 3 the performances of our document relevance submissions for the BioASQ11
competition are listed. All submissions marked with ’base model’ use the ELECTROLBERT model of […].</p>
      <p>[Figure 1: the GANBERT architecture for document relevance classification; recovered labels: noise, real data, RQ, R, NRQ, NR, U, F, BERT, D, relevant, non-relevant, “is real?”.]</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion and Future Work</title>
      <p>Our suggested GANBERT version for document relevance prediction has shown promising
performance, outperforming our previous algorithm ELECTROLBERT. As can be seen in the
published BioASQ11 results, both algorithms perform better than some of the other systems that
appear to employ ChatGPT [15]. One obvious extension would be to replace BERT in the path
processing the real data with ELECTROLBERT. This would also lead to the use of a more
appropriate scientific vocabulary, as the BERT model provided with the GANBERT implementation
uses a general-purpose vocabulary. It should also be noted that the size of the unlabeled data
set in this study is relatively small, as it was generated using only text available with the
BioASQ datasets and our limited computational resources. One way to increase it could be the
use of random segments from PubMed abstracts.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>GPU computations were offered by HYPATIA, the Cloud infrastructure of the Greek ELIXIR
node.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[1] D. Croce, G. Castellucci, R. Basili, GAN-BERT: Generative Adversarial Learning for Robust
Text Classification with a Bunch of Labeled Examples, in: Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics, Association for Computational
Linguistics, Online, 2020, pp. 2114–2119. URL: https://aclanthology.org/2020.acl-main.191.
doi:10.18653/v1/2020.acl-main.191.
[2] M. Reczko, ELECTROLBERT: Combining Replaced Token Detection and Sentence
Order Prediction, in: Proc. of CLEF 2022: Conference and Labs of the Evaluation Forum,
September 5–8, 2022, Bologna, Italy, online http://ceur-ws.org/Vol-3180/paper-24.pdf,
urn:nbn:de:0074-3180-7, 2022.
[3] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, L. Gasco,
M. Krallinger, G. Paliouras, Overview of BioASQ 2023: The eleventh BioASQ challenge on
Large-Scale Biomedical Semantic Indexing and Question Answering, in: Experimental
IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth
International Conference of the CLEF Association (CLEF 2023), 2023.
[4] A. Nentidis, G. Katsimpras, A. Krithara, G. Paliouras, Overview of BioASQ Tasks 11b and
Synergy11 in CLEF2023, in: Working Notes of CLEF 2023 - Conference and Labs of the
Evaluation Forum, 2023.
[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding, 2018. URL: https://arxiv.org/abs/1810.04805.
doi:10.48550/ARXIV.1810.04805.
[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
Y. Bengio, Generative Adversarial Nets, in: Proceedings of the 27th International
Conference on Neural Information Processing Systems, 2014.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>