Query-focused Extractive Summarisation for Biomedical and COVID-19 Complex Question Answering
Macquarie University's Participation at BioASQ10 Synergy and BioASQ10b Phase B

Diego Mollá1
1 Macquarie University, Australia

Abstract
This paper presents Macquarie University's participation in the two most recent BioASQ Synergy tasks (as of June 2022) and in BioASQ10 Task B (BioASQ10b), Phase B. In these tasks, participating systems are expected to generate complex answers to biomedical questions, where the answers may contain more than one sentence. We apply query-focused extractive summarisation techniques. In particular, we follow a sentence classification-based approach that scores each candidate sentence associated with a question, and the 𝑛 highest-scoring sentences are returned as the answer. The Synergy task requires an end-to-end system that performs document selection, snippet selection, and generation of the final answer, but it has very limited training data. For the Synergy task, we selected the candidate sentences in two phases, document retrieval and snippet retrieval, and the final answer was produced by a DistilBERT/ALBERT classifier that had been trained on the training data of BioASQ9b. Document retrieval was implemented as a standard search over the CORD-19 data using the search API provided by the BioASQ organisers, and snippet retrieval was implemented by re-ranking the sentences of the top retrieved documents using the cosine similarity between the question and each candidate sentence. We observed that sentence vectors obtained with sBERT have an edge over tf.idf vectors. BioASQ10b Phase B focuses on finding the specific answers to biomedical questions. For this task, we followed a data-centric approach. We hypothesised that the training data of the first BioASQ years might be biased, and we experimented with different subsets of the training data. We observed an improvement of results when the system was trained on the second half of the BioASQ10b training data.

Keywords
BioASQ, Synergy, query-focused summarisation, Biomedical, COVID-19, DistilBERT, sBERT, data-centric

1. Introduction
The BioASQ challenge1 organises shared tasks on biomedical semantic indexing and question answering. In this paper, we present Macquarie University's participation in several of these tasks.2

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
diego.molla-aliod@mq.edu.au (D. Mollá)
https://researchers.mq.edu.au/en/persons/diego-molla-aliod (D. Mollá)
ORCID: 0000-0003-4973-0963 (D. Mollá)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 http://www.bioasq.org/
2 Code related to this paper is available at https://github.com/dmollaaliod/bioasq10b-public and https://github.com/dmollaaliod/bioasq10-synergy-public.

Question: Orteronel was developed for treatment of which cancer?
Type: factoid
Snippet: Pooled-analysis was also performed, to assess the effectiveness of agents targeting the androgen axis via identical mechanisms of action (abiraterone acetate, orteronel).
Exact answer: castration-resistant prostate cancer
Ideal answer: Orteronel was developed for treatment of castration-resistant prostate cancer.
Figure 1: An example question with its question type, a relevant snippet, an exact answer, and a correct ideal answer, extracted from the training data of BioASQ10b.

The Synergy tasks aim to evaluate technologies useful for the development of an end-to-end question answering (QA) system for questions about COVID-19 asked by biomedical experts. In particular, the Synergy tasks evaluate the quality of document retrieval over a snapshot of CORD-19 [1], snippet retrieval, and the generation of "ideal answers" that may contain multiple sentences. We present our participation in the second BioASQ9 Synergy task, which ran between May and June 2021, and in the BioASQ10 Synergy task, which ran between December 2021 and February 2022.

Task B of BioASQ focuses on biomedical semantic QA. Similar to the Synergy tasks, several technologies corresponding to components of an end-to-end QA system are evaluated. In contrast with the Synergy tasks, Task B of BioASQ has two distinct phases. Phase A evaluates the quality of document and snippet retrieval on a snapshot of PubMed3, whereas Phase B, given a question, its question type ("summary", "factoid", "yesno", "list"), and a list of candidate snippets, evaluates the system's ability to find short answers ("exact answers") and long, possibly multi-sentence answers ("ideal answers"). Figure 1 shows an example of a question and its question type, a correct snippet for the question, a correct exact answer, and a correct ideal answer. We present our participation in Task B, Phase B of BioASQ10, which ran between March and May 2022 (henceforth BioASQ10b, Phase B).

All of our contributions to the above tasks are based on a common question-answering architecture, which we describe in Section 2. Section 3 presents our participation in the Synergy tasks. Section 4 presents our participation in BioASQ10b, Phase B. Finally, Section 5 concludes this paper.

2. Question Answering Architecture
The question-answering system that is the focus of our participation in all of the tasks presented in this paper is based on query-focused extractive summarisation. The architecture of the system is illustrated in Figure 2 and follows the classification set-up proposed by [2]. The query-focused summarisation system takes the question, a candidate sentence, and the sentence position4, and calculates a sentence score. The system computes the word embeddings of the question and candidate sentence using a BERT architecture [3]. In particular, for the BioASQ9 Synergy task 2 we used ALBERT [4], which was the best-performing model in [5]'s participation in BioASQ8b5. For the BioASQ10 Synergy task we used DistilBERT [6], which performed very well in [2]'s participation in BioASQ9b and even outperformed BioBERT [7]. For BioASQ10b, Phase B, we also used DistilBERT. Average pooling is then used to merge the word embeddings of the candidate sentence into a sentence embedding. The sentence position is concatenated to the sentence embedding, an additional intermediate dense layer is applied, and a final classification layer predicts the sentence score.

3 https://pubmed.ncbi.nlm.nih.gov/
4 The sentence position was incorporated as an absolute number: 1, 2, . . . 𝑛, where 𝑛 is the total number of input sentences. We chose to include the sentence position because earlier experiments in past BioASQ years showed an improvement of the results.

Figure 2: Architecture of the question answering system used for BioASQ9b, Phase B. (The diagram shows the question and candidate sentence entering BERT, mean pooling of the word embeddings into a sentence embedding, concatenation of the sentence position, a dense layer with relu activation, and a sigmoid output.)
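To make the architecture in Figure 2 concrete, the following is a minimal PyTorch sketch of the sentence scorer, assuming the Hugging Face transformers library and the 'distilbert-base-uncased' checkpoint used for DistilBERT. The class name and the size of the intermediate dense layer are illustrative, and for brevity the sketch mean-pools over all input tokens rather than over the candidate-sentence tokens only; it is not the exact implementation used in our runs.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class SentenceScorer(nn.Module):
    """Scores one candidate sentence for one question (illustrative sketch)."""

    def __init__(self, model_name="distilbert-base-uncased", hidden_size=100):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)
        for p in self.encoder.parameters():
            p.requires_grad = False            # the BERT weights stay frozen
        dim = self.encoder.config.hidden_size
        self.dense = nn.Linear(dim + 1, hidden_size)   # +1 for the position
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, question, sentence, position):
        # The question and the candidate sentence are passed as a text pair,
        # which the tokenizer encodes as "[CLS] question [SEP] sentence [SEP]".
        enc = self.tokenizer(question, sentence, truncation=True,
                             return_tensors="pt")
        tokens = self.encoder(**enc).last_hidden_state        # (1, length, dim)
        # Simplification: mean-pool over all tokens; the system described in
        # the text pools over the tokens of the candidate sentence only.
        sentence_emb = tokens.mean(dim=1)                      # (1, dim)
        pos = torch.tensor([[float(position)]])                # absolute position
        hidden = torch.relu(self.dense(torch.cat([sentence_emb, pos], dim=1)))
        return torch.sigmoid(self.classifier(hidden))          # sentence score
```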
The question and the sentence were fed to BERT in the same way as defined by the creators of BERT [3]. That is, the input consisted of an initial "[CLS]" token, followed by the question text, then a "[SEP]" token that indicates a new sentence, and finally the candidate sentence text. This information was passed to BERT, indicating the question and the candidate sentence as two separate text segments.

The classification labels used for training the system were automatically generated from the training data, based on the ROUGE score of the candidate sentence with respect to the annotated ideal answer. In particular, given a question, the top 5 sentences according to their ROUGE score were labelled as 1, and the rest were labelled as 0 (a sketch of this labelling procedure is given at the end of this section). For the Synergy tasks we used the BioASQ9b training data, whereas for BioASQ10b, Phase B, we used the BioASQ10b training data.

We used the pre-trained ALBERT and DistilBERT models available from Hugging Face6. These models were frozen during training, so that only the weights of the additional layers shown in Figure 2 were updated.

5 At the time of training the system for the BioASQ9 Synergy task 2, the final results of BioASQ9 had not been released yet.
6 https://huggingface.co/. For ALBERT, we used 'albert-xxlarge-v2'. For DistilBERT, we used 'distilbert-base-uncased'.
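As an illustration of the labelling procedure described above, the sketch below ranks the candidate sentences of a question by their ROUGE score against the ideal answer and labels the top five as positive. It assumes the rouge-score package and uses ROUGE-L as a stand-in, since the exact ROUGE variant used for labelling is not detailed in this paper.

```python
from rouge_score import rouge_scorer


def label_candidates(candidate_sentences, ideal_answer, top_k=5):
    """Label the top_k candidates (by ROUGE against the ideal answer) as 1,
    and every other candidate as 0."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [scorer.score(ideal_answer, sentence)["rougeL"].fmeasure
              for sentence in candidate_sentences]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positive = set(ranked[:top_k])
    return [1 if i in positive else 0 for i in range(len(candidate_sentences))]
```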
3. The Synergy Tasks
This section describes the systems that participated in the Synergy task 2 of BioASQ9 and in the Synergy task of BioASQ10 (in this paper, we use the collective expression "the Synergy tasks" to refer to these). The Synergy task 2 of BioASQ9 ran in 2021, but the results were not available at the time of the paper submission deadline for BioASQ9. For this reason, we describe that system in this paper.

Our participation in the Synergy tasks shares the same question answering system architecture described in Section 2. The only difference between the two Synergy tasks is, as mentioned in Section 2, that the BioASQ9 Synergy 2 system used ALBERT, whereas the BioASQ10 Synergy system used DistilBERT. In both cases, the system was trained with the training data of BioASQ9b.

To generate the candidate sentences required by the question answering system, we followed this procedure:
1. Retrieve the most relevant documents as described in Section 3.1;
2. Split the retrieved documents into sentences and select the candidate sentences as described in Section 3.2.

3.1. Document Retrieval
The relevant documents were retrieved using the search API provided by the organisers of the BioASQ Synergy task. This API is based on a Web service that accepts a query and returns a JSON data structure. We simply used the unmodified question as the search query. In subsequent work we are exploring pre-processing and fine-tuning steps to improve the quality of the document retrieval stage. The final runs submitted consist of the top 10 documents, after removing those that were in previous feedback, to conform with the submission requirements.

3.2. Snippet Retrieval
Every sentence from every retrieved document was a candidate snippet. This includes sentences from documents that were retrieved but were not submitted in the document retrieval runs. We then experimented with combinations along two dimensions to re-rank the candidate snippets, for a total of four different approaches (a sketch of these configurations is given at the end of this subsection). The first dimension was the calculation of the similarity between the question and the candidate snippet. We experimented with the following two options:

TfidfCosine. We represented the question and the candidate sentences using tf.idf. Each candidate sentence was then scored based on the cosine similarity between the question vector and the sentence vector.

sBERTCosine. We used sBERT [8] to represent the question and the candidate sentences, and to determine the similarities between the question and the sentences. We used the default set-up for sBERT, which computes the cosine similarity between the question vector and the sentence vector.

The second dimension was the criterion used for the final ranking of the candidate sentences. We experimented with local sorting and global sorting.

LocalSorting. For every relevant document, we extracted the top 3 sentences according to the cosine similarity approaches described above. The final list of sentences was composed of the top 3 sentences of the top document, followed by the top 3 sentences of the second document, and so on.

GlobalSorting. In contrast to the local sorting approach, all sentences of all documents were sorted according to their cosine similarity with the question, regardless of the document the snippets were obtained from.

The final runs submitted consist of the first 10 snippets, after removing those that were in previous feedback, to conform with the submission requirements.
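The four re-ranking configurations can be sketched as follows, assuming scikit-learn for the tf.idf variant and the sentence-transformers package for sBERT. The sBERT model name is an illustrative choice, and the retrieved documents are assumed to be given as lists of sentences in retrieval order.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, util

# Illustrative sBERT model; the text only states that the default sBERT
# set-up was used, without naming a specific pre-trained model.
sbert = SentenceTransformer("all-MiniLM-L6-v2")


def tfidf_scores(question, sentences):
    """TfidfCosine: cosine similarity between tf.idf vectors."""
    matrix = TfidfVectorizer().fit_transform([question] + sentences)
    return cosine_similarity(matrix[0:1], matrix[1:])[0]


def sbert_scores(question, sentences):
    """sBERTCosine: cosine similarity between sBERT sentence embeddings."""
    q = sbert.encode([question], convert_to_tensor=True)
    s = sbert.encode(sentences, convert_to_tensor=True)
    return util.cos_sim(q, s)[0].tolist()


def rank_snippets(question, docs, scores_fn=sbert_scores,
                  sorting="global", per_doc=3):
    """docs: the sentences of each retrieved document, in retrieval order.
    LocalSorting keeps the top per_doc sentences of each document, document by
    document; GlobalSorting sorts every sentence of every document by its
    similarity to the question."""
    per_document = [sorted(zip(sentences, scores_fn(question, sentences)),
                           key=lambda pair: -pair[1])
                    for sentences in docs]
    if sorting == "local":
        return [s for doc in per_document for s, _ in doc[:per_doc]]
    all_pairs = [pair for doc in per_document for pair in doc]
    return [s for s, _ in sorted(all_pairs, key=lambda pair: -pair[1])]
```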
3.3. Answer Generation
As mentioned above, the question, the candidate sentences, and the sentence positions were fed to the system illustrated in Figure 2. The sentence position was simply the unnormalised position of the sentence within the list of snippets, after the snippets had been ranked as described in Section 3.2. Given a question, the top-scoring 𝑛 sentences according to the scores produced by the QA system were combined to form the final answer. These sentences were presented in their order of appearance in the list of snippets. The value of 𝑛 depends on the question type and is shown in Table 1.

Table 1: Number of sentences 𝑛 selected for each question type.
Question type   Summary   Factoid   Yesno   List
n               6         2         2       3
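Combining the sentence scores with the values of Table 1, the answer-assembly step of this section reduces to the following sketch, where score_fn is a placeholder for the trained classifier of Section 2.

```python
# Number of sentences returned for each question type (Table 1).
N_BY_TYPE = {"summary": 6, "factoid": 2, "yesno": 2, "list": 3}


def assemble_answer(question, question_type, snippet_sentences, score_fn):
    """Return the n top-scoring sentences, in their order of appearance.
    score_fn(question, sentence, position) stands in for the classifier of
    Section 2; positions are absolute (1, 2, ..., n)."""
    n = N_BY_TYPE[question_type]
    scores = [score_fn(question, sentence, position)
              for position, sentence in enumerate(snippet_sentences, start=1)]
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:n]
    return " ".join(snippet_sentences[i] for i in sorted(top))
```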
3.4. Results of the Synergy Tasks
This section describes the results of the runs submitted to the Synergy tasks.

Table 2 shows the F1 score of the documents returned by our systems. As mentioned in Section 3.1, these documents were found by submitting the unmodified question as the query to the search API provided by the developers of the Synergy task. As expected, the results were poor relative to other submissions.

Table 2: Document retrieval results of the submissions to the BioASQ9 Synergy 2 (top) and BioASQ10 Synergy (bottom) tasks. Metric: F1. The rows labelled "Best", "Median", and "Worst" refer to the results of systems other than our own submitted to the challenge.
Run           Round 1   Round 2   Round 3   Round 4
Best          0.3693    0.2039    0.1327    0.1896
Median        0.2388    0.1423    0.0710    0.0800
Worst         0.0157    0.0067    0.0053    0.0175
MQ-BioASQ9    0.1978    0.1087    0.0483    0.0800
Best          0.3220    0.2221    0.1970    0.1564
Median        0.3100    0.1646    0.1327    0.1067
Worst         0.2729    0.1003    0.0655    0.0478
MQ-BioASQ10   -         0.1003    0.0754    0.0808

Table 3 shows the F1 score of the snippets returned by our runs. For each run, we indicate the run name, the type of similarity used, and the type of sorting performed. We observe that, considering the poor quality of the documents retrieved, the snippets were of a quality comparable to that of the other runs of the BioASQ9 Synergy 2 task (though not of the BioASQ10 Synergy task), but there is room for improvement. Among our runs, the most successful configuration used sBERT cosine similarity and global sorting.

Table 3: Snippet retrieval results of the submissions to the BioASQ9 Synergy 2 (top) and BioASQ10 Synergy (bottom) tasks. Metric: F1. The best of our systems in each round is marked with an asterisk. The rows labelled "Best", "Median", and "Worst" refer to the results of systems other than our own submitted to the challenge.
Run             Similarity   Sorting   Round 1   Round 2   Round 3   Round 4
Best                                   0.3290    0.1726    0.1262    0.1355
Median                                 0.2288    0.1365    0.0732    0.0764
Worst                                  0.0311    0.0101    0.0231    0.0132
MQ-1-BioASQ9    tfidf        local     0.1031    0.1035    0.0707    0.0764*
MQ-2-BioASQ9    tfidf        global    0.1100    0.0540    0.0324    0.0619
MQ-3-BioASQ9    sBERT        local     0.1071    0.0999    0.0692    0.0749
MQ-4-BioASQ9    sBERT        global    0.1923*   0.1075*   0.1044*   0.0762
Best                                   0.2910    0.1525    0.1574    0.1217
Median                                 0.2757    0.1410    0.1087    0.0948
Worst                                  0.2296    0.0540    0.0273    0.0416
MQ-1-BioASQ10   tfidf        local     -         0.0660    0.0465    0.0771
MQ-2-BioASQ10   tfidf        global    -         0.0540    0.0273    0.0416
MQ-3-BioASQ10   sBERT        local     -         0.0683    0.0457    0.0770
MQ-4-BioASQ10   sBERT        global    -         0.0928*   0.0725*   0.0827*

Table 4 shows the human evaluation results of the ideal answers returned by our runs. Our runs are very competitive, especially given the relatively poor quality of the input snippets. Given that the input snippets were of poor quality in all of our runs, it is dangerous to generalise about how the quality of the snippets affects the quality of the answers. Having said that, we can observe that, in the BioASQ9 Synergy 2 task, the runs that generated the best snippets (MQ-4) did not lead to the best ideal answers. The impact of, and interplay between, the document and snippet retrieval stages and the question-answering stage deserves further exploration.

Table 4: Ideal answer results of the submissions to the BioASQ9 Synergy 2 (top) and BioASQ10 Synergy (bottom) tasks. Metric: average of human evaluation scores. The best of our systems in each round is marked with an asterisk. The rows labelled "Best", "Median", and "Worst" refer to the results of systems other than our own submitted to the challenge.
Run             Similarity   Sorting   Round 1   Round 2   Round 3   Round 4
Best                                   4.375     3.850     3.630     3.295
Median                                 3.625     3.100     3.450     3.045
Worst                                  1.042     0.450     3.290     2.060
MQ-1-BioASQ9    tfidf        local     3.250     3.100     3.450     3.045
MQ-2-BioASQ9    tfidf        global    3.210     3.075     3.290     3.295*
MQ-3-BioASQ9    sBERT        local     3.372*    3.250*    3.520*    3.067
MQ-4-BioASQ9    sBERT        global    2.250     3.025     3.490     3.292
Best                                   3.790     3.810     3.562     3.180
Median                                 3.367     3.160     3.250     2.617
Worst                                  3.287     1.550     0.827     0.372
MQ-1-BioASQ10   tfidf        local     -         3.270     3.415     2.617
MQ-2-BioASQ10   tfidf        global    -         3.160     3.305     2.990*
MQ-3-BioASQ10   sBERT        local     -         3.360     3.517     2.925
MQ-4-BioASQ10   sBERT        global    -         3.490*    3.547*    2.690

4. BioASQ10b, Phase B
For BioASQ10b, Phase B, we used the question answering system described in Section 2, with DistilBERT as the BERT variant used to compute the word embeddings. Following a data-centric approach, the main difference between the Synergy tasks and BioASQ10b, Phase B, is the choice of training data. We hypothesised that the training data corresponding to the early years of BioASQ, that is, the first samples of the BioASQ10b training data, might be biased. We therefore tested the use of different portions of the training data, as shown in Table 5, by incrementally removing the first samples of the training data. We can observe that the best evaluation results are obtained with only 50% of the training data.

Table 5: Results of 10-fold cross-validation after removing the first samples of the BioASQ10b training data. Metric: average ROUGE-SU4 F1. The best result is marked with an asterisk.
Percentage removed   ROUGE-SU4 F1
10%                  0.281
20%                  0.288
30%                  0.298
40%                  0.309
50%                  0.311*
60%                  0.308

To double-check that the first samples of the training data are indeed biased, we conducted another round of experiments, this time removing the last samples of the training data. Table 6 shows that the results worsen as the amount of training data diminishes, as one might expect from a system based on supervised machine learning.

Table 6: Results of 10-fold cross-validation after removing the last samples of the BioASQ10b training data. Metric: average ROUGE-SU4 F1.
Percentage removed   ROUGE-SU4 F1
10%                  0.275
20%                  0.268
30%                  0.270
40%                  0.255
50%                  0.241
60%                  0.229

A hyperparameter search showed that the same hyperparameters give optimal results when training on the entire training data and when training on only 50% of it: dropout=0.6 and number of epochs=1.

4.1. Submission Results to BioASQ10b, Phase B
Table 7 shows the results of our submissions to BioASQ10b, Phase B7. Note that the results reported on the BioASQ website8 may change in the future, after the test data is potentially enriched with further annotations. Our runs are comparable to the median of the other participating systems. Surprisingly, there is little difference between using all of the training data and using only the latter 50%. When we visually inspected the outputs of the runs, we noticed that the outputs of all runs in each batch were virtually identical, with only a few differences.

7 At the time of writing, only the automated evaluation results were available.
8 http://bioasq.org

Table 7: Preliminary results of the submissions to BioASQ10b, Phase B. Metric: ROUGE-SU4 F1. The best of our systems in each batch is marked with an asterisk. The rows labelled "Best", "Median", and "Worst" refer to the results of all systems, including our own, submitted to the challenge.
Run     Training data              Batch 1   Batch 2   Batch 3   Batch 4   Batch 5   Batch 6
Best                               0.3715    0.4168    0.3689    0.4165    0.3916    0.1705
Median                             0.3339    0.3521    0.3387    0.3556    0.3389    0.1581
Worst
MQ-1    All BioASQ10b              0.3490*   0.3484*   0.3344*   0.3525    0.3415    0.1581
MQ-2    Last 50% of BioASQ10b      0.3339    0.3480    0.3316    0.3556*   0.3431*   0.1640*
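The training-data splits used above amount to simple slicing of the BioASQ10b training set, whose questions are ordered from the oldest to the most recent BioASQ years. The sketch below illustrates both directions of the ablation; the file name and the assumption that the file follows the usual BioASQ JSON layout ({"questions": [...]}) are ours.

```python
import json


def load_questions(path):
    """Load a BioASQ-style training file of the form {"questions": [...]}."""
    with open(path) as f:
        return json.load(f)["questions"]


def remove_first(questions, fraction):
    """Drop the oldest samples, which appear first in the training file
    (the setting of Table 5 and of run MQ-2)."""
    return questions[int(len(questions) * fraction):]


def remove_last(questions, fraction):
    """Drop the most recent samples (the control experiment of Table 6)."""
    keep = len(questions) - int(len(questions) * fraction)
    return questions[:keep]


# For example, the training set of run MQ-2 ("Last 50% of BioASQ10b") would be:
# remove_first(load_questions("training10b.json"), 0.5)   # file name assumed
```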
5. Summary and Conclusions
We have presented Macquarie University's contribution to the BioASQ9 Synergy task 2, the BioASQ10 Synergy task, and BioASQ10b, Phase B (Ideal Answers). In all of our runs, the base question answering architecture was virtually the same, the only differences being the choice of DistilBERT vs. ALBERT and the training data used.

For the Synergy tasks, we used a system that had been trained on the BioASQ9b training data. We experimented with approaches for snippet retrieval along two dimensions: the vectors used for the similarity comparison, and the final ranking approach. Cosine similarity using sBERT gave the best results, and we observed that the best snippets in the snippet retrieval task did not always lead to the best answers in the question answering task. Overall, the results of the question answering parts were competitive, especially given the relatively poor quality of the documents and snippets retrieved. We will investigate approaches to increase the quality of the retrieval stages, and explore the relation between the quality of retrieval and the quality of the final answers.

For the BioASQ10b, Phase B task, we followed a data-centric approach and experimented with training regimes that incrementally removed samples from the training data. During our preliminary cross-validation experiments we observed an improvement of results when using only the latter 50% of the training data, but this difference vanished in the submitted runs. With a data-centric approach in mind, we plan to conduct further experiments that test the impact of changes and transformations of the training data. For example, besides further examining the impact of using portions of the training data, we will investigate the use of data augmentation techniques.

Acknowledgments
This research was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government.

References
[1] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Burdick, D. Eide, K. Funk, Y. Katsis, R. M. Kinney, Y. Li, Z. Liu, W. Merrill, P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. D. Wade, K. Wang, N. X. R. Wang, C. Wilhelm, B. Xie, D. M. Raymond, D. S. Weld, O. Etzioni, S. Kohlmeier, CORD-19: The COVID-19 open research dataset, in: Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Association for Computational Linguistics, Online, 2020. URL: https://www.aclweb.org/anthology/2020.nlpcovid19-acl.1.
[2] D. Mollá, U. Khanna, D. Galat, V. Nguyen, M. Rybinski, Query-focused extractive summarisation for finding ideal answers to biomedical and COVID-19 questions, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Working Notes of CLEF 2021, Conference and Labs of the Evaluation Forum, Bucharest, 2021. URL: http://ceur-ws.org/Vol-2936//paper-20.pdf.
[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[4] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, in: Proceedings of the 8th International Conference on Learning Representations, Virtual, 2020. URL: https://iclr.cc/virtual_2020/poster_H1eA7AEtvS.html.
[5] D. Mollá, C. Jones, V. Nguyen, Query-focused multi-document summarisation of biomedical texts, in: L. Cappellato, C. Eickhoff, N. Ferro (Eds.), Working Notes of CLEF 2020, Conference and Labs of the Evaluation Forum, Thessaloniki, 2020. URL: http://ceur-ws.org/Vol-2696/paper_119.pdf.
[6] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, in: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019. arXiv:1910.01108.
[7] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2019) 1234–1240. doi:10.1093/bioinformatics/btz682.
[8] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Hong Kong, 2019, pp. 3982–3992. URL: https://www.aclweb.org/anthology/D19-1410/.