<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Query-focused Extractive Summarisation for Biomedical and COVID-19 Complex Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diego Mollá</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Macquarie University</institution>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents Macquarie University's participation in the two most recent BioASQ Synergy Tasks (as of June 2022), and in BioASQ10 Task B (BioASQ10b), Phase B. In these tasks, participating systems are expected to generate complex answers to biomedical questions, where the answers may contain more than one sentence. We apply query-focused extractive summarisation techniques. In particular, we follow a sentence classification-based approach that scores each candidate sentence associated with a question, and the highest-scoring sentences are returned as the answer. The Synergy Task corresponds to an end-to-end system that requires document selection, snippet selection, and finding the final answer, but it has very limited training data. For the Synergy task, we selected the candidate sentences in two phases, document retrieval and snippet retrieval, and the final answer was found by using a DistilBERT/ALBERT classifier that had been trained on the training data of BioASQ9b. Document retrieval was achieved as a standard search over the CORD-19 data using the search API provided by the BioASQ organisers, and snippet retrieval was achieved by re-ranking the sentences of the top retrieved documents, using the cosine similarity of the question and candidate sentence. We observed that vectors represented via sBERT have an edge over tf.idf. BioASQ10b Phase B focuses on finding the specific answers to biomedical questions. For this task, we followed a data-centric approach. We hypothesised that the training data of the first BioASQ years might be biased and we experimented with different subsets of the training data. We observed an improvement of results when the system was trained on the second half of the BioASQ10b training data.</p>
      </abstract>
      <kwd-group>
        <kwd>BioASQ</kwd>
        <kwd>Synergy</kwd>
        <kwd>query-focused summarisation</kwd>
        <kwd>Biomedical</kwd>
        <kwd>COVID-19</kwd>
        <kwd>DistilBERT</kwd>
        <kwd>sBERT</kwd>
        <kwd>data-centric</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Question</title>
        <p>Orteronel was developed for treatment of which cancer?</p>
      </sec>
      <sec id="sec-1-2">
        <title>Type</title>
        <p>factoid</p>
      </sec>
      <sec id="sec-1-3">
        <title>Snippet</title>
        <p>Pooled-analysis was also performed, to assess the effectiveness of agents targeting the androgen axis via identical mechanisms of action (abiraterone acetate, orteronel).</p>
      </sec>
      <sec id="sec-1-4">
        <title>Exact answer</title>
        <p>castration-resistant prostate cancer</p>
      </sec>
      <sec id="sec-1-5">
        <title>Ideal answer</title>
        <p>Orteronel was developed for treatment of castration-resistant prostate cancer.</p>
        <p>
          The Synergy tasks aim to evaluate technologies useful for the development of an end-to-end
question answering (QA) system for questions about COVID-19 asked by biomedical experts.
In particular, the Synergy tasks evaluate the quality of document retrieval over a snapshot of
CORD-19 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], snippet retrieval, and the generation of “ideal answers” that may contain multiple
sentences. We present our participation in the second BioASQ9 Synergy task that ran between
May and June 2021, and the BioASQ10 Synergy task that ran between December 2021 and
February 2022.
        </p>
        <p>Task B of BioASQ focuses on biomedical semantic QA. Similar to the Synergy tasks, several
technologies corresponding to components of an end-to-end QA system are evaluated. In
contrast with the Synergy tasks, Task B of BioASQ has two distinct phases. Phase A evaluates
the quality of document and snippet retrieval on a snapshot of PubMed<sup>3</sup>, whereas Phase B,
given a question, its question type (“summary”, “factoid”, “yesno”, “list”), and a list of candidate
snippets, evaluates the system’s ability to find short answers (“exact answers”) and long, possibly
multi-sentence answers (“ideal answers”). Figure 1 shows an example of a question and its
question type, a correct snippet for the question, a correct exact answer, and a correct ideal
answer. We present our participation in Task B, Phase B of BioASQ10, which ran between March
and May 2022 (henceforth BioASQ10b, Phase B).</p>
        <p>All of our contributions to the above tasks are based on a common question-answering
architecture that we will describe in Section 2. Section 3 presents our participation in the
Synergy tasks. Section 4 presents our participation in BioASQ10b, Phase B. Finally, Section 5
concludes this paper.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Question Answering Architecture</title>
      <p>
        The question-answering system used in all of the tasks presented
in this paper is based on query-focused extractive summarisation. The architecture of the
system is illustrated in Figure 2, and follows the classification setup proposed by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The query-focused summarisation system takes the question, a candidate sentence, and the
sentence position<sup>4</sup>, and calculates a sentence score.</p>
      <p><sup>3</sup> https://pubmed.ncbi.nlm.nih.gov/</p>
      <p><sup>4</sup> The sentence position was incorporated as an absolute number: 1, 2, ..., n, where n is the total number of
input sentences. We chose to include the sentence position as earlier experiments in past BioASQ years showed an
improvement of the results.</p>
      <sec id="sec-2-1">
        <title>Figure 2</title>
        <p>[Figure 2: architecture of the question answering system. Labels recovered from the figure: question, sentence, sentence position, BERT, word embeddings, Mean, sentence embeddings, relu, sigmoid.]</p>
      </sec>
      <sec id="sec-2-2">
        <p>
          The system computes the word embeddings
of the question and candidate sentence using a BERT architecture [3]. In particular, for the
BioASQ9 Synergy task 2 we used ALBERT [4], which was the best-performing system in [5]’s
participation in BioASQ8b<sup>5</sup>. For the BioASQ10 Synergy task, we used DistilBERT [6], which
performed very well in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]’s participation in BioASQ9b, and even outperformed BioBERT [7]. For
BioASQ10b, Phase B, we also used DistilBERT. Average pooling is then used to merge the word
embeddings of the candidate sentence into the sentence embeddings. The sentence position is
then concatenated to the sentence embeddings, and an additional intermediate dense layer is
added. A final classification layer predicts the sentence score.
        </p>
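        <p>The layers that follow BERT can be sketched as below. This is a minimal illustration in plain Python, assuming that the word embeddings of the candidate sentence have already been computed by the frozen BERT model. The function names, the toy weights and the layer sizes are hypothetical and are not taken from the submitted system, and dropout is omitted.</p>
        <preformat>
```python
import math

def mean_pool(word_embeddings):
    """Average a list of equal-length word vectors into one sentence vector."""
    n = len(word_embeddings)
    dim = len(word_embeddings[0])
    return [sum(vec[i] for vec in word_embeddings) / n for i in range(dim)]

def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def sentence_score(word_embeddings, position, hidden_w, hidden_b, out_w, out_b):
    """Score one candidate sentence from its word embeddings and its position."""
    # Sentence embedding (average pooling) concatenated with the sentence position.
    features = mean_pool(word_embeddings) + [float(position)]
    # Intermediate dense layer with relu, then a single sigmoid output unit.
    hidden = relu(dense(features, hidden_w, hidden_b))
    return sigmoid(dense(hidden, [out_w], [out_b])[0])

# Toy example: two 2-dimensional word vectors, sentence position 1.
toy_embeddings = [[0.2, 0.4], [0.6, 0.0]]
score = sentence_score(
    toy_embeddings, 1,
    hidden_w=[[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]],  # hypothetical weights
    hidden_b=[0.0, 0.0],
    out_w=[1.0, 1.0],
    out_b=0.0,
)
```
        </preformat>
        <p>In the actual system only the weights of these additional layers are learnt, since the BERT model is kept frozen.</p>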
        <p>The question and the sentence were fed to BERT in the same way as defined by the creators
of BERT [3]. That is, the input consisted of an initial “[CLS]” token, followed by the question
text, then a “[SEP]” token that indicates a new sentence, and finally the candidate sentence text.
This information was passed to BERT, indicating the question and the candidate sentence as
two separate text segments.</p>
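        <p>As an illustration of this input format, the sketch below assembles the token sequence and the segment identifiers for one question and one candidate sentence. It is a simplification under stated assumptions: whitespace splitting stands in for the WordPiece tokeniser, the closing "[SEP]" token follows the usual BERT convention for sentence pairs, and the function name is hypothetical.</p>
        <preformat>
```python
def build_bert_input(question, sentence):
    """Return (tokens, segment_ids) for a question/candidate-sentence pair.

    Whitespace splitting is an assumption made for illustration only;
    a real BERT tokeniser splits text into WordPiece units.
    """
    question_tokens = question.split()
    sentence_tokens = sentence.split()
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + sentence_tokens + ["[SEP]"]
    # Segment 0 covers "[CLS]", the question and the first "[SEP]";
    # segment 1 covers the candidate sentence and the final "[SEP]".
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(sentence_tokens) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_bert_input(
    "Orteronel was developed for treatment of which cancer?",
    "Orteronel was developed for treatment of castration-resistant prostate cancer.",
)
```
        </preformat>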
        <p>The classification labels used for training the system were automatically generated from
the training data, based on the ROUGE score of the candidate sentence with respect to the
annotated ideal answer. In particular, given a particular question, the top 5 sentences according
to their ROUGE score were labelled as 1, and the rest were labelled as 0. For the Synergy tasks
we used the BioASQ9b training data, whereas for BioASQ10b, Phase B, we used the BioASQ10b
training data.</p>
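        <p>The labelling step can be sketched as follows. The example assumes that the ROUGE score of every candidate sentence of a question has already been computed; the function name, the tie-breaking rule and the scores are hypothetical.</p>
        <preformat>
```python
def label_sentences(rouge_scores, top_k=5):
    """Label the top_k highest-scoring sentences as 1 and the rest as 0.

    rouge_scores holds one precomputed ROUGE score per candidate sentence of a
    single question; ties are broken in favour of the earlier sentence.
    """
    ranked = sorted(range(len(rouge_scores)), key=lambda i: (-rouge_scores[i], i))
    positives = set(ranked[:top_k])
    return [1 if i in positives else 0 for i in range(len(rouge_scores))]

# Hypothetical ROUGE scores for seven candidate sentences of one question.
labels = label_sentences([0.10, 0.42, 0.05, 0.33, 0.27, 0.51, 0.18])
```
        </preformat>
        <p>With these scores, the two lowest-scoring sentences (the first and the third) receive label 0 and the remaining five receive label 1.</p>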
        <p>We used the pre-trained ALBERT and DistilBERT models made available by Hugging Face<sup>6</sup>. These
models were frozen during training, so that only the weights of the additional layers shown in
Figure 2 were updated.</p>
        <p><sup>5</sup> At the time of training the system for the BioASQ9 Synergy task 2, the final results of BioASQ9 had not been
released yet.</p>
        <p><sup>6</sup> https://huggingface.co/ — For ALBERT, we used ‘albert-xxlarge-v2’. For DistilBERT, we used
‘distilbert-base-uncased’.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The Synergy Tasks</title>
      <p>This section describes the systems that participated in the Synergy task 2 of BioASQ9, and the
Synergy task of BioASQ10 (in this paper, we will use the collective expression “the Synergy
tasks” to refer to these). The Synergy task 2 of BioASQ9 ran in 2021 but the results were not
made available at the time of the paper submission deadline for BioASQ9. For this reason, we
are describing the system in this paper.</p>
      <p>Our systems for the two Synergy tasks share the question answering architecture
described in Section 2. The only difference between the two Synergy tasks is, as mentioned
in Section 2, that the BioASQ9 Synergy 2 system used ALBERT, whereas the BioASQ10
Synergy system used DistilBERT. In both cases, the system was trained with the training data of
BioASQ9b.</p>
      <p>To generate the candidate sentences required by the question answering system, we followed
this procedure:</p>
      <list list-type="order">
        <list-item><p>Retrieve the most relevant documents as described in Section 3.1;</p></list-item>
        <list-item><p>Split the retrieved documents into sentences and select the candidate sentences as
described in Section 3.2.</p></list-item>
      </list>
      <p><bold>3.1. Document Retrieval</bold></p>
      <p>The relevant documents were retrieved using the search API provided by the organisers of the
BioASQ Synergy task. This API is based on a Web service that accepts a query and returns a JSON
data structure. We simply used the unmodified question as the search query. In subsequent work
we are exploring pre-processing and fine-tuning steps to improve the quality of the Document
Retrieval stage.</p>
      <p>The final runs submitted consist of the top 10 documents, after removing those that were in
previous feedback, to conform with the submission requirements.</p>
      <p><bold>3.2. Snippet Retrieval</bold></p>
      <p>Every sentence from every retrieved document was a candidate snippet. This includes sentences
from documents that were retrieved but were not submitted in the Document Retrieval runs.
We then experimented with the combination of two dimensions to re-rank the candidate snippets,
for a total of four different approaches.</p>
      <p>The first dimension was based on the calculation of the similarity between the question and
candidate snippet. We experimented with the following two options:</p>
      <p>TfidfCosine. We represented the question and candidate sentences using tf.idf. Each
candidate sentence was then scored based on the cosine similarity between the question vector and
the sentence vector.</p>
      <p>sBERTCosine. We used sBERT [8] to represent the question and the candidate sentences,
and to determine the similarities between the question and the sentences. We used the default
setup for sBERT, which computes the cosine similarity between the question vector and the
sentence vector.</p>
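      <p>The TfidfCosine option can be illustrated with the sketch below, written in plain Python. The tokenisation (lower-casing and whitespace splitting), the smoothed idf weighting and the choice of fitting the vocabulary on the question together with its candidate sentences are all assumptions made for this example; the submitted runs may have used a different tf.idf variant. The function names are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build sparse tf.idf vectors (dicts) for a list of whitespace-tokenised texts."""
    tokenised = [text.lower().split() for text in texts]
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    n_docs = len(texts)
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed idf (an assumption); other weighting variants are equally plausible.
        vectors.append({term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
                        for term, tf in counts.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def score_snippets(question, sentences):
    """Return one cosine similarity per candidate sentence."""
    vectors = tfidf_vectors([question] + sentences)
    return [cosine(vectors[0], vec) for vec in vectors[1:]]

scores = score_snippets(
    "which cancer is treated with orteronel",
    ["orteronel targets prostate cancer", "the weather was mild in june"],
)
```
      </preformat>
      <p>In this toy example the first sentence shares two terms with the question and obtains a positive score, whereas the second sentence shares none and scores zero. The sBERTCosine option replaces the tf.idf vectors with sBERT sentence embeddings and keeps the cosine comparison.</p>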
      <p>The second dimension was based on the criteria used for the final ranking of the candidate
sentences. We experimented with local sorting and global sorting.</p>
      <p>LocalSorting. For every relevant document, we extracted the top 3 sentences according to
the cosine similarity approaches described above. The final list of sentences was composed
of the top 3 sentences from the top document, followed by the top 3 sentences of the second
document, and so on.</p>
      <p>GlobalSorting. In contrast to the local sorting approach, all sentences of all documents were
now sorted according to their cosine similarity with the question, regardless of what document
the snippets were obtained from.</p>
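      <p>The difference between the two ranking criteria can be sketched as follows. The data structure (a list of documents in retrieval order, each a list of sentence and similarity pairs), the function names and the similarity values are hypothetical, and the removal of snippets that appeared in previous feedback is left out.</p>
      <preformat>
```python
def local_sorting(documents, per_document=3):
    """Top sentences of each document, with documents kept in retrieval order.

    documents is a list (in retrieval rank order) of lists of
    (sentence, similarity) pairs.
    """
    ranked = []
    for sentences in documents:
        best = sorted(sentences, key=lambda pair: pair[1], reverse=True)
        ranked.extend(best[:per_document])
    return ranked

def global_sorting(documents):
    """All sentences of all documents, sorted by similarity alone."""
    pooled = [pair for sentences in documents for pair in sentences]
    return sorted(pooled, key=lambda pair: pair[1], reverse=True)

# Hypothetical similarities for two retrieved documents.
docs = [
    [("d1-s1", 0.20), ("d1-s2", 0.70), ("d1-s3", 0.10), ("d1-s4", 0.40)],
    [("d2-s1", 0.90), ("d2-s2", 0.30)],
]
# Keep at most the first 10 snippets of each ranking.
local_top = [name for name, _ in local_sorting(docs)][:10]
global_top = [name for name, _ in global_sorting(docs)][:10]
```
      </preformat>
      <p>With local sorting, the best sentence of the second document can never be ranked above the top three sentences of the first document; with global sorting it comes first because it has the highest similarity.</p>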
      <p>The final runs submitted consist of the first 10 snippets, after removing those that were in
previous feedback, to conform with the submission requirements.</p>
      <p><bold>3.3. Answer Generation</bold></p>
      <p>As mentioned above, the question, candidate sentences, and sentence position were fed to the
system illustrated in Figure 2. The sentence position was simply the unnormalised position of
the sentence within the list of snippets, after the snippets had been ranked as described in
Section 3.2. Given a question, the top-scoring n sentences according to the scores produced by
the QA system were combined to form the final answer. These sentences were presented in the
order of appearance in the list of snippets. The value of n was based on the question type and
is shown in Table 1.</p>
      <p><bold>3.4. Results of the Synergy Tasks</bold></p>
      <p>This section describes the results of the runs submitted to the Synergy tasks.</p>
      <p>Table 2 shows the F1 score of the documents returned by our systems. As mentioned in
Section 3.1, these documents were found by submitting the unmodified question as the query
to the search API provided by the developers of the Synergy task. As expected, the results were
poor relative to other submissions.</p>
      <p>Table 3 shows the F1 score of the snippets returned by our runs. For each run, we indicate
the run name, the type of similarity used, and the type of sorting performed. We observe that,
considering the poor quality of the documents retrieved, the snippets were of quality comparable
to that of other runs of the BioASQ9 Synergy 2 task (but not the runs of the BioASQ10 Synergy
task), but there is room for improvement. Among our runs, the most successful configuration
was using sBERT cosine similarity and global sort.</p>
      <p>Table 4 shows the human evaluation results of the ideal answers returned by our runs. Our
runs are very competitive, especially given the relatively poor quality of the input snippets. Given
the poor quality of the input snippets in all of our runs, it is dangerous to make generalisations
about how the quality of the snippets affects the quality of the answers. Having said that, we
can observe that, in the BioASQ9 Synergy 2 task, the run that generated the best snippets
(MQ-4) did not lead to the best ideal answers. The impact of and interplay between
the document and snippet retrieval stages, and the question-answering stage, deserve further
exploration.</p>
    </sec>
    <sec id="sec-4">
      <title>4. BioASQ10b, Phase B</title>
      <p>For BioASQ10b, Phase B, we used the question answering system described in Section 2, using
DistilBERT as the BERT variant chosen to compute the word embeddings. Following a
data-centric approach, the main difference between the Synergy tasks and BioASQ10b, Phase B, is
the choice of training data. We hypothesised that the training data that corresponds to the
early years of BioASQ, that is, the first samples of the BioASQ10b training data, might be
biased. We therefore tested the use of different portions of the training data as shown in Table 5,
by incrementally removing the first samples of the training data. We can observe that the best
evaluation results are obtained with only 50% of the training data.</p>
      <p>To double-check that indeed the first samples of the training data are biased, we conducted
another round of experiments, but this time removing the last samples of the training data.
Table 6 shows that results worsen as the amount of training data diminishes, as one might
expect in systems that are based on supervised approaches to machine learning.</p>
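      <p>The two families of training subsets used in these experiments can be sketched as below. The sketch assumes that the training samples are stored in chronological order, oldest first; the function names and the placeholder question identifiers are hypothetical.</p>
      <preformat>
```python
def drop_first_fraction(samples, keep_fraction):
    """Keep only the last keep_fraction of the training samples, in order."""
    n_keep = round(len(samples) * keep_fraction)
    return samples[len(samples) - n_keep:]

def drop_last_fraction(samples, keep_fraction):
    """Keep only the first keep_fraction of the training samples, in order."""
    n_keep = round(len(samples) * keep_fraction)
    return samples[:n_keep]

# Ten placeholder question identifiers, oldest first.
questions = ["q%d" % i for i in range(1, 11)]
latter_half = drop_first_fraction(questions, 0.5)
first_half = drop_last_fraction(questions, 0.5)
```
      </preformat>
      <p>Here latter_half corresponds to the setting that removes the earliest samples, and first_half to the control setting that removes the most recent ones.</p>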
      <p>Hyperparameter search showed that the same hyperparameters give optimal results when
training using the entire training data, or using only 50% of the training data: dropout=0.6,
number of epochs=1.</p>
      <p><bold>4.1. Submission Results to BioASQ10b, Phase B</bold></p>
      <p>Table 7 shows the results of our submissions to BioASQ10b, Phase B<sup>7</sup>. Note that the results
reported on the BioASQ website<sup>8</sup> may change in the future after the test data is potentially
enriched with further annotations.</p>
      <p>Our runs are comparable to the median of those of other participating systems. Surprisingly,
there is little difference between using all training data or only the latter 50%. When we visually
inspected the outputs of the runs, we noticed that the outputs of all runs in each batch were
virtually identical, with only a few differences.</p>
      <p><sup>7</sup> At the time of writing, only the automated evaluation results were available.</p>
      <p><sup>8</sup> http://bioasq.org</p>
    </sec>
    <sec id="sec-5">
      <title>5. Summary and Conclusions</title>
      <p>We have presented Macquarie University’s contribution to the BioASQ9 Synergy task 2, the
BioASQ10 Synergy task, and BioASQ10b, Phase B (Ideal Answers). In all of our runs, the base
question answering architecture was virtually the same, the only differences being the choice
of DistilBERT vs. ALBERT, and the training data used.</p>
      <p>For the Synergy tasks, we used a system that had been trained using BioASQ9b training data.
We experimented with approaches for snippet retrieval based on two dimensions: vectors used
for similarity comparison, and final ranking approach. Cosine similarity using sBERT gave the
best results, and we observed that the best snippets in the snippet retrieval task did not always
lead to the best answers in the question answering task.</p>
      <p>Overall, the results of the question answering parts were competitive, especially given the
relatively poor quality of the documents and snippets retrieved. We will investigate approaches
to increase the quality of the retrieval stages, and explore the relation between quality of
retrieval vs. quality of final answers.</p>
      <p>For the BioASQ10b, Phase B task, we followed a data-centric approach and experimented
with training regimes that incrementally removed samples from the training data. During our
preliminary cross-validation experiments we observed an improvement of results using only
the latter 50% of the training data, but this difference in results vanished in the submitted runs.</p>
      <p>With a data-centric approach in mind, we plan to conduct further experiments that test
the impact of changes and transformations of the training data. For example, besides further
examining the impact of using portions of the training data, we will investigate the use of data
augmentation techniques.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was undertaken with the assistance of resources and services from the National
Computational Infrastructure (NCI), which is supported by the Australian Government.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>L. L.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lo</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Chandrasekhar</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Reas</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Burdick</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Eide</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Funk</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Katsis</surname></string-name>,
          <string-name><given-names>R. M.</given-names> <surname>Kinney</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Merrill</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Mooney</surname></string-name>,
          <string-name><given-names>D. A.</given-names> <surname>Murdick</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Rishi</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Sheehan</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Shen</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Stilson</surname></string-name>,
          <string-name><given-names>A. D.</given-names> <surname>Wade</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>N. X. R.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Wilhelm</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Xie</surname></string-name>,
          <string-name><given-names>D. M.</given-names> <surname>Raymond</surname></string-name>,
          <string-name><given-names>D. S.</given-names> <surname>Weld</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Etzioni</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Kohlmeier</surname></string-name>,
          <article-title>CORD-19: The COVID-19 open research dataset</article-title>,
          in: <source>Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020</source>,
          Association for Computational Linguistics, Online,
          <year>2020</year>. URL: https://www.aclweb.org/anthology/2020.nlpcovid19-acl.1.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>D.</given-names> <surname>Mollá</surname></string-name>,
          <string-name><given-names>U.</given-names> <surname>Khanna</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Galat</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Nguyen</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Rybinski</surname></string-name>,
          <article-title>Query-focused extractive summarisation for finding ideal answers to biomedical and COVID-19 questions</article-title>,
          in: G. Faggioli,
          <string-name><given-names>N.</given-names> <surname>Ferro</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Joly</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Maistro</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Piroi</surname></string-name>
          (Eds.), Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, 2021.
          URL: http://ceur-ws.org/Vol-2936//paper-20.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, in: Proceedings of the 8th International Conference on Learning Representations, Virtual, 2020. URL: https://iclr.cc/virtual_2020/poster_H1eA7AEtvS.html.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] D. Mollá, C. Jones, V. Nguyen, Query-focused multi-document summarisation of biomedical texts, in: L. Cappellato, C. Eickhoff, N. Ferraro (Eds.), Working Notes of CLEF 2020 — Conference and Labs of the Evaluation Forum, Thessaloniki, 2020. URL: http://ceur-ws.org/Vol-2696/paper_119.pdf.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, in: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019. arXiv:1910.01108.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2019) 1234–1240. doi:10.1093/bioinformatics/btz682.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Hong Kong, 2019, pp. 3982–3992. URL: https://www.aclweb.org/anthology/D19-1410/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>