<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Cochrane Systematic Literature Reviews for Prospective Evaluation of Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wojciech Kusa</string-name>
          <email>wojciech.kusa@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harrisen Scells</string-name>
          <email>harry.scells@uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Moritz Staudinger</string-name>
          <email>moritz.staudinger@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Allan Hanbury</string-name>
          <email>allan.hanbury@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leipzig University</institution>
          ,
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TU Wien</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>While systematic literature reviews are central to evidence-based medicine, creating them takes significant time and effort. As such, numerous efforts have been dedicated to automating various aspects of systematic review creation. One key technology being applied to automating systematic reviews is large language models (LLMs). However, evaluating methods that use LLMs poses one glaring risk: it is often unknown whether the LLM was trained on systematic reviews, the data used to evaluate many areas of systematic review automation. We propose a conceptual framework for constructing a new dataset based on the Cochrane Database of Systematic Reviews. We envision a dataset to enable prospective evaluation of large language models in an end-to-end systematic literature review automation task. In essence, we provide a way to evaluate systematic review automation methods that strictly guarantees no train-test leakage. This paper highlights limitations in current LLM evaluation methodologies by advocating for a real-world, evolving and dynamic dataset. We aim to mitigate data contamination and prompt sensitivity through prospective evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>Prospective Evaluation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Systematic Literature Reviews</kwd>
        <kwd>Citation Screening</kwd>
        <kwd>Data Contamination</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Systematic literature reviews (SLRs) are central to evidence-based medicine, informing clinical practice
and policy [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. They have a well-established and rigorous methodology for synthesising and evaluating
the evidence on a specific research question. Despite their importance, creating an SLR is slow and
labour-intensive, often taking months or years due to the amount of literature that needs assessing
and analysing, making them an ideal candidate for automation. Current efforts in automating SLRs
have focused primarily on individual processes, such as search query formulation [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ], query
refinement [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], document screening prioritisation [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], screening cut-off prediction [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ], data
extraction [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] or evidence summarisation [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ].
      </p>
      <p>To date, prospective evaluation of SLR automation has been limited to single reviews, frequently
conducted by biomedical and healthcare researchers and focusing on commercial tools or small-scale
experimental setups. This was due primarily to the cost of preparing and annotating the necessary
examples. On the other hand, the typical evaluation of ML algorithms in the tasks mentioned above
relies on rather simplified and siloed datasets, with previous research raising concerns about issues
such as annotation quality, data overlap, and even data leakage [15, 16].</p>
      <p>Large Language Models (LLMs) provide a path towards end-to-end SLR automation, which can
significantly speed up the generation of SLRs. However, evaluating LLMs in this context poses several
challenges, such as reduced reproducibility, data contamination, the need to adapt to
rapid changes in evidence, the occurrence of hallucinations, and the imperative to ensure high Recall.
This paper highlights our conceptual framework for evaluating LLMs for SLR automation tasks and
how we aim to extend it for end-to-end automation while also preventing data contamination. This
framework provides the groundwork for evaluating LLMs in the biomedical domain, featuring multiple
NLP and retrieval tasks and focusing on an evolving dataset. All these features allow for a relatively
annotation-free prospective assessment of the effectiveness of LLMs in SLR automation.</p>
      <p>[Figure 1: SLR protocol (1. SLR title, 2. SLR abstract, 3. Background, 4. Methods) → LLM → SLR results (1. SLR abstract, 2. Plain language summary, 3. Search strategies, 4. Characteristics of studies, 5. Meta-analysis, 6. ...). Stages: query generation (search strategy, e.g. "Tylosin[tiab] OR Amphotericin[tiab] OR Antimycin[tiab] OR Brefeldin[tiab] OR Bryostatins[tiab]"), screening (relevant documents), analysis (meta-analysis). Models: M1, M2.]</p>
    </sec>
    <sec id="sec-2">
      <title>2. Design of Prospective Evaluation Dataset</title>
      <p>Current LLM evaluation methods often do not adequately reflect the creation of SLRs, mainly due to
their reliance on retrospective data, which leads to issues such as data contamination. Our dataset creation
framework (Figure 1) can address this issue by enabling evaluation on newly published Cochrane
SLR protocols, effectively mitigating contamination. Cochrane is an organisation that manually collects,
synthesises, and disseminates medical evidence to aid in making informed decisions regarding health
treatments and policies.<sup>1</sup></p>
      <p>We envisage creating ‘evaluation sandboxes’ where LLMs can be evaluated in real time against
the gold-standard data available after the completion of manual reviews. We plan to use the TIREx
platform [17] as the online model submission platform. Additionally, we envision using the CSMeD
framework [15] and the ReviewManager (RevMan)<sup>2</sup> format published by the Cochrane Library as
the basis for dynamically creating the dataset. With these properties, we intend to publicly release
dataset snapshots of the so-called ‘knowledge cutoffs’ on a regular basis. Finally, we aim to include
several under-investigated SLR methodologies in the dataset, such as qualitative reviews (analysing
based on characteristics such as interviews or focus groups) and prognosis reviews (analysing based
on characteristics such as demographic or lifestyle factors) to improve the generalisability of LLMs to
different types of SLRs.</p>
      <p>Our task begins with the SLR protocol, including the title, abstract, and eligibility criteria, and aims
at predicting the entirety of an SLR’s future components. Specifically, the envisioned tasks that our
prospective evaluation dataset would encompass include:
• generating a search strategy through Boolean queries,
• identifying relevant publications,
• extracting PICO (population, intervention, comparison, and outcome) elements, and
• calculating meta-analysis outcomes.</p>
      <p><sup>1</sup> https://www.cochrane.org/
<sup>2</sup> https://training.cochrane.org/online-learning/core-software/revman</p>
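      <p>As a minimal sketch of how the second task (identifying relevant publications) could be scored once a review's included studies become available, the following Python example compares a predicted set of publications with the gold set and reports Recall and precision. The function name, the document identifiers, and the choice of these two metrics are illustrative assumptions and not part of the proposed framework.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningScores:
    recall: float
    precision: float


def score_retrieved(retrieved_ids, included_ids):
    """Compare predicted relevant publications with a review's included studies."""
    retrieved = set(retrieved_ids)
    included = set(included_ids)
    hits = len(retrieved & included)
    # Guard against empty sets so the function never divides by zero.
    recall = hits / len(included) if included else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return ScreeningScores(recall=recall, precision=precision)


# Hypothetical identifiers, for illustration only.
scores = score_retrieved(["d1", "d2", "d3", "d4"], ["d2", "d4", "d5"])
print(scores)
```
]]></preformat>
      <p>Treating both inputs as sets means duplicate identifiers in a model's output do not inflate either score.</p>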
      <p>Historically, each of these steps has been evaluated independently. However, the advancement of LLMs
enables us to test them as an integrated end-to-end approach. This holistic approach aims to
comprehensively automate the SLR process, addressing the need for efficient evidence synthesis in
evidence-based medicine.</p>
      <p>Recent studies [18, 19] have found many benchmark datasets already compromised. The CONDA
database<sup>3</sup> is one recent example of a community effort to keep track of the contamination of
datasets in LLMs. Our prospective evaluation framework prevents contamination by using recently
finished SLR protocols for evaluation, which were published only after the knowledge cutoff date of the
LLM being evaluated. Therefore, no test collection data can already be present in the pretraining data of the
LLM.</p>
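      <p>The cutoff condition described above can be sketched as a simple date filter. The example below is only an illustration under stated assumptions: the Review record, its field names, and all dates are hypothetical, and a real implementation would read publication dates from the Cochrane Library.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Review:
    review_id: str   # hypothetical identifier
    published: date  # date the item first became publicly available


def uncontaminated(reviews, knowledge_cutoff):
    """Keep only reviews published strictly after the model's knowledge cutoff."""
    return [r for r in reviews if r.published > knowledge_cutoff]


# Hypothetical records and cutoff date, for illustration only.
reviews = [
    Review("R1", date(2023, 3, 1)),
    Review("R2", date(2024, 2, 15)),
    Review("R3", date(2024, 6, 30)),
]
test_set = uncontaminated(reviews, knowledge_cutoff=date(2023, 12, 31))
print([r.review_id for r in test_set])
```
]]></preformat>
      <p>The strict comparison excludes anything published on the cutoff date itself, which is the conservative choice when the exact cutoff of a model is only known to the day.</p>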
    </sec>
    <sec id="sec-3">
      <title>3. Limitations and Future Directions</title>
      <p>One of the biggest limitations is that the time between SLR registration and publication is typically
around two years for Cochrane SLRs [20]. One way to mitigate this problem could be to use SLRs whose
protocols were registered some time ago but whose reviews are not yet available (as we expect that they
are close to first publication). For instance, 166 reviews registered in 2022 had still not published
their results by June 2024, and we can assume that many of them will publish the final review in the
next six months (by the end of 2024).<sup>4</sup></p>
      <p>Future directions of research using our prospective dataset include:
1. Extension of the dataset to non-biomedical SLRs, e.g., social science data,<sup>5</sup> to assess the LLMs’
abilities in other contexts;
2. Multi-dimensional evaluation metrics: new metrics that measure aspects beyond Recall, such as
contextual understanding and outcomes of systematic reviews [21];
3. Metrics for adaptive learning: new metrics to evaluate how well LLMs adapt to new evidence;
4. Bias detection and mitigation protocols: new methods to identify and address biases in the dataset
and the LLM outputs;
5. Finally, this dataset could also be expanded to focus on predicting SLR updates published by
Cochrane, and living systematic reviews.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Our conceptual framework proposes a way to evaluate LLMs for SLR automation that mitigates data
contamination. By leveraging prospective Cochrane reviews for forward prediction, we address key
challenges in current evaluation practices. Furthermore, as the prediction task is the end-to-end
generation of an SLR, we believe that this work could lay the groundwork for more effective IR and NLP
systems in the biomedical and healthcare domains. Finally, this framework allows for the creation of a
relatively annotation-free evolving test collection.</p>
      <p><sup>3</sup> https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database
<sup>4</sup> https://www.cochranelibrary.com/cdsr/reviews
<sup>5</sup> Using SLRs created by the Campbell Collaboration.</p>
      <p>This work was supported by the EU Horizon 2020 ITN/ETN on Domain Specific Systems for Information
Extraction and Retrieval – DoSSIER (H2020-EU.1.3.1., ID: 860721).</p>
      <p>[14] C. Shaib, M. L. Li, S. Joseph, I. J. Marshall, J. J. Li, B. C. Wallace, Summarizing, simplifying, and
synthesizing medical evidence using GPT-3 (with varying success), in: Proceedings of the
61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers),
ACL 2023, Toronto, Canada, July 9-14, 2023, Association for Computational Linguistics, 2023, pp.
1387–1407.</p>
      <p>[15] W. Kusa, Ó. E. Mendoza, M. Samwald, P. Knoth, A. Hanbury, CSMeD: Bridging the Dataset Gap in
Automated Citation Screening for Systematic Literature Reviews, in: 37th Conference on Neural
Information Processing Systems Track on Datasets and Benchmarks, 2023.</p>
      <p>[16] A. Dhrangadhariya, H. Müller, DISTANT-CTO: A zero cost, distantly supervised approach to
improve low-resource entity extraction using clinical trials literature, in: D. Demner-Fushman, K. B.
Cohen, S. Ananiadou, J. Tsujii (Eds.), Proceedings of the 21st Workshop on Biomedical Language
Processing, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 345–358. URL:
https://aclanthology.org/2022.bionlp-1.34. doi:10.18653/v1/2022.bionlp-1.34.</p>
      <p>[17] M. Fröbe, J. H. Reimer, S. MacAvaney, N. Deckers, S. Reich, J. Bevendorff, B. Stein, M. Hagen,
M. Potthast, The information retrieval experiment platform, in: Proceedings of the 46th
International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023, pp.
2826–2836.</p>
      <p>[18] S. Balloccu, P. Schmidtová, M. Lango, O. Dusek, Leak, cheat, repeat: Data contamination and
evaluation malpractices in closed-source LLMs, in: Y. Graham, M. Purver (Eds.), Proceedings of
the 18th Conference of the European Chapter of the Association for Computational Linguistics
(Volume 1: Long Papers), Association for Computational Linguistics, St. Julian’s, Malta, 2024, pp.
67–93. URL: https://aclanthology.org/2024.eacl-long.5.</p>
      <p>[19] O. Sainz, J. Campos, I. García-Ferrero, J. Etxaniz, O. L. de Lacalle, E. Agirre, NLP evaluation in
trouble: On the need to measure LLM data contamination for each benchmark, in: H. Bouamor,
J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP
2023, Association for Computational Linguistics, Singapore, 2023, pp. 10776–10787. URL:
https://aclanthology.org/2023.findings-emnlp.722. doi:10.18653/v1/2023.findings-emnlp.722.</p>
      <p>[20] M. Sampson, K. G. Shojania, C. Garritty, T. Horsley, M. Ocampo, D. Moher, Systematic reviews
can be produced and published faster, Journal of clinical epidemiology 61 (2008) 531–536.</p>
      <p>[21] W. Kusa, G. Zuccon, P. Knoth, A. Hanbury, Outcome-based evaluation of systematic review
automation, in: Proceedings of the 2023 ACM SIGIR International Conference on the Theory of
Information Retrieval (ICTIR ’23), ACM, Taipei, Taiwan, 2023, p. 9. URL:
https://doi.org/10.1145/3578337.3605135. doi:10.1145/3578337.3605135.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-I.</given-names>
            <surname>Raquel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Duncan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Alison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Debra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Susanne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kate</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stephen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Amber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lesley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Christian</surname>
          </string-name>
          , W. Paul, W. Nerys,
          <article-title>Systematic Reviews: CRD's guidance for undertaking reviews in health care</article-title>
          , CRD, University of York, York,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Scells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zuccon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Koopman</surname>
          </string-name>
          ,
          <article-title>A comparison of automatic boolean query formulation for systematic reviews</article-title>
          ,
          <source>Information Retrieval Journal</source>
          <volume>24</volume>
          (
          <year>2021</year>
          )
          <fpage>3</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Scells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Koopman</surname>
          </string-name>
          , G. Zuccon,
          <article-title>Can ChatGPT write a good boolean query for systematic review literature search?</article-title>
          ,
          <source>arXiv preprint arXiv:2302.03495</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Staudinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Kusa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lipani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <article-title>A reproducibility and generalizability study of large language models for query generation</article-title>
          ,
          <source>in: Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP '24)</source>
          ,
          <year>2024</year>
          . doi:
          <pub-id pub-id-type="doi">10.1145/3673791.3698432</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Scells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zuccon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Koopman</surname>
          </string-name>
          ,
          <article-title>Automatic boolean query refinement for systematic review literature search</article-title>
          ,
          <source>The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019</source>
          11 (
          <year>2019</year>
          )
          <fpage>1646</fpage>
          -
          <lpage>1656</lpage>
          . URL: https://doi.org/10.1145/3308558.3313544. doi:
          <pub-id pub-id-type="doi">10.1145/3308558.3313544</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alharbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stevenson</surname>
          </string-name>
          ,
          <article-title>Refining Boolean queries to identify relevant studies for systematic review updates</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>27</volume>
          (
          <year>2020</year>
          )
          <fpage>1658</fpage>
          -
          <lpage>1666</lpage>
          . URL: https://doi.org/10.1093/jamia/ocaa148. doi:
          <pub-id pub-id-type="doi">10.1093/jamia/ocaa148</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kusa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Knoth</surname>
          </string-name>
          ,
          <article-title>Automation of citation screening for systematic literature reviews using neural networks: A replicability study</article-title>
          , in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer International Publishing, Cham,
          <year>2022</year>
          , pp.
          <fpage>584</fpage>
          -
          <lpage>598</lpage>
          . URL: https://doi.org/10.1007/978-3-030-99736-6_39.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McDonagh</surname>
          </string-name>
          ,
          <article-title>A prospective evaluation of an automated classification system to support evidence-based medicine and systematic review</article-title>
          ,
          <source>in: AMIA annual symposium proceedings</source>
          , volume
          <volume>2010</volume>
          , American Medical Informatics Association,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stevenson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bin-Hezam</surname>
          </string-name>
          ,
          <article-title>Stopping methods for technology assisted reviews based on point processes</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Scells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Koopman</surname>
          </string-name>
          , G. Zuccon,
          <article-title>Zero-shot generative large language models for systematic review screening automation</article-title>
          ,
          <source>in: Advances in Information Retrieval - 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part I</source>
          , volume
          <volume>14608</volume>
          of Lecture Notes in Computer Science,
          <year>2024</year>
          , pp.
          <fpage>403</fpage>
          -
          <lpage>420</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Nye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Marshall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nenkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <article-title>A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature</article-title>
          ,
          <source>in: Proceedings of ACL</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dhrangadhariya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Not so weak PICO: leveraging weak supervision for participants, interventions, and outcomes recognition for systematic review automation</article-title>
          ,
          <source>JAMIA open</source>
          <volume>6</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>DeYoung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <article-title>Overview of MSLR2022: A shared task on multi-document summarization for literature reviews</article-title>
          ,
          <source>in: Proceedings of the Third Workshop on Scholarly Document Processing</source>
          , Association for Computational Linguistics, Gyeongju, Republic of Korea,
          <year>2022</year>
          , pp.
          <fpage>175</fpage>
          -
          <lpage>180</lpage>
          . URL: https://aclanthology.org/2022.sdp-1.20.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shaib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joseph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Marshall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <article-title>Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with varying success)</article-title>
          ,
          <source>in: Proceedings of the</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>