<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Model Literature Reviews In Interdisciplinary Science: A Systems Biology Perspective</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Charvi Jain</string-name>
          <email>charvi.jain@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sahar Vahdati</string-name>
          <email>sahar.vahdati@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nandu Gopan</string-name>
          <email>nandu.gopan@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivo F. Sbalzarini</string-name>
          <email>ivo.sbalzarini@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <email>jens.lehmann@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <kwd-group>
          <kwd>Literature Review</kwd>
          <kwd>Large Language Models</kwd>
          <kwd>Scientific Literature</kwd>
          <kwd>Interdisciplinary Science</kwd>
          <kwd>Systems Biology</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Amazon</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI</institution>
          ,
          <addr-line>Dresden/Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Center for Systems Biology Dresden</institution>
          ,
          <addr-line>Pfotenhauerstr. 108, 01307 Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Cluster of Excellence Physics of Life, TU Dresden</institution>
          ,
          <addr-line>Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Dresden University of Technology, Faculty of Computer Science</institution>
          ,
          <addr-line>Nöthnitzer Str. 46, 01187 Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Leibniz University Hanover</institution>
          ,
          <addr-line>Welfengarten 1, 30167 Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Max Planck Institute of Molecular Cell Biology and Genetics</institution>
          ,
          <addr-line>Pfotenhauerstr. 108, 01307 Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>26</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>We evaluate the effectiveness of current large language model (LLM) literature review systems in interdisciplinary domains. While LLMs can support and accelerate reviewing the scientific literature, it is unclear how they cope with interdisciplinary science, where sources from multiple fields must be integrated according to relevance defined by context. We study this from the perspective of systems biology, a field that combines biology, mathematics, physics, and computer science. Using a set of expert-defined research questions, we assess the ability of LLMs to meaningfully integrate cross-domain knowledge and correctly reflect relevance. Specifically, we evaluate the quality of generated reports and the relevance of retrieved references from five different review models. We find that LLMs are a valuable augmentative tool for literature reviews, but trade off report quality for completeness in interdisciplinary domains. We address these limitations by proposing a novel method, termed AURORA, which is particularly designed for interdisciplinary applications. On the interdisciplinary systems biology benchmark, AURORA offers good coverage with high-quality reports.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Scientific research usually starts by reviewing prior art from the literature. This involves extensive
search across multiple sources, followed by critical evaluation of the relevance of each result and the
semantic relations between them. This time-consuming and somewhat repetitive task can be accelerated
and augmented with the help of LLMs, including generating the final written report. However, LLMs
cannot directly be used for literature surveys, due to the risk of hallucinating nonexistent references or
generating unverifiable reports. Instead, LLMs must be carefully integrated with literature databases
into robust automated workflows for scientific literature review (SLR). This has been successfully
practiced in several approaches, including Elicit [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Scite [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], and Undermind [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Such systems have
demonstrated their potential for reducing workload, but their effectiveness in interdisciplinary research
domains remains unclear.
      </p>
      <p>Literature reviews in interdisciplinary fields require integrating sources from multiple disciplines
in a contextually meaningful way. This hinges on matching potentially different domain-specific
vocabularies and assessing results based on their semantic relevance to the question. Here, we
empirically benchmark current state-of-the-art LLM SLR methodologies in an interdisciplinary setting. We
specifically consider the example of systems biology, which integrates knowledge from biology, physics,
mathematics, and computer science in order to provide predictive mechanistic understanding of living
systems [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. The breadth of the research field of systems biology creates several data integration and
integrative analysis challenges [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] that are difficult even for human scientists. In order to evaluate
how well LLM-based SLR systems perform in systems biology, we introduce an end-to-end systematic
comparison framework based on research questions formulated by domain experts and quantitative
evaluation of results. We find that previous LLM SLR systems focus on certain performance dimensions
while neglecting others. Based on this observation, we propose a new approach: AURORA — Automated
Understanding and Review Of Research Articles. The proposed AURORA method balances report
quality and reference retrieval. This allows us to identify opportunities for further development
of LLM SLR systems in interdisciplinary science.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>We design an end-to-end evaluation framework as illustrated in Fig. 1. The key ingredients are: selection
of SLR methods, dataset acquisition, AURORA development, and evaluation metrics.</p>
      <p>Selection of SLR Methods: From among the existing LLM SLR methods Elicit, Scite, Perplexity,
EvidenceHunt, MirrorThink, Scispace, and Undermind, we selected four state-of-the-art approaches
by the following criteria: the approach should be LLM-driven; it should focus on biology, medicine,
or randomized controlled trials as its domain; and it should be a retrieve-and-generate assistant rather
than a simple search wizard or writing assistant.</p>
      <p>Dataset acquisition: We organized a workshop with a total of 15 practicing domain experts from
systems biology, including doctoral students, postdoctoral researchers, and principal investigators from
the Center for Systems Biology Dresden. We asked them to formulate broad and open-ended research
questions that had not previously been answered in any single publication. We collected 12 such research
questions of diverse type and nature (see Appendix A).</p>
      <p>AURORA development: We design AURORA to perform an end-to-end systematic review of all
literature relevant to a given research question. It combines keyword search, abstract screening,
full-text understanding of individual papers, and integration of information from multiple papers. As data
sources, AURORA uses the public APIs of arXiv and PubMed, providing almost complete coverage of
the literature relevant to systems biology. PubMed’s metadata supports convergent search over multiple
iterations, which is not possible with arXiv. Hence, the evaluation below uses the PubMed database.</p>
      <p>The overall design of AURORA is summarized in Fig. 2. We automate keyword search by prompting
an LLM to generate five search phrases for a given research question, selecting the one yielding the
most results. We screen the abstracts of the resulting papers in a vector-space embedding. We repeat
the process for successively longer search time horizons until we find 5–10 “seed papers”. Next, we
scan the references of these seed papers using PubMed’s metadata, collect their unique identifiers
(PubMed ID (PMID), PubMed Central ID (PMCID), DOI), and screen the abstracts of all cited papers. This is
repeated until the result set converges (i.e., no new papers are found). Upon convergence, the full texts
of all found papers are fetched and collectively fed into a long-context LLM. If a full text is not available
(or pay-walled), only the abstract is used. The LLM summarizes each paper, and a re-ranking model is
applied to score the summaries. From this, the final written report is generated using the LLM.</p>
      <p>
        Implementation details: Overall, AURORA uses one embedding model, one re-ranking model, and
three LLMs (for keyword-phrase generation, full-text summarization, and report compilation). For the
evaluation below, all three LLM instances were GPT-4o-mini. For abstract screening, embeddings were
generated using the nomic-embed-text-v1 model (https://huggingface.co/nomic-ai/nomic-embed-text-v1) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which supports a sequence length of 8192 tokens.
The search time horizon was iteratively extended in 12-month steps until seed papers were found. The
maximum number of scans allowed is 20 per research question. The summaries were re-ranked using
the Jina re-ranker (https://huggingface.co/jinaai/jina-reranker-v1-turbo-en), based on their relevance to the initially given research question.
      </p>
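      <p>Abstract screening in a vector-space embedding amounts to keeping the abstracts whose embedding lies close to that of the research question. A minimal sketch, with toy vectors standing in for nomic embeddings and an illustrative (not AURORA's) similarity threshold:</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def screen_abstracts(question_vec, abstract_vecs, threshold=0.5):
    """Keep the IDs of abstracts whose embedding is sufficiently
    similar to the embedded research question."""
    return [pid for pid, vec in abstract_vecs.items()
            if cosine(question_vec, vec) >= threshold]

question = [1.0, 0.0, 1.0]                       # toy question embedding
abstracts = {"p1": [0.9, 0.1, 1.1],              # near the question
             "p2": [0.0, 1.0, 0.0]}              # orthogonal to it
kept = screen_abstracts(question, abstracts)     # ["p1"]
```

      <p>In practice the vectors would come from the embedding model, and the threshold (or a top-k cutoff) controls how aggressively abstracts are filtered before full-text processing.</p>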
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation and Results</title>
      <p>We use the evaluation metrics from Fig. 1 to perform a qualitative analysis of the generated reports and a
quantitative analysis of the references retrieved by Elicit, Scite, Evidence Hunt, Undermind, and AURORA.</p>
      <sec id="sec-3-1">
        <title>3.1. LLM-Generated Reports</title>
        <p>LLM-as-judge evaluation: We use the state-of-the-art models GPT-4o and Claude 3.5 Sonnet as judges
to rate the reports generated by the different LLM SLR approaches across all 12 research questions on a
Likert scale of 1–5, based on the criteria in Table 1. Higher ratings indicate better reports. The spider
charts of the results from both judges are shown in Fig. 3. According to both judges, the performance of
Evidence Hunt and AURORA was similar for almost all criteria, whereas Scite showed moderate
performance, and Elicit and Undermind under-performed.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Criteria and judge prompts used for the LLM-as-judge evaluation of the generated reports.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Metric</th>
                <th>Prompt description</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Coverage</td>
                <td>Assess how well the report addresses the research question by considering all relevant aspects, perspectives, and sub-questions.</td>
              </tr>
              <tr>
                <td>Thoroughness</td>
                <td>Examine how thoroughly and in-depth each aspect is explained in the report, along with how well it is supported by evidence and clear explanations.</td>
              </tr>
              <tr>
                <td>Boldness</td>
                <td>Examine whether the report presents well-supported arguments with confidence, handles ambiguous points with clarity, and is free of vague language.</td>
              </tr>
              <tr>
                <td>Readability</td>
                <td>Assess how easily the report can be understood and followed by evaluating its structure, including the use of headings, subheadings, and sufficient context.</td>
              </tr>
              <tr>
                <td>Balanced viewpoints</td>
                <td>Evaluate if the report presents balanced perspectives on conflicting literature, avoids extreme views unless well supported, and acknowledges different sides of an argument.</td>
              </tr>
              <tr>
                <td>Sensibility of references</td>
                <td>Assess the report’s alignment with common sense and logical coherence. Evaluate the rationality and chronology of references in the report.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Average length of generated reports: We compare the average number of tokens generated for
each research question and approach. As shown in Table 2, Elicit and Undermind produced shorter
reports than Scite and Evidence Hunt, suggesting they may not comprehensively address open-ended
literature research questions. This aligns with Fig. 3, where Elicit and Undermind received lower
ratings for coverage and thoroughness. AURORA used the most tokens, indicating it may provide more
comprehensive contextual information, whereas Evidence Hunt and Scite were almost equal in the middle.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Retrieved References</title>
        <p>
          Elicit, Scite, and Undermind use Semantic Scholar, which has access to 200 million articles [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], whereas
Evidence Hunt and AURORA use PubMed with approx. 37 million articles [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] (cf. Table 3). We used the
initially retrieved references, without asking for extensions, to ensure fairness across approaches.
        </p>
        <p>
          Overlap of references: We count the number of common references reported by any two SLR
approaches. Since the approaches are uncorrelated, this allows estimating the fraction of all relevant
references any method <italic>A</italic> finds as: (number of common references between <italic>A</italic> and <italic>B</italic>)/(number of references found by <italic>B</italic>) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This metric is called “overlap”, since a value of 1 indicates complete overlap of results. A color
map of the overlap between any two approaches is shown in Fig. 4, where rows are <italic>A</italic> and columns are
<italic>B</italic>. The diagonal is set to 0 to better utilize the color range. Undermind shows the highest overlap
with other methods, likely because it finds the most references. The overlap of AURORA is almost as
high, while Elicit, Scite, and Evidence Hunt show significantly lower overlaps (below 0.075) with other
methods, indicating limited efficiency in finding the best, most relevant, or foundational papers for the
question.
        </p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption>
            <p>Literature database used by each SLR approach, database size, and recall of retrieved references.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>SLR Approach</th>
                <th>Articles (million)</th>
                <th>Recall (%)</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>Elicit</td><td>200</td><td>4.46</td></tr>
              <tr><td>Scite</td><td>200</td><td>6.64</td></tr>
              <tr><td>EvidenceHunt</td><td>37</td><td>3.88</td></tr>
              <tr><td>Undermind</td><td>200</td><td>–</td></tr>
              <tr><td>AURORA</td><td>37</td><td>–</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Recall of references: Recall is defined as the fraction of references retrieved by a given SLR method
relative to the total set of references found by all methods combined. This generalizes the pair-wise
overlap metric by considering all other methods as the reference set. Table 3 shows that Evidence Hunt
has the lowest recall score, suggesting that it fails to find a sufficient number of relevant references
despite high scores in other criteria (see Fig. 3). AURORA mitigates this limitation and stands close to
Undermind in its ability to retrieve highly relevant references.</p>
        <p>Recency of references: We count the average number of references from recent years (2022,
2023, 2024) across all research questions. Table 3 shows that Elicit has only a 75% chance of finding one
recent paper on average, whereas all other approaches find several recent papers. AURORA’s iterative
temporal approach helps it capture more recent papers than Evidence Hunt from the same database.
Undermind’s very high recency score is likely due to its use of Semantic Scholar, which is a much larger
literature database than PubMed.</p>
      </sec>
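      <p>The overlap and recall metrics used in Sec. 3.2 reduce to simple set arithmetic over the retrieved reference lists; a minimal sketch (with toy paper IDs, not actual results):</p>

```python
def overlap(refs_a, refs_b):
    """Estimate the fraction of all relevant references method A finds,
    using method B's (assumed independent) results as a sample:
    |A intersect B| / |B|."""
    return len(set(refs_a) & set(refs_b)) / len(set(refs_b))

def recall(refs_method, refs_by_all_methods):
    """Fraction of the union of all methods' references
    that a single method retrieves."""
    union = set().union(*map(set, refs_by_all_methods))
    return len(set(refs_method) & union) / len(union)

a = ["p1", "p2", "p3"]            # references found by method A
b = ["p2", "p3", "p4", "p5"]      # references found by method B
overlap(a, b)                     # 2/4 = 0.5
recall(a, [a, b])                 # 3/5 = 0.6
```

      <p>A complete overlap of results gives a value of 1, matching the definition above; recall treats the pooled references of all methods as the reference set.</p>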
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>We presented an initial study on using LLM SLR methods in an interdisciplinary field of science, here
systems biology. We found that the performance of state-of-the-art methods is complementary in terms
of report quality and reference recall/recency, reflecting a trade-off in interdisciplinary searches. The
AURORA approach proposed here could provide a way of addressing this limitation, as we found it
to maintain the best balance. This hints at its ability to retrieve contextually relevant references
while still achieving good literature coverage. This hinges on iterating temporal search with LLM-based
full-text understanding until convergence. In the future, we will extend AURORA to additional database
sources, including databases of gene and protein sequences.</p>
      <p>This work was supported by the German Federal Ministry of Education and Research (BMBF) through
DAAD project 57616814 (SECAI, School of Embedded Composite Artificial Intelligence). We thank
all participants of the workshop at the Center for Systems Biology Dresden for contributing research
questions for the evaluation.</p>
    </sec>
    <sec id="sec-5">
      <title>A. Online Resources</title>
      <list list-type="bullet">
        <list-item>
          <p>The research questions used in the evaluation are available as online spreadsheet A.</p>
        </list-item>
        <list-item>
          <p>The reports from all approaches with criteria scores are available as online spreadsheet B.</p>
        </list-item>
      </list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kung</surname>
          </string-name>
          ,
          <article-title>Elicit (product review)</article-title>
          ,
          <source>Journal of the Canadian Health Libraries Association / Journal de l'Association des bibliothèques de la santé du Canada</source>
          <volume>44</volume>
          (
          <year>2023</year>
          ). URL: https://journals.library.ualberta.ca/jchla/index.php/jchla/article/view/29657. doi:10.29173/jchla29657.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Nicholson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mordaunt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Uppala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rosati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Grabitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rife</surname>
          </string-name>
          ,
          <article-title>Scite: A smart citation index that displays the context of citations and classifies their intent using deep learning</article-title>
          ,
          <source>Quantitative Science Studies</source>
          <volume>2</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          . doi:10.1162/qss_a_00146.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Nicholson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Rife</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rosati</surname>
          </string-name>
          ,
          <article-title>scite: The next generation of citations</article-title>
          ,
          <source>Learned Publishing</source>
          <volume>34</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hartke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ramette</surname>
          </string-name>
          ,
          <source>Benchmarking the undermind search assistant</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Szallasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stelling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Periwal</surname>
          </string-name>
          ,
          <source>System Modeling in Cellular Biology: From Concepts to Nuts and Bolts</source>
          ,
          <year>2006</year>
          . doi:10.7551/MITPRESS/9780262195485.001.0001.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rowen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Galas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aitchison</surname>
          </string-name>
          ,
          <article-title>Systems biology at the Institute for Systems Biology</article-title>
          ,
          <year>2008</year>
          . doi:10.1093/bfgp/eln027.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <source>Data Integration in the Life Sciences: 6th International Workshop, DILS 2009, Manchester, UK, July 20–22, 2009: Proceedings</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kurali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Menius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Review of integrative analysis challenges in systems biology</article-title>
          ,
          <year>2011</year>
          . doi:10.1198/sbr.2010.09027.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Nussbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Morris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Duderstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mulyar</surname>
          </string-name>
          ,
          <article-title>Nomic embed: Training a reproducible long context text embedder</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402.01613. arXiv:2402.01613.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] University of Calgary,
          <source>Semantic Scholar resources statistics</source>
          ,
          <year>2024</year>
          . URL: https://libguides.ucalgary.ca/c.php?g=732144&amp;p=5260798, last accessed 15 September 2024.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] National Library of Medicine,
          <source>PubMed resources statistics</source>
          ,
          <year>2024</year>
          . URL: https://pubmed.ncbi.nlm.nih.gov/, last accessed 15 September 2024.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>