<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deep Research in the Era of Agentic AI: Requirements and Limitations for Scholarly Research</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohamad Yaser Jaradeh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sören Auer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>L3S Research Center, Leibniz University Hannover</institution>
          ,
          <addr-line>Hanover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TIB - Leibniz Information Centre for Science and Technology</institution>
          ,
          <addr-line>Hanover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the fast-evolving era of agentic AI, Large Language Models (LLMs) from major providers and open-source alternatives offer unprecedented capabilities for “deep search”, enabling complex, iterative information retrieval and synthesis crucial for academic endeavors. However, their application in scientific research and paper writing necessitates strict requirements and a critical awareness of inherent limitations, including the risks of unreviewed content, temporal biases, and access barriers such as paywalls. This vision paper discusses the requirements a scientific deep research system should satisfy to become a viable candidate (i.e., a valuable system for researchers), as well as the limitations observed in current systems (industry-grade and community-developed). We also outline a path forward for harnessing agentic AI in scientific discovery and scholarly communication.</p>
      </abstract>
      <kwd-group>
        <kwd>Agentic AI</kwd>
        <kwd>Deep (Re)Search</kwd>
        <kwd>Information Asymmetry</kwd>
        <kwd>Unreviewed Content Risks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent advances in Large Language Models (LLMs) have shifted the focus of Artificial Intelligence (AI)
research from static prediction to agentic autonomy. In agentic systems, an LLM alternates between
natural-language reasoning and external actions (API calls, code execution, document retrieval),
iteratively revising its plan as new evidence arrives. The ReAct [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] framework first formalized this
plan-act-reflect loop, demonstrating that explicit reasoning traces boost performance on multi-step
tasks and provide hooks for tool integration.
      </p>
      <p>
        Commercial providers have rapidly productized these ideas. OpenAI’s Assistants and function-calling
APIs, built on GPT-4 and its multimodal successor GPT-4o, expose browsing, code-execution, and
memory primitives that let developers compose end-to-end research agents [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Google’s Gemini family
extends the paradigm to images, audio, and video while supporting one-million-token contexts.
      </p>
      <p>We term these capabilities agentic deep search: long-horizon research workflows in which an
autonomous agent formulates queries, harvests evidence from heterogeneous APIs, evaluates source
quality, and synthesizes structured outputs. By shrinking days of literature search or patent mining
into hours, agentic deep search promises to accelerate discovery across disciplines.</p>
      <p>
        Yet raw capability does not guarantee full reliability. LLMs embed a dated snapshot of the world;
factual accuracy diminishes (or vanishes entirely) for events or findings that post-date their last training
cut-off. Even frontier models’ system cards (i.e., documents that provide transparent, standardized
information about a model’s characteristics, capabilities, and limitations) caution that alignment safeguards
remain a “work in progress”, with residual risks of hallucination and persuasive misinformation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Autonomy introduces further hazards: an unchecked agent may cite unreviewed blogs, omit
paywalled studies, or leak sensitive data while invoking external tools. Without rigorous provenance logs,
researchers cannot audit how conclusions were reached – an unacceptable opacity in scholarly contexts.
      </p>
      <p>This vision paper argues that agentic AI can become a trustworthy partner in scientific inquiry only
if its design satisfies research-specific requirements. We explore our vision with two core components:
1. Requirements: We formulate eight criteria—verifiability, bias mitigation, temporal awareness,
selective memory, and more—that distinguish research-grade agents from consumer chatbots.
2. Limitations: We catalog current failure modes, including unvetted inputs, concept drift, paywall
bias, hallucination, context bottlenecks, privacy constraints, and operational costs.</p>
      <p>By categorizing and clarifying these principles, we aim to guide developers, funders, and
research communities toward agentic AI that is trustworthy enough for scientists and researchers
to rely on for complex tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background: Agentic AI and Deep Search Capabilities</title>
      <p>
        Agentic AI refers to autonomous systems designed to pursue complex goals with minimal human
intervention, demonstrating adaptability, advanced decision-making, and self-sufficiency in evolving
environments [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. They couple an LLM’s internal chain-of-thought reasoning with external actions
in a feedback loop of plan → act → reflect. Paradigms such as ReAct interleave natural-language
reasoning traces with tool calls, letting the model update its plan as new evidence arrives [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. More
generally, chain-of-thought prompting shows that exposing intermediate reasoning steps markedly
boosts accuracy on complex tasks, establishing an essential primitive for autonomy [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
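      <p>As an illustration, the plan → act → reflect loop can be sketched in a few lines of Python; the rule-based “reason” step and the toy search tool below are hypothetical stand-ins for an LLM call and a real retrieval API.</p>

```python
# Minimal sketch of a plan-act-reflect (ReAct-style) loop. The "reason" step
# here is a fixed rule and search_papers is a toy corpus lookup; in a real
# agent both would be backed by an LLM and a retrieval API.
def search_papers(query):
    # Stand-in for a real retrieval API (e.g., arXiv or Crossref search).
    corpus = {"concept drift": ["Zilean (2025)", "CORAL (2025)"]}
    return corpus.get(query, [])

def react_loop(goal, max_steps=5):
    evidence, trace = [], []
    for _ in range(max_steps):
        # Reason: decide the next action from the goal and evidence so far.
        action = ("search", goal) if not evidence else ("finish", None)
        trace.append(action)
        # Act: execute the chosen tool call.
        if action[0] == "search":
            evidence.extend(search_papers(action[1]))
        else:
            break
        # Reflect: the loop re-enters with the updated evidence.
    return evidence, trace

evidence, trace = react_loop("concept drift")
```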
      <p>
        Major providers now ship APIs that turn foundation models into multi-modal research agents.
OpenAI’s GPT-4/4o [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] family supports function-calling, web browsing, and code execution, while
Anthropic’s Claude 3/4 series [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] advertises sustained “computer use” workflows and long-context
reasoning suitable for document ranking and assessment. Academic and open-source efforts, e.g.,
Toolformer, where the model teaches itself when and how to call external APIs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], further illustrate
this trend toward tightly integrated tool use.
      </p>
      <p>
        With browsing and API hooks, LLM agents can conduct multi-step literature and data dives that
previously required human curators:
• Web-scale retrieval &amp; citation. For instance, WebGPT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] fine-tunes GPT-3 to navigate the
web, gather evidence, and output answers with inline references, outperforming human answers
(collected from Reddit) in blind tests.
• Systematic literature reviews. Products like Elicit [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] use LLMs to rank and summarize
hundreds of abstracts for evidence synthesis, while open-source LLAssist [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] automates key
parts of PRISMA-style reviews.
• Domain-specific mining. LLM agents already appear in patent invalidity or prior-art searches,
delivering faster recall of obscure filings [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and in bibliometric trend analysis pipelines that
classify thousands of papers to surface emerging topics [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>Collectively, these examples show that agentic AI search approaches can ingest and process 10<sup>2</sup>–10<sup>3</sup>
raw sources, draft structured summaries, and iterate as new information arrives.</p>
      <p>
        However, scientific inquiry sets a higher bar than consumer question answering. Researchers need
transparent provenance, reproducibility, domain-aware ranking, and the ability to handle
uncertainty [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The next sections highlight where current systems fall short and outline
what is needed for trustworthy Deep (Re-)search in academia.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Requirements for Applying Agentic AI Search in Research</title>
      <p>Casual question-answering agents can afford occasional gaps or fuzzy information provenance;
scholarly workflows cannot. To be admissible in a paper, every claim, dataset, and result produced
by an LLM agent must withstand peer review and re-analysis for years to come. This section distills
eight core requirements for turning an autonomous search agent into a reliable Deep
(Re-)search companion tool.</p>
      <p>[Figure 1: A Deep (Re-)search agent workflow, from user intent through a GUI wizard into a workflow of retrieval, reason/plan, task decomposition, verification, and delivery, backed by tools (vector DB, web scraper, guardrails, fact checking, memory) and constrained by limitations (paywalls, latency/cost, bias, retrieval gaps, hallucination, context limits, evaluation gaps).]</p>
      <p>
        Verifiability and Traceability. An agent must emit a machine-readable “audit trail” that links
each generated statement to (i) the exact retrieval query, (ii) stable identifiers (e.g., DOI, PubMed ID,
patent number), and (iii) the intermediate reasoning step that justified inclusion. Frameworks such as
TRiSM [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] (Trust, Risk &amp; Security Management) propose logging every tool call and chain-of-thought
token for post-hoc inspection, while LLMAuditor [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] demonstrates how a second LLM plus human
spot-checks can scalably detect citation drift or hallucination.
      </p>
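      <p>A minimal sketch of such a machine-readable audit record, with illustrative field names rather than a standard schema, could look as follows.</p>

```python
# Sketch of an audit-trail record linking a generated statement to its
# retrieval query, a stable identifier, and the reasoning step that justified
# inclusion. All field names and values are illustrative.
import json
from datetime import datetime, timezone

def audit_record(statement, query, identifier, reasoning):
    return {
        "statement": statement,
        "retrieval_query": query,
        "stable_id": identifier,      # e.g., a DOI, PubMed ID, or patent number
        "reasoning_step": reasoning,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

record = audit_record(
    statement="Drug X reduced relapse rates by 30%.",
    query="drug X relapse randomized trial",
    identifier="doi:10.1000/example",  # hypothetical DOI
    reasoning="Phase-III RCT matched the inclusion criteria.",
)
log_line = json.dumps(record)          # one JSON line per tool call
```

Emitting one such line per tool call yields a log that a second LLM or a human auditor can replay post hoc.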
      <p>
        Ethical and Bias Mitigation. Agentic search must actively counter selection biases that skew
evidence toward English-language, high-impact venues or majority demographics. Gallegos et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
survey bias in LLMs and catalog mitigation approaches, from counterfactual prompts to calibrated
relevance scoring, that should be embedded as first-class modules in agentic AI search. Recent
biomedical screening studies confirm that bias-aware prompts recover ∼12% more non-English RCTs
compared with naïve keyword filters [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        Scalable Breadth with Human Depth. Hybrid pipelines such as LatteReview [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] or GPT-assisted
screening prototypes [19] show that LLM agents can filter through tens of thousands of abstracts,
yet final selection, annotation, and interpretation still require expert researcher review. Designing
checkpoints where humans judge borderline cases keeps false positives below systematic-review
thresholds without sacrificing the speed and convenience of agentic AI search techniques.</p>
      <p>Seamless Workflow Integration. To fit existing toolchains, agents should speak bibliographic formats as well
as natural language. Feeding candidates straight into Zotero, exporting RIS/BibTeX, querying the arXiv
and Crossref APIs, and respecting journal impact-factor filters is essential. Tutorials on GPT-Zotero
bridges and OpenAI’s plugin/GPT-store model illustrate how modest wrappers can embed deep-search
capabilities inside everyday reference managers used by researchers [20, 21].
      </p>
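      <p>A minimal sketch of one such bridge, emitting a BibTeX entry that any reference manager can import; the helper, entry key, and DOI below are illustrative and not tied to any specific Zotero integration.</p>

```python
# Sketch of handing agent results to existing toolchains by emitting BibTeX.
# The function, key, and DOI are hypothetical illustrations.
def to_bibtex(key, title, authors, year, doi):
    return (
        f"@article{{{key},\n"
        f"  title  = {{{title}}},\n"
        f"  author = {{{authors}}},\n"
        f"  year   = {{{year}}},\n"
        f"  doi    = {{{doi}}}\n"
        f"}}"
    )

entry = to_bibtex("yao2023react", "ReAct: Synergizing Reasoning and Acting",
                  "Yao, Shunyu and others", 2023, "10.1000/example")
```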
      <p>Temporal Awareness &amp; Concept-Drift Control. Scientific consensus evolves; an agent that
combines 2012 and 2025 findings can produce misleading results. Temporal-drift frameworks such as
Zilean [22] and CORAL [23] adapt retrieval and ranking to shifting evidence bases, while code-centric
studies like Byam [24] show LLMs repairing projects broken by outdated dependencies. Conversely,
using LLMs for coding highlights the effects of temporal drift, where models use old and conflicting
packages to write code, unaware of newer releases or security vulnerabilities.</p>
      <sec id="sec-3-1">
        <title>Researcher-Controlled Narrative Scope. Deep (re-)searches must be bounded.</title>
        <p>Prompt-engineering studies in requirements engineering propose declarative “in-scope/out-of-scope”
schemas that agents use to accept or reject documents, ensuring that a cancer-biology review
does not drift into plant genomics unless explicitly instructed by the researcher [25]. Specification
work on multi-agent LLM systems further argues that formal task definitions are a prerequisite for
reproducibility and safety.</p>
        <p>
Peer-Reviewed-Only Source Constraints. Agents need configurable whitelists (e.g., journals
ranked Q1 or higher, A* conferences, or datasets with open licenses) and should fail fast if no qualifying
evidence exists. Title and abstract screening benchmarks show that adding an impact-factor filter
and peer-review flag cuts noise by 35% without hurting recall when compared to baseline semantic
search [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
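        <p>A minimal sketch of such a whitelist with fail-fast behavior; the venues and their rankings below are illustrative.</p>

```python
# Sketch of a configurable venue whitelist that fails fast when no qualifying
# peer-reviewed evidence exists, rather than silently falling back to grey
# literature. Venue rankings here are illustrative.
WHITELIST = {"Nature": "Q1", "Science": "Q1", "NeurIPS": "A*"}

def filter_sources(candidates, allowed_ranks=("Q1", "A*")):
    kept = [c for c in candidates
            if WHITELIST.get(c["venue"]) in allowed_ranks]
    if not kept:
        raise LookupError("no qualifying peer-reviewed evidence found")
    return kept

candidates = [
    {"title": "A", "venue": "Nature"},
    {"title": "B", "venue": "Personal blog"},
]
kept = filter_sources(candidates)   # keeps only the Nature paper
```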
        <p>Selective (Long-Term vs Short-Term) Memory. Researchers accumulate domain knowledge
(long-term) but isolate project-specific facts (short-term). Emerging work on episodic memory [26] for AI
agents recommends detachable, human-editable memory slots. Persistent memories store lab standards
or preferred datasets. Temporary slots hold the current paper set and disappear after publication. This
separation supports both cumulative expertise and clean-slate reproducibility.</p>
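        <p>The long-term/short-term split can be sketched as follows; the class and slot names are illustrative rather than drawn from any cited system.</p>

```python
# Sketch of detachable memory partitions: persistent slots hold durable lab
# knowledge, while episodic slots are cleared after a project ends, giving
# clean-slate reproducibility. Names are illustrative.
class AgentMemory:
    def __init__(self):
        self.persistent = {}   # lab standards, preferred datasets
        self.episodic = {}     # current paper set for one project

    def remember(self, key, value, durable=False):
        (self.persistent if durable else self.episodic)[key] = value

    def end_project(self):
        # Drop project-specific facts only; durable expertise survives.
        self.episodic.clear()

mem = AgentMemory()
mem.remember("preferred_dataset", "PubMed", durable=True)
mem.remember("current_papers", ["doi:10.1000/x1"], durable=False)
mem.end_project()
```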
        <p>Meeting these eight requirements turns an LLM-based search agent from a helpful assistant into a
research-grade collaborator. Table 1 shows a condensed overview of the requirements discussed above.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Key Limitations of Agentic AI in Deep Research</title>
      <p>Scientific rigor requires a high level of traceability and trustworthiness for agentic AI search agents to
be used on a large scale. We now present a taxonomy of the most limiting factors that still separate
today’s autonomous “deep-search” stacks from reliable academic workflows.</p>
      <p>Unvetted or Non-Peer-Reviewed Inputs. Browser-enabled agents easily scrape blogs, press
releases, or social-media threads that have not undergone editorial or peer scrutiny. A recent survey of
LLM hallucinations notes that one-third of observed confabulations could be traced to unreviewed web
pages being pulled into the context window without provenance checks [27]. In systematic-review
work, up to 18% of citations surfaced by naive agents pointed to pre-print servers or personal websites
rather than journals indexed in Web of Science or MEDLINE – forcing experts to step in to filter the
noise [28]. As a mitigation, agents must default to curated APIs (PubMed, Crossref, Dimensions) or
enforce whitelists such as Q1 journals and A* conferences before loosening to grey literature.</p>
      <p>Temporal Staleness and Concept Drift. LLMs freeze a snapshot of the world at their last training
cut-off; GPT-4’s base weights, for example, contain little post-2023 knowledge. Studies probing “effective
cut-off” dates reveal uneven freshness across domains, e.g., biomedicine lags by over a year compared
with computer science within the same model [29]. Follow-up work shows performance drops of 15–20%
on factoid questions whose answers shifted after the model’s cut-of [30].</p>
      <sec id="sec-4-1">
        <title>Paywalls, Licensing, and Institutional Access. Because autonomous agents have no legal right</title>
        <p>to bypass subscription barriers, deep searches systematically over-sample open-access content.
Bibliometric analyses of Wikipedia citations confirm a 44% OA bias even before agentic filtering. For
instance, some environmental-sciences reviews note that paywalled articles are routinely omitted in
global syntheses, especially for authors and institutions in low- and middle-income countries [31, 32]. These
numbers are skewed whenever key findings reside behind JSTOR, IEEE-Xplore, or Elsevier access walls.</p>
        <p>Mitigation: Workflow engineers must pair agents with institutional proxy resolvers, maintain
error-out policies when access fails, or flag “coverage gaps” for manual moderation.</p>
        <p>Hallucination and Mis-synthesis. Even when sources are valid, language models can fabricate links
or misattribute findings. Controlled evaluations detect high hallucination rates in retrieval-augmented
settings, and higher still in zero-retrieval modes for certain models. Mitigation methods like
entropy-based uncertainty estimators catch only a subset of these errors [33]. In climate-change case
studies, agents have promoted retracted pre-prints as definitive evidence, illustrating how unreviewed
claims can be converted into citable facts in research papers.</p>
        <p>Context-Length and Multimodal Bottlenecks. Scientific artifacts are long: a systematic review’s
appendix can exceed 100k tokens, and figures embed essential data. Yet performance degrades as context
length stretches; recent work shows sigmoid-like decay curves once input exceeds 32k tokens [34].
Attempts to push to 128k tokens (e.g., LongRoPE2 [35]) restore accuracy only with specialized fine-tuning
that few commercial APIs expose. Meanwhile, extracting tables, charts, and images still requires separate
vision pipelines; although LLM-based “ChatExtract” tools achieve promising figure-level accuracy, they
remain brittle on complex multi-panel plots [36].</p>
        <p>Scalability, Cost, and API Fragmentation. Token fees, rate limits, and heterogeneous API formats
limit most end-to-end automation efforts. PubMed and arXiv are freely searchable, but Web of Science
and Scopus impose subscription APIs. Many domain-specific repositories (e.g., ICSD for materials
science) have no programmatic interfaces for automated access. Even when APIs exist, processing
millions of abstracts incurs high costs on popular cloud endpoints, constraining reproducibility for the
general public.</p>
        <p>Privacy and Data Protection. When research involves sensitive human data (e.g., genomic
sequences, patient notes, or even private datasets from internal resources), routing data through third-party
LLM endpoints can breach GDPR or institutional data privacy regulations. Surveys on LLM-agent
security enumerate data-exfiltration vectors, from prompt leaks to malicious tool-wrapper code,
underscoring the need for on-premise or privacy-enhanced deployments [37].</p>
        <p>Given these risks, credible Deep (Re-)search pipelines must curtail full autonomy. Recommended
safeguards include:
• Whitelist enforcement for peer-reviewed venues before fallback to gray literature.
• Recency checks that mark claims older than a configurable horizon or outside the model’s
knowledge window.
• Access-gap alerts when paywalled or API-inaccessible content prevents coverage parity.
• Auto-generated audit trails plus mandatory human veto on any citation lacking stable
identifiers.
• Resource budgets that track cumulative emissions and token spend, prompting the researcher
to justify large-scale runs.</p>
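        <p>The recency-check safeguard above can be sketched as follows; the cut-off date, horizon, and reference date are illustrative values, not properties of any particular model.</p>

```python
# Sketch of a recency check that flags claims older than a configurable
# horizon, or newer than the model's knowledge cut-off. Dates are illustrative.
from datetime import date

MODEL_CUTOFF = date(2023, 12, 31)   # hypothetical training cut-off

def recency_flags(pub_date, horizon_years=5, today=date(2025, 7, 22)):
    flags = []
    if (today - pub_date).days > horizon_years * 365:
        flags.append("stale: older than horizon")
    if pub_date > MODEL_CUTOFF:
        flags.append("post-cutoff: outside model knowledge window")
    return flags

flags_old = recency_flags(date(2012, 1, 1))    # a dated finding
flags_new = recency_flags(date(2025, 1, 1))    # newer than the model
```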
        <p>By incorporating these considerations into agent orchestration layers, we can take advantage of LLM
eficiency while respecting the integrity and sustainability required of modern academia (cf. Figure 1).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Agentic AI has vaulted from laboratory curiosity to production-ready, real-world application, giving
researchers unprecedented power to orchestrate deep searches that span thousands of papers, code bases,
and datasets in minutes. Yet this paper has shown that transforming raw capability into research-grade
reliability demands an explicit contract: verifiable provenance, bias controls, temporal awareness,
paywall navigation, privacy safeguards, and human-in-the-loop checkpoints. Without these guardrails,
the very speed and scale that make LLM agents attractive in the first place can increase misinformation
and negatively impact reproducibility.</p>
      <p>Our forward-looking vision is a “research-grade agent” framework that layers domain-specific
enhancements on top of existing approaches and implementations:
• Built-in quality filters: peer-review flags, journal-rank allow-lists, and recency cut-offs, applied
before content enters the context window.
• Hybrid retrieval architectures: sequencing open-access mirrors, institutional proxy resolvers,
and licensed APIs, ensuring comprehensive coverage while respecting copyright and privacy
constraints.
• Episodic memory partitions: separating durable disciplinary knowledge from project-specific
scratchpads, enabling cumulative expertise without affecting reproducibility.
• Standardized reliability benchmarks: scoring agents on citation accuracy, coverage under
paywalls, and resistance to concept drift.
• Policy hooks: requiring disclosure of AI assistance in grant proposals, systematic reviews, and
journal submissions, creating incentives for transparent and auditable workflows.</p>
      <p>With the current pace of LLM advancement, we stand at a fork in the road. One path turns LLM
autonomy into an academic enhancer, compressing months of literature search into hours while elevating
the rigor of the scientific method. The other path, if unchecked, would broadcast misinformation at
machine speed.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We thank our colleague Allard Oelen for his valuable comments in reviewing this paper.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Gemini in order to: paraphrase and reword,
improve writing style, and check grammar and spelling, as well as Deep Research to find more related
work. After using these services, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
      <p>view automation using large language models, 2025. URL: https://arxiv.org/abs/2501.05468.
arXiv:2501.05468.
[19] C. Galli, A. V. Gavrilova, E. Calciolari, Large language models in systematic review screening:
Opportunities, challenges, and methodological considerations, Information 16 (2025). URL: https://www.mdpi.com/2078-2489/16/5/378. doi:10.3390/info16050378.
[20] Zotero GPT Integration: No Coding — anara.com, https://anara.com/blog/zotero-gpt-integration,
2024. [Accessed 22-07-2025].
[21] ChatGPT plugins — openai.com, https://openai.com/index/chatgpt-plugins, 2024. [Accessed
22-07-2025].
[22] Z. Deng, Q. Feng, B. Lin, G. G. Yen, Zilean: A modularized framework for large-scale temporal
concept drift type classification, Information Sciences 712 (2025) 122134. URL: https://www.sciencedirect.com/science/article/pii/S002002552500266X. doi:10.1016/j.ins.2025.122134.
[23] K. Xu, L. Chen, S. Wang, Coral: Concept drift representation learning for co-evolving time-series,
2025. URL: https://arxiv.org/abs/2501.01480. arXiv:2501.01480.
[24] F. Reyes, M. Mahmoud, F. Bono, S. Nadi, B. Baudry, M. Monperrus, Byam: Fixing breaking
dependency updates with large language models, 2025. URL: https://arxiv.org/abs/2505.07522.
arXiv:2505.07522.
[25] K. Huang, F. Wang, Y. Huang, C. Arora, Prompt engineering for requirements engineering: A
literature review and roadmap, 2025. URL: https://arxiv.org/abs/2507.07682. arXiv:2507.07682.
[26] C. DeChant, Episodic memory in ai agents poses risks that should be studied and mitigated, 2025.</p>
      <p>URL: https://arxiv.org/abs/2501.11739. arXiv:2501.11739.
[27] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu,
A survey on hallucination in large language models: Principles, taxonomy, challenges, and open
questions, ACM Trans. Inf. Syst. 43 (2025). URL: https://doi.org/10.1145/3703155. doi:10.1145/
3703155.
[28] L. Schmidt, I. Cree, F. Campbell, WCT EVI MAP group, Digital tools to support the systematic
review process: An introduction, J Eval Clin Pract 31 (2025) e70100.
[29] J. Cheng, M. Marone, O. Weller, D. Lawrie, D. Khashabi, B. V. Durme, Dated data:
Tracing knowledge cutofs in large language models, 2024. URL: https://arxiv.org/abs/2403.12958.
arXiv:2403.12958.
[30] C. Zhu, N. Chen, Y. Gao, Y. Zhang, P. Tiwari, B. Wang, Is your LLM outdated? a deep look at
temporal generalization, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Proceedings of the 2025 Conference
of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human
Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics,
Albuquerque, New Mexico, 2025, pp. 7433–7457. URL: https://aclanthology.org/2025.naacl-long.381/.
doi:10.18653/v1/2025.naacl-long.381.
[31] K. A. Wood, L. L. Jupe, F. C. Aguiar, A. M. Collins, S. J. Davidson, W. Freeman, L. Kirkpatrick,
T. Lobato-de Magalhães, E. McKinley, A. Nuno, J. F. Pagès, A. Petruzzella, D. Pritchard, J. P. Reeves,
S. M. Thomaz, S. A. Thornton, H. Yamashita, J. L. Newth, A global systematic review of the
cultural ecosystem services provided by wetlands, Ecosystem Services 70 (2024) 101673. URL:
https://www.sciencedirect.com/science/article/pii/S2212041624000809. doi:10.1016/j.ecoser.2024.101673.
[32] P. Yang, A. Shoaib, R. West, G. Colavizza, Open access improves the dissemination of science:
insights from wikipedia, Scientometrics 129 (2024) 7083–7106. URL: https://doi.org/10.1007/
s11192-024-05163-4. doi:10.1007/s11192-024-05163-4.
[33] S. Farquhar, J. Kossen, L. Kuhn, Y. Gal, Detecting hallucinations in large language models using
semantic entropy, Nature 630 (2024) 625–630. URL: https://doi.org/10.1038/s41586-024-07421-0.
doi:10.1038/s41586-024-07421-0.
[34] Z. Dong, J. Li, J. Jiang, M. Xu, W. X. Zhao, B. Wang, W. Chen, LongReD: Mitigating short-text
degradation of long-context large language models via restoration distillation, 2025. URL: https://arxiv.org/abs/2502.07365. arXiv:2502.07365.
[35] N. Shang, L. L. Zhang, S. Wang, G. Zhang, G. Lopez, F. Yang, W. Chen, M. Yang, LongRoPE2:
Near-lossless LLM context window scaling, in: Forty-second International Conference on Machine
Learning, 2025. URL: https://openreview.net/forum?id=jwMjzGpzi4.
[36] M. P. Polak, D. Morgan, Extracting accurate materials data from research papers with
conversational language models and prompt engineering, Nature Communications 15 (2024) 1569. URL:
https://doi.org/10.1038/s41467-024-45914-8. doi:10.1038/s41467-024-45914-8.
[37] B. Yan, K. Li, M. Xu, Y. Dong, Y. Zhang, Z. Ren, X. Cheng, On protecting the data privacy of large
language models (llms) and llm agents: A literature review, High-Confidence Computing 5 (2025)
100300. URL: https://www.sciencedirect.com/science/article/pii/S2667295225000042.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, Y. Cao, ReAct: Synergizing reasoning and acting in language models, in: International Conference on Learning Representations (ICLR), 2023.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] GPT-4o System Card — openai.com, https://openai.com/index/gpt-4o-system-card/, 2024. [Accessed 22-07-2025].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] K. Robison, OpenAI says its latest GPT-4o model is ‘medium’ risk — theverge.com, https://www.theverge.com/2024/8/8/24216193/openai-safety-assessment-gpt-4o, 2024. [Accessed 22-07-2025].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] D. B. Acharya, K. Kuppan, B. Divya, Agentic AI: Autonomous intelligence for complex goals-a comprehensive survey, IEEE Access 13 (2025) 18912–18936. doi:10.1109/ACCESS.2025.3532853.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Neural Information Processing Systems</source>
          , NIPS '22, Curran Associates Inc., Red Hook, NY, USA,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <article-title>Introducing Claude 4</article-title>
          , Anthropic, https://www.anthropic.com/news/claude-4,
          <year>2025</year>
          . [Accessed 22-07-2025].
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dwivedi-Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dessí</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raileanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lomeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cancedda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Scialom</surname>
          </string-name>
          ,
          <article-title>Toolformer: language models can teach themselves to use tools</article-title>
          ,
          <source>in: Proceedings of the 37th International Conference on Neural Information Processing Systems</source>
          , NIPS '23, Curran Associates Inc., Red Hook, NY, USA,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nakano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Balaji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Saunders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Eloundou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Button</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Knight</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <article-title>WebGPT: Browser-assisted question-answering with human feedback</article-title>
          ,
          <source>ArXiv abs/2112.09332</source>
          (
          <year>2021</year>
          ). URL: https://api.semanticscholar.org/CorpusID:245329531.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kung</surname>
          </string-name>
          ,
          <article-title>Elicit (product review)</article-title>
          ,
          <source>J. Can. Health Libr. Assoc.</source>
          <volume>44</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C. Y.</given-names>
            <surname>Haryanto</surname>
          </string-name>
          ,
          <article-title>LLAssist: Simple tools for automating literature review using large language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.13993. arXiv:2407.13993.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Dadri</surname>
          </string-name>
          ,
          <article-title>A New Era for Efficient Patent Invalidity Searches - XLSCOUT</article-title>
          , https://xlscout.ai/a-new-era-for-efficient-patent-invalidity-searches-with-llms,
          <year>2025</year>
          . [Accessed 22-07-2025].
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Raja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Munawar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mylonas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Delsoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Madadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. Abu</given-names>
            <surname>Serhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Inam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Munir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abd-Alrazaq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yousefi</surname>
          </string-name>
          ,
          <article-title>Automated category and trend analysis of scientific articles on ophthalmology using large language models: Development and usability study</article-title>
          ,
          <source>JMIR Form Res</source>
          <volume>8</volume>
          (
          <year>2024</year>
          )
          <fpage>e52462</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>de la Torre-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramírez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence to automate the systematic review of scientific literature</article-title>
          ,
          <source>Computing</source>
          <volume>105</volume>
          (
          <year>2023</year>
          )
          <fpage>2171</fpage>
          -
          <lpage>2194</lpage>
          . URL: https://doi.org/10.1007/s00607-023-01181-x. doi:10.1007/s00607-023-01181-x.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Raza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sapkota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karkee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Emmanouilidis</surname>
          </string-name>
          ,
          <article-title>Trism for agentic ai: A review of trust, risk, and security management in llm-based agentic multi-agent systems</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2506.04133. arXiv:2506.04133.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amirizaniani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavergne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Okada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Roosta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>LLMAuditor: A framework for auditing large language models using human-in-the-loop</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402.09346. arXiv:2402.09346.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I. O.</given-names>
            <surname>Gallegos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Barrow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Tanjim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <article-title>Bias and fairness in large language models: A survey</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>50</volume>
          (
          <year>2024</year>
          )
          <fpage>1097</fpage>
          -
          <lpage>1179</lpage>
          . URL: https://aclanthology.org/2024.cl-3.8/. doi:10.1162/coli_a_00524.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Dennstädt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Putora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hastings</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cihoric</surname>
          </string-name>
          ,
          <article-title>Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain</article-title>
          ,
          <source>Systematic Reviews</source>
          <volume>13</volume>
          (
          <year>2024</year>
          )
          <fpage>158</fpage>
          . URL: https://doi.org/10.1186/s13643-024-02575-4. doi:10.1186/s13643-024-02575-4.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rouzrokh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shariatnia</surname>
          </string-name>
          ,
          <article-title>LatteReview: A multi-agent framework for systematic re-</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>