<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LLM-Driven Clinical Trial Matching for Lung Cancer Patients: An Explainable Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vittoria Peppoloni</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Leone</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Mazzeo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Ferrarin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vanja Miskovic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Lo Russo</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Baili</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arsela Prelaj</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Corso</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electronic, Information and Bioengineering, Politecnico di Milano</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Epidemiology and Data Science, Fondazione IRCCS Istituto Nazionale dei Tumori di Milano</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Medical Oncology Department 1, Fondazione IRCCS Istituto Nazionale dei Tumori di Milano</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Matching lung cancer patients to clinical trials remains a labor-intensive and error-prone process. We present MedMatch, an explainable, on-premises system leveraging Large Language Models (LLMs) for automated patient-trial matching. We evaluated 35 lung cancer patients from IRCCS Istituto Nazionale dei Tumori, achieving 80% trial matching accuracy. As part of the pipeline, the system first extracts 12 clinical parameters from Electronic Health Records with 84.9% overall accuracy using LLaMA 3.1 8B, then performs trial matching using Gemma 3 27B. Performance ranged from perfect for demographics (100%) to weaker results on complex features such as mutations (71%) and line of therapy (77% accuracy, 36% F1-score). Benchmarking on a 10-patient subset showed that LLaMA 3.1 8B outperformed three alternative models, including the domain-specific MedLLaMA 2 (87.4% vs. 50% accuracy). Incorporating few-shot prompting with oncologist-curated examples further improved LLaMA's performance. However, hallucination analysis revealed unreliable behavior in cases with missing data, with hallucination rates ranging from 15.4% for previous treatments to 100% for ECOG status. MedMatch addresses these challenges by combining structured extraction with layered explainability through JSON outputs, eligibility justifications, and PDF evidence highlighting, while preserving data privacy via local deployment.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Electronic Health Records</kwd>
        <kwd>Explainability</kwd>
        <kwd>Clinical Trial Matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Lung cancer is biologically complex, with high molecular and phenotypic heterogeneity that demands
increasingly personalized therapeutic approaches. Clinical trials are critical to advancing treatment,
yet identifying eligible patients remains inefficient: clinicians must manually review patient records
against complex eligibility criteria, a process that is both time-consuming and error-prone.</p>
      <p>Several digital solutions have attempted to address this challenge. Early rule-based systems required
structured inputs and failed to capture the narrative richness of real-world records. Commercial
platforms such as IBM Watson for Oncology, ClinicalTrials.ai, and Mendel.ai have explored natural
language processing approaches, but they have faced criticism for limited validation, lack of transparency,
and privacy concerns due to cloud-based deployment. These shortcomings underline the need for
accurate, explainable, and privacy-preserving systems.</p>
      <p>
        Large Language Models (LLMs) represent a transformative opportunity. They can extract nuanced
information from unstructured text and support clinical decision-making [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], but their use in trial
matching raises key challenges: ensuring accuracy, avoiding hallucinations, maintaining explainability,
and protecting sensitive data.
      </p>
      <p>In this work we present MedMatch, an on-premises, explainable system for processing Electronic
Health Records (EHRs) of lung cancer patients. MedMatch extracts clinically relevant features, matches
patients to appropriate trials, and provides transparent explanations with visual evidence highlighting.</p>
      <p>By combining accuracy, interpretability, and local deployment, our system addresses the key limitations
of existing approaches.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Automated trial matching has been studied through commercial and academic efforts. Rule-based
systems such as the NCI Clinical Trials Search required manual data entry, while IBM Watson for
Clinical Trial Matching attempted to use NLP but was discontinued after limited clinical success. More
recent platforms (Mendel.ai, Deep 6 AI) improve automation but remain cloud-based and opaque,
offering little transparency or benchmarking [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>
        Academic research has explored machine learning and LLM approaches. BERT-based models achieved
moderate accuracy but required extensive manual annotation, while GPT-3 and GPT-4 have been tested
for criteria extraction, with promising but incomplete results, particularly regarding hallucination
control and explainability [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
        Beyond trial matching, LLMs have demonstrated strong potential in healthcare tasks such as note
summarization, diagnostic support, and information extraction. However, studies also document
limitations including bias and high hallucination rates, underscoring the need for rigorous validation
and interpretable outputs. Recent work on explainability has explored both traditional methods (LIME,
SHAP) and prompt-based strategies such as chain-of-thought and structured explanation templates
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Building on these foundations, MedMatch introduces schema-constrained extraction and evidence
highlighting, ensuring traceability from raw EHR text to eligibility decision.
      </p>
      <p>Table 1 summarizes how MedMatch differs from representative systems. Unlike commercial
cloud-based tools, it is deployed on-premises, enforces schema-driven feature extraction, and provides
multi-level explainability through JSON outputs, criteria analysis, and PDF highlighting.</p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Comparison of MedMatch with representative trial matching systems.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>System</th>
              <th>Deployment</th>
              <th>Explainability</th>
              <th>Privacy</th>
              <th>Benchmarking</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>IBM Watson Oncology</td>
              <td>Cloud</td>
              <td>Limited</td>
              <td>No</td>
              <td>No</td>
            </tr>
            <tr>
              <td>Mendel.ai</td>
              <td>Cloud</td>
              <td>Black-box</td>
              <td>No</td>
              <td>Limited</td>
            </tr>
            <tr>
              <td>Deep 6 AI</td>
              <td>Cloud</td>
              <td>Minimal</td>
              <td>No</td>
              <td>Limited</td>
            </tr>
            <tr>
              <td>MedMatch (ours)</td>
              <td>On-premises</td>
              <td>JSON + criteria + highlight PDF</td>
              <td>Yes</td>
              <td>LLM benchmarking</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-3">
      <title>3. System Architecture</title>
      <p>
        MedMatch employs a modular pipeline architecture with four components:
• a database management system managing the collection of clinical trials, providing efficient
storage and retrieval mechanisms through SQLAlchemy ORM [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] with a PostgreSQL [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] backend,
as configured through Flask’s application context.
• an input data processing module handling document parsing, text normalization, and
preparation for LLM processing. It accepts both PDFs and text entries, leveraging pdfplumber [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for text
extraction from PDF documents.
• a feature extraction engine leveraging LLaMA 3.1 8B to identify clinically relevant information
from unstructured text, converting narrative clinical descriptions into a structured JSON format
that captures key clinical attributes, while highlighting in the original PDF the source text
segments used for feature extraction.
• a trial matching and explanation module with Gemma 3 27B matching the extracted patient
features against the eligibility criteria of the available trials, generating multi-level explanations
for each match recommendation.
      </p>
      <p>These components are seamlessly integrated within a web-based interface, enabling clinicians to
interact with the system, visualize matching results, and explore detailed explanations. MedMatch
is implemented as a Flask web application, utilizing an Ollama server for LLM deployment. The
model operates locally, ensuring data privacy by avoiding any external data transmission, while GPU
acceleration provides fast and efficient processing.</p>
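      <p>As a concrete illustration, the local LLM calls described above can be sketched against Ollama's default REST endpoint. The payload shape follows Ollama's /api/generate API; the model tag and prompt are illustrative assumptions, not MedMatch's exact configuration.</p>

```python
import json
import urllib.request

# Ollama's default local endpoint; no data leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "llama3.1:8b") -> dict:
    # "format": "json" asks Ollama to constrain the reply to valid JSON,
    # matching the schema-driven extraction strategy used by the pipeline.
    # The model tag is an assumption (LLaMA 3.1 8B under its usual Ollama name).
    return {"model": model, "prompt": prompt, "format": "json", "stream": False}

def extract_features(prompt: str) -> dict:
    """POST the prompt to the local Ollama server and parse its JSON reply."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)  # Ollama wraps the generated text in "response"
        return json.loads(body["response"])
```

      <p>Because the endpoint is bound to localhost, swapping models (e.g., for the benchmarking in Section 5.1) only requires changing the model tag.</p>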
      <sec id="sec-3-1">
        <title>3.1. Feature Extraction Pipeline</title>
        <p>The feature extraction module processes unstructured EHR narratives to extract twelve clinically
relevant parameters:
• Demographics: Age, Gender
• Disease characteristics: Diagnosis, Stage
• Performance status: ECOG Performance Status, PD-L1 expression
• Genomic alterations: Mutations
• Metastatic involvement: Brain Metastasis
• Therapies: Line of Therapy, Previous Treatments, Concomitant Treatments
• Comorbidities</p>
        <p>Feature extraction relies on structured prompting with explicit schema definitions, guiding the
language model to produce a standardized JSON object. The schema enforces strict constraints (e.g.,
"stage" limited to I–IV) while allowing the use of "not mentioned" when information is absent. For
transparency, the LLM also records the source text for each feature, enabling direct verification within
the original PDF (Figure 1). All attributes are semantically aligned with trial eligibility requirements:
mutation data are restricted to actionable genes (e.g., EGFR, KRAS, MET ), PD-L1 expression is categorized
by clinical cutoffs (&lt;1%, 1–49%, &gt;50%), and previous treatments are constrained to those relevant for lung
cancer trials. By returning only the JSON object, the system minimizes hallucinations and extraneous
content, ensuring robust, machine-readable outputs for downstream matching.</p>
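        <p>The schema constraints above lend themselves to a lightweight post-hoc check. The following sketch uses hypothetical field names and value sets, not MedMatch's exact schema: it rejects any extracted value outside the allowed categories, so malformed LLM output never reaches the matching stage.</p>

```python
# Hypothetical allowed-value sets mirroring the schema constraints described
# in the text (stage limited to I-IV, PD-L1 in clinical cutoff bands, and
# "not mentioned" permitted when information is absent).
ALLOWED = {
    "stage": {"I", "II", "III", "IV", "not mentioned"},
    "pdl1_expression": {"<1%", "1-49%", ">50%", "not mentioned"},
    "ecog": {"0", "1", "2", "3", "4", "not mentioned"},
}

def validate_extraction(record: dict) -> list[str]:
    """Return a list of constraint violations; an empty list means usable JSON."""
    errors = []
    for field, allowed in ALLOWED.items():
        value = record.get(field, "not mentioned")  # absent field = missing info
        if value not in allowed:
            errors.append(f"{field}: unexpected value {value!r}")
    return errors

example = {"stage": "III", "pdl1_expression": "1-49%", "ecog": "5"}
print(validate_extraction(example))  # flags the out-of-range ECOG value
```

        <p>A record failing validation can be re-prompted or flagged for manual review rather than silently passed downstream.</p>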
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Trial Matching and Explainability</title>
        <p>In the trial matching step, the LLM utilizes the JSON representation of a patient’s features to compare
them against the inclusion and exclusion criteria of each trial in the database. For efficiency, the list of
available trials is processed in batches, a parameter that can be adjusted through system settings. For
each trial, the system generates a JSON output containing:
• Trial ID and title retrieved from the database
• A match score (0-100), calculated using a scoring system that starts with 100 points and subtracts
10 points with each unmet inclusion criterion and subtracts 50 points with each violated exclusion
criterion.
• An overall recommendation (“Eligible” or “Not Eligible”), determined by a threshold score of 70.
• A detailed criteria analysis, specifying which inclusion and exclusion criteria were met or violated.
• A brief summary explanation clarifying the reasoning behind the eligibility decision.
Trials with a match score above 70 are displayed on the platform, ranked by match score.</p>
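        <p>The scoring rule above can be stated in a few lines. This is a minimal sketch of the described arithmetic; clamping negative scores to zero and treating the 70 threshold as inclusive are our assumptions.</p>

```python
# Match-scoring rule as described: start at 100, subtract 10 per unmet
# inclusion criterion and 50 per violated exclusion criterion.
def match_score(unmet_inclusion: int, violated_exclusion: int) -> int:
    score = 100 - 10 * unmet_inclusion - 50 * violated_exclusion
    return max(score, 0)  # assumption: scores are clamped to the 0-100 range

def recommendation(score: int, threshold: int = 70) -> str:
    # assumption: the threshold of 70 is inclusive
    return "Eligible" if score >= threshold else "Not Eligible"

print(match_score(2, 0), recommendation(match_score(2, 0)))  # 80 Eligible
print(match_score(1, 1), recommendation(match_score(1, 1)))  # 40 Not Eligible
```

        <p>The 50-point penalty means a single violated exclusion criterion alone pushes a trial below the eligibility threshold, reflecting the asymmetric weight of exclusion criteria.</p>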
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Materials and Methods</title>
      <sec id="sec-4-1">
        <title>4.1. Study Cohorts</title>
        <p>We analyzed EHRs from 35 lung cancer patients recruited at IRCCS Istituto Nazionale dei Tumori (INT),
divided into:
• Benchmarking set: 10 patients for model comparison and prompt optimization
• Validation set: 25 patients for full pipeline evaluation</p>
        <p>
          The trial database comprised 17 lung cancer studies currently active at our institution,
programmatically retrieved from ClinicalTrials.gov [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] using their NCT identifiers. For each trial, key information
(phase, title, description, eligibility criteria, demographic restrictions, and recruitment status) was
extracted, standardized, and stored in a local relational database. This resource formed the basis for
trial matching and was also made accessible through a dedicated page within the web application.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Design</title>
        <p>The study was organized as a series of targeted experiments, each designed to isolate and assess a specific
component of the pipeline. We began by benchmarking four LLMs for feature extraction (LLaMA 3.1 8B,
DevStral 24B, Mistral 7B, and MedLLaMA 2) on the 10-patient set, in order to compare general-purpose
and domain-specific models. Based on these results, the best model was further evaluated with two
prompting strategies: zero-shot and few-shot prompting, the latter enriched with examples crafted by
an oncologist.</p>
        <p>Once the optimal feature extraction setup was identified (LLaMA 3.1 8B with few-shot prompting),
we assessed trial matching by comparing Gemma 3 27B and Qwen 2.5 32B on the benchmarking set.
The best-performing configuration (LLaMA + Gemma) was then validated on the 25-patient validation
set to evaluate end-to-end performance in a realistic setting.</p>
        <p>To better understand system behavior, we conducted two additional analyses. First, an ablation study
removed the feature extraction step, directly matching trials from raw PDFs to assess the added value of
structured extraction. Second, a hallucination analysis examined whether the model correctly reported
features as "not mentioned" rather than generating unsupported outputs.</p>
        <p>This ablation design allowed us to progressively benchmark components, optimize prompting, validate
the integrated pipeline, and probe system robustness under challenging conditions.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Methodology</title>
        <p>The evaluation framework assessed both the accuracy of feature extraction and the trial matching
performance. Feature extraction was evaluated by comparing LLM-extracted features with
expert-annotated ground truth data. Trial matching performance was measured by comparing the system’s
eligibility decisions with expert assessments, focusing on both the accuracy of predictions and the
clarity of generated explanations.</p>
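        <p>For concreteness, per-feature accuracy and F1 can be computed as sketched below. This is an illustrative protocol, not necessarily the authors' exact one: F1 here treats "not mentioned" as the negative class, so an invented value counts as a false positive and a missed value as a false negative.</p>

```python
# Illustrative per-feature evaluation against expert-annotated ground truth.
def accuracy(pred: list[str], truth: list[str]) -> float:
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def f1_vs_missing(pred: list[str], truth: list[str]) -> float:
    # Positive = any concrete value; negative = "not mentioned" (our assumption).
    tp = sum(p == t != "not mentioned" for p, t in zip(pred, truth))
    fp = sum(p != "not mentioned" and p != t for p, t in zip(pred, truth))
    fn = sum(t != "not mentioned" and p != t for p, t in zip(pred, truth))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical mutation annotations for four patients: one hallucinated value.
truth = ["EGFR", "not mentioned", "KRAS", "not mentioned"]
pred  = ["EGFR", "EGFR",          "KRAS", "not mentioned"]
print(accuracy(pred, truth))       # 0.75
print(f1_vs_missing(pred, truth))  # 0.8
```

        <p>This separation explains how a feature can show high accuracy but low F1, as observed for ECOG status and line of therapy in Section 5.4.</p>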
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. LLM Benchmarking for Feature Extraction</title>
        <p>We compared the performance of four privacy-preserving LLMs. The heatmap below (Figure 2) shows
comparative performance across them on the 10-patient benchmarking set. LLaMA 3.1 8B achieved the
highest average accuracy (87.4%), followed by Mistral 7B (84.1%), DevStral 24B (74.1%), and MedLLaMA
2 (50%).</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Prompt Engineering Impact</title>
        <p>The radar plot (Figure 3) compares zero-shot versus few-shot prompting strategies on the benchmarking
cohort using LLaMA 3.1 8B. Few-shot prompting, incorporating oncologist-crafted examples, improved
overall accuracy from 80.8% to 87.4% (+6.6 percentage points). The most substantial gains were
observed in complex features: metastases detection improved from 70% to 80%, concomitant
treatment from 60% to 100%, and previous treatments from 70% to 100%. Simple demographic
features showed minimal improvement.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Trial Matching Model Comparison</title>
        <p>With LLaMA 3.1 8B established as the feature extractor, we compared Gemma 3 27B and Qwen 2.5 32B
as trial matching engines on the 10-patient benchmarking set. By fixing the extracted features, the
evaluation isolated the performance of the matching logic itself. Table 2 summarizes the results.</p>
        <p>Gemma 3 27B outperformed Qwen 2.5 32B, correctly identifying eligible trials in 9 of 10 cases and
producing coherent, criterion-based explanations. Its outputs consistently traced inclusion and exclusion
criteria, yielding interpretable justifications. Qwen 2.5 32B, by contrast, achieved 6 correct matches,
one of which contained partially incorrect eligibility assumptions, while in the four mismatched cases
it hallucinated inclusion/exclusion criteria.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Full Pipeline Validation</title>
        <p>The optimized configuration (LLaMA 3.1 8B with few-shot prompting + Gemma 3 27B) was validated
on the full set of 35 patients. Table 3 shows feature extraction performance.</p>
        <p>The model reached an overall accuracy of 84.9% in feature extraction and 80% accuracy in trial
matching. Demographic features (gender, age) were extracted with perfect accuracy (100%). Disease
characterization showed strong accuracy, with diagnosis at 88% and stage at 85%, although the
corresponding F1-scores were lower (59% and 68%). ECOG performance status reached 91% accuracy but only
47% F1. Treatment-related variables varied: concomitant treatments achieved balanced performance
(85% accuracy, 74% F1), while previous treatments reached 80% accuracy but a lower F1-score (55%).
Line of therapy was particularly challenging, with 77% accuracy but the lowest F1-score (36%). Among
complex features, PD-L1 expression achieved the highest F1-score (78%) with 77% accuracy. Brain
metastasis detection yielded moderate results (77% accuracy, 53% F1), and genetic mutations showed
the lowest accuracy overall (71%).</p>
        <p>Of the 35 patients, 28 (80.4%) were correctly matched to eligible trials with the highest match score.
An additional 2 patients (5.7%) were correctly matched, but their eligible trials did not receive top scores.
One patient (2.8%) was not matched due to feature extraction errors, and 4 (11.1%) were incorrectly
classified as ineligible despite meeting trial criteria.</p>
        <p>Importantly, the system provides clear and logical explanations for each eligibility determination,
generating comprehensive, structured explanations that detail why a patient qualifies or does not qualify for a trial.
An example of this explanatory output is illustrated in Figure 5, which demonstrates how the system
highlights specific criteria contributing to the matching decision.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Ablation Study: Impact of Structured Feature Extraction</title>
        <p>To quantify the contribution of structured feature extraction, we compared the full pipeline against
direct PDF-to-matching (Table 4).</p>
        <p>While direct PDF matching reduced processing time from five minutes to one minute per patient, it
led to a substantial loss in performance, with accuracy dropping by 15%.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Missing Data Hallucination</title>
        <p>We investigated whether the model could correctly handle missing information in the EHR by returning
the label "not mentioned". When such information was absent, the model frequently generated plausible but incorrect values,
a behavior commonly referred to as hallucination in the LLM literature. This issue was particularly
evident for ECOG, PD-L1, mutations and previous treatments. Table 5 reports hallucination rates for
these features when the ground truth indicated missing information.</p>
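        <p>The hallucination rate can be formalized as follows: restrict to records where the ground truth marks the feature as missing, and count how often the model returned anything other than "not mentioned". The sketch below follows that definition (our reading of the measurement, stated as an assumption).</p>

```python
# Hallucination rate on truly-missing features: the fraction of
# ground-truth-missing cases where the model invented a concrete value.
def hallucination_rate(pred: list[str], truth: list[str]) -> float:
    missing = [(p, t) for p, t in zip(pred, truth) if t == "not mentioned"]
    if not missing:
        return 0.0  # no missing-data cases, so nothing to hallucinate
    return sum(p != "not mentioned" for p, _ in missing) / len(missing)

# e.g., ECOG absent from 3 records but a score was invented every time
print(hallucination_rate(["0", "1", "2"], ["not mentioned"] * 3))  # 1.0
```

        <p>Under this definition, a 100% rate (as observed for ECOG) means the model never returned "not mentioned" when the feature was genuinely absent.</p>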
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>This study introduced an explainable LLM-based system designed to address the challenge of matching
lung cancer patients with appropriate clinical trials. The system was developed with the objective of
streamlining the trial matching process by automatically extracting relevant clinical features from EHRs
and identifying eligible clinical trials while providing transparent explanations for the recommendations.
Our evaluation on the full dataset of 35 patients demonstrates 80% matching accuracy, highlighting the
feasibility of achieving high performance without compromising explainability or data privacy.</p>
      <p>Our initial benchmarking revealed notable disparities in feature extraction performance across
LLMs. LLaMA 3.1 8B emerged as the most effective model, reaching 87.4% accuracy on the 10-patient
benchmark set. Surprisingly, it substantially outperformed MedLLaMA 2, a domain-specific model
fine-tuned on medical literature, which achieved only 50%. This result challenges the assumption that
domain specialization guarantees better clinical performance. Instead, it suggests that general-purpose
models with broader training corpora and more advanced architectures may generalize better to the
fragmented, variable language of real-world EHRs than models trained on curated biomedical texts.</p>
      <p>Prompt engineering also emerged as a critical lever for performance. We observed a 6.6% gain in
overall accuracy using few-shot prompting, particularly when prompts were crafted by oncologists.
These carefully designed examples enhanced the model’s ability to reason over complex clinical features,
such as comorbidities or treatment history, more effectively than over simple demographic fields,
highlighting the importance of domain expertise not just in annotation but in prompt design as well.</p>
      <p>At the model level, significant variation in reasoning ability was observed. For instance, Qwen
2.5 32B underperformed Gemma 3 27B, not only in accuracy but also in reliability. Qwen frequently
hallucinated eligibility criteria, underscoring that raw parameter size is not a sufficient indicator of
clinical utility. Hallucinations in this context represent more than just noise; they pose tangible safety
risks in clinical decision-making.</p>
      <p>Our validation revealed a clear performance hierarchy: demographics and ECOG status achieved
near-perfect extraction due to standardized reporting, while complex features like mutations and line
of therapy showed concerning accuracy gaps. The exceptionally low F1-score for line of therapy
(36%) indicates fundamental difficulties in temporal reasoning: determining treatment sequences
requires understanding information spanning multiple documents, a task that challenges current LLM
architectures.</p>
      <p>The ablation study further reinforced this limitation. Removing the intermediate feature extraction
step led to a 15% drop in trial matching accuracy, clearly demonstrating that current LLMs cannot
reliably parse complex eligibility criteria directly from raw text. This challenges the trend toward fully
end-to-end LLM pipelines and supports the inclusion of structured intermediate representations as a
core architectural element.</p>
      <p>Our hallucination analysis revealed a critical safety concern with dramatic variability by feature type.
The model’s tendency to invent ECOG scores (100% hallucination rate) or assume molecular testing
was performed when absent poses direct clinical risks. A hallucinated EGFR mutation could falsely
suggest targeted therapy eligibility, while an invented performance status could inappropriately exclude
patients from trials. These patterns represent fundamental limitations requiring mitigation strategies
before clinical deployment.</p>
      <p>Despite these challenges, the system demonstrated robust trial matching performance, suggesting
resilience to individual extraction errors through redundant eligibility criteria and weighted scoring. The
multi-layered explainability module illustrated in Figure 5 offers multi-level transparency, combining
match scores, criteria-level justifications, and evidence highlighting. This layered approach marks
a significant step forward from black-box systems and aligns directly with clinician demands for
interpretability.</p>
      <p>To conclude, the local deployment architecture successfully balanced performance with privacy
concerns, demonstrating that modern LLMs can be effectively deployed within institutional boundaries
while maintaining competitive performance, though requiring substantial computational resources.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Limitations</title>
      <p>The present study has several significant limitations that must be considered when interpreting the
results. The limited sample size (35 patients) restricts the generalizability of our findings. The
single-institution source of the data introduces potential biases in documentation structure and content. The
predominance of specific histological subtypes and disease stages may have influenced overall system
performance. Additionally, the lack of significant demographic diversity limits our ability to assess
system behavior across different populations.</p>
      <p>The LLMs demonstrate systematic difficulties in processing complex temporal information and
managing clinical ambiguities. The tendency to generate information not present in the original data
represents a significant risk in clinical settings. The system also shows limitations in understanding
particularly complex or subjective eligibility criteria, such as those related to comorbidities or functional
status. Despite efforts to implement explainability mechanisms, the black-box nature of the model limits
the ability to directly address the causes of specific errors. Local deployment, while privacy-preserving,
entails significant computational requirements that might limit adoption in resource-constrained clinical
settings.</p>
      <sec id="sec-7-1">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the authors used ChatGPT for grammar and spelling checks. After
using this tool, the authors reviewed and edited the content as needed and take full responsibility for
the publication’s content.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Data Availability</title>
        <p>We thank all the patients who agreed to actively participate in the following studies: INT 23 22, INT
46 23, INT 68 24, INT 76 24, INT 192 23, INT 196 22, INT 225 24, INT 239 24, INT 247 23, INT 251 23,
INT 270 23. This work was supported by Fondazione IRCCS ‘Istituto Nazionale dei Tumori’.
The datasets presented in this article are not readily available because of patients’ privacy protection.
Requests to access the datasets should be directed to the corresponding author.</p>
      </sec>
      <sec id="sec-7-3">
        <title>Online Resources</title>
        <p>The sources for this paper are available via GitHub.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>S. M.S.,</surname>
          </string-name>
          <article-title>An overview of revolutionizing lung cancer management with ai: Current advances and future prospects</article-title>
          ,
          <source>International Journal of Pharmaceutical Sciences</source>
          <volume>3</volume>
          (
          <year>2025</year>
          )
          <fpage>884</fpage>
          -
          <lpage>909</lpage>
          . URL: https://www.ijpsjournal.com/article/An+Overview+of+Revolutionizing+Lung+Cancer+Management+with+AI+Current+Advances+and+Future+Prospects, accessed: 2025-05-21.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Calaprice-Whitty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Galil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Salloum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zariv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jimenez</surname>
          </string-name>
          ,
          <article-title>Improving clinical trial participant prescreening with artificial intelligence (ai): A comparison of the results of ai-assisted vs standard methods in 3 oncology trials</article-title>
          ,
          <source>Therapeutic Innovation &amp; Regulatory Science</source>
          <volume>54</volume>
          (
          <year>2020</year>
          )
          <fpage>69</fpage>
          -
          <lpage>74</lpage>
          . URL: https://doi.org/10.1007/s43441-019-00030-4. doi:10.1007/s43441-019-00030-4, epub 2020 Jan 6.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhiying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A meta-analysis of watson for oncology in clinical application</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>11</volume>
          (
          <year>2021</year>
          )
          <fpage>5792</fpage>
          . URL: https://doi.org/10.1038/s41598-021-84973-5. doi:10.1038/s41598-021-84973-5.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Idnay</surname>
          </string-name>
          , et al.,
          <article-title>Evaluating large language models on medical evidence summarization</article-title>
          ,
          <source>npj Digital Medicine</source>
          <volume>6</volume>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1038/s41746-023-00896-7. doi:10.1038/s41746-023-00896-7.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hegselmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sontag</surname>
          </string-name>
          ,
          <article-title>Large language models are few-shot clinical information extractors</article-title>
          , in:
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kozareva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>1998</fpage>
          -
          <lpage>2022</lpage>
          . URL: https://aclanthology.org/2022.emnlp-main.130/. doi:10.18653/v1/2022.emnlp-main.130.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain of thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>CoRR abs/2201.11903</source>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2201.11903. arXiv:2201.11903.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bayer</surname>
          </string-name>
          ,
          <article-title>SQLAlchemy: The database toolkit for Python</article-title>
          , https://www.sqlalchemy.org/,
          <year>2025</year>
          . Version 2.0.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <collab>The PostgreSQL Global Development Group</collab>
          ,
          <article-title>PostgreSQL: The world's most advanced open source relational database</article-title>
          , https://www.postgresql.org/,
          <year>2025</year>
          . Version 15.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Welsh</surname>
          </string-name>
          ,
          <article-title>pdfplumber: A Python library for extracting information from PDF files</article-title>
          , https://github.com/jsvine/pdfplumber,
          <year>2023</year>
          . Version 0.7.7.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <collab>U.S. National Library of Medicine</collab>
          ,
          <article-title>ClinicalTrials.gov API</article-title>
          , https://clinicaltrials.gov/data-api/api,
          <year>2025</year>
          . Accessed: 2025-05-21.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>