=Paper=
{{Paper
|id=Vol-3702/paper25
|storemode=property
|title=Assessing Generative Pre-trained Transformer 4 in Clinical Trial Inclusion Criteria Matching
|pdfUrl=https://ceur-ws.org/Vol-3702/paper25.pdf
|volume=Vol-3702
|authors=Vasyl Pasichnyk,Ivan Dyyak,Vitaliy Horlatch,Tymofii Pasichnyk
|dblpUrl=https://dblp.org/rec/conf/cmis/PasichnykDHP24
}}
==Assessing Generative Pre-trained Transformer 4 in Clinical Trial Inclusion Criteria Matching==
Ivan Dyyak1, Vitaliy Horlatch1, Tymofii Pasichnyk1, and Vasyl Pasichnyk1
1 Ivan Franko National University of Lviv, 1 Universytetska St., Lviv, 79000, Ukraine
Abstract
Clinical trials play a vital role in improving both quality of life and life expectancy. At the same
time, running a trial is a complicated process that requires considerable time, effort, and cost. Effective interpretation
and understanding of inclusion and exclusion criteria are key elements for streamlining trial
documentation, enhancing patient engagement, and facilitating data analysis. The nuances in clinical
criteria can significantly affect trial outcomes and participant selection across different linguistic and
cultural backgrounds. The aim of this study is to assess the capability of the Generative Pre-trained
Transformer 4 (GPT-4) model in comprehending and matching clinical trial inclusion criteria between
English and Ukrainian, focusing on the model's potential to automate document processing and improve
patient engagement in a multilingual context. The model was tested on 315 inclusion criteria from 15
clinical trials from ClinicalTrials.gov and their corresponding Ukrainian versions from
ClinicalTrials.dec.gov.ua. The GPT-4 model accurately identified and matched 91% of the criteria pairs
fully and demonstrated a detailed understanding of clinical language and nuances. An additional 4% of
pairs were matched partially. Altogether, this resulted in 95% accuracy for tasks that accept partial
matches. These findings indicate the model's substantial capacity for understanding complex medical
and clinical terminologies and for facilitating the automation of clinical documents. Moreover, the
experiment has already helped to identify several cases where the English and the Ukrainian versions
had significant differences in meaning, most likely caused by human error. The results of the study
open a wide range of opportunities for automating clinical documentation processes and enhancing
patient engagement through modern technologies. The scientific novelty is in applying Natural
Language Processing (NLP) to understand and process clinical trial criteria in different languages, in
this case Ukrainian, leading to more automated and patient-centered approaches in clinical research.
Keywords
Natural Language Processing, Generative Pre-trained Transformer, GPT-4, Clinical Trial Matching,
Inclusion Criteria
1. Introduction
The increasing globalization of clinical trials presents significant challenges and opportunities in
clinical research. One such challenge concerns the accessibility of trial participation across
different regions. Typically, clinical trials are conducted in many different countries.
According to the regulations, all the related documentation should be translated into the
corresponding languages. Also, in most cases, local Electronic Health Records (EHR) keep data in
local languages too.
The use of automated tools and natural language processing (NLP) technologies is becoming
increasingly common in addressing the complexities of multilingual documentation in clinical
trials. Despite advancements, there remains a significant gap in the understanding and
application of automated tools related to clinical trial criteria matching, especially across
different languages. This gap not only hampers the efficiency of global trial operations but also
impacts patient engagement and onboarding, which are crucial for the ethical and effective
conduct of clinical research.
The object of this study is the process of interpreting and understanding clinical trial inclusion
criteria in a multilingual context using computer science methods. The subject covers the methods and
CMIS-2024: Seventh International Workshop on Computer Modeling and Intelligent Systems, May 3, 2024,
Zaporizhzhia, Ukraine
ivan.dyyak@lnu.edu.ua (I. I. Dyyak); vitaliy.horlatch@lnu.edu.ua (V. M. Horlatch);
tymofiy.pasichnyk@lnu.edu.ua (T. V. Pasichnyk); vasyl.pasichnyk.apmi@lnu.edu.ua (V. T. Pasichnyk)
0000-0001-5841-2604 (I. I. Dyyak); 0000-0001-5401-1731 (V. M. Horlatch);
0000-0002-1495-7423 (T. V. Pasichnyk); 0009-0004-3738-6629 (V. T. Pasichnyk)
© 2024 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
technologies which are used to understand and match these criteria across languages, with a
particular focus on the English and Ukrainian language pair, assuming that it can be generalized
and extended to other languages too.
The goal of this work is to analyze and validate the use of artificial intelligence and the
Generative Pre-trained Transformer 4 (GPT-4) in particular to enhance the understanding and
application of clinical trial inclusion criteria in different linguistic contexts. This involves
identifying, comparing, and assessing the equivalence of clinical trial criteria between English and
Ukrainian, aiming to improve document automation processes, facilitate more effective patient
engagement, and streamline data analysis procedures.
The importance of this research stems from the ongoing need to improve inclusivity and
accessibility in global clinical trials, ensuring that non-English speaking populations are
adequately represented and can participate fully and safely in clinical research. By leveraging
advanced NLP tools like GPT-4, this study addresses a critical gap in current research
methodologies, offering a potential for significant advancements in clinical trial design, execution,
and analysis. The appropriateness of this article within the relevant field is underscored by its
contribution to improving clinical trial efficiency and participant comprehension, ultimately
promoting more fair and effective global health research.
2. Problem Statement
The core problem addressed in this study involves the accurate understanding and matching of
clinical trial inclusion criteria across different languages, specifically English and Ukrainian. This
problem can be formalized mathematically as follows.
Input Variables:
• A set of inclusion criteria in English extracted from ClinicalTrials.gov, which can be
represented by the formula
𝐼𝐸 = {𝑒1 , 𝑒2 , … , 𝑒𝑛 } (1)
where each ei represents a specific criterion.
• A corresponding set of inclusion criteria in Ukrainian obtained from
ClinicalTrials.dec.gov.ua, which can be represented by the formula
𝐼𝑈 = {𝑢1 , 𝑢2 , … , 𝑢𝑛 } (2)
where each ui represents a specific criterion in Ukrainian.
Output Variables:
• A set of matched pairs between English and Ukrainian inclusion criteria:
𝑀 = {(𝑒𝑖 , 𝑢𝑗 )|𝑒𝑖 ∈ 𝐼𝐸 , 𝑢𝑗 ∈ 𝐼𝑈 } (3)
where each pair consists of one English criterion and one Ukrainian criterion.
• A set of scores indicating the degree of match between pairs in M:
𝑆 = {𝑠1 , 𝑠2 , … , 𝑠𝑛 } (4)
where each si corresponds to the semantic similarity between ei and ui.
Results Quality Evaluation Criteria:
• Accuracy (Acc) – The proportion of correctly matched criteria pairs in M to the total
number of criteria pairs. It can be calculated by the formula:
𝐴𝑐𝑐 = |{(𝑒𝑖, 𝑢𝑗) ∈ 𝑀 | 𝑠𝑖 indicates a correct match}| / 𝑛 (5)
• Precision (P) – The proportion of true positive matches to the total number of positive
matches identified by the model.
• Recall (R) - The proportion of true positive matches to the total number of actual
positive matches across the datasets.
Necessary Definitions:
• Positive match identified by the model – a pair of criteria that the model identifies as a
match, regardless of whether this match is correct.
• True positive match – a pair of criteria that are correctly identified as equivalent by
the model.
• Actual positive match – a pair of criteria that an expert identifies as a match.
• False positive match – a pair of criteria that are identified as equivalent by the model
but are in fact not a match.
• False negative match – a pair of criteria that are identified by the model as not a match
but are in fact a match.
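Under these definitions, the three evaluation metrics can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' evaluation code; each pair carries the model's verdict and the expert's verdict as booleans:

```python
def evaluate(pairs, n_total):
    """Compute Acc, P, and R from labeled criterion pairs.

    pairs: list of (model_says_match, expert_says_match) booleans,
    one entry per (e_i, u_j) pair; n_total is the total number of pairs n.
    """
    tp = sum(m and e for m, e in pairs)        # true positive matches
    model_pos = sum(m for m, _ in pairs)       # positive matches identified by the model
    actual_pos = sum(e for _, e in pairs)      # actual positive matches (expert-labeled)
    correct = sum(m == e for m, e in pairs)    # model verdict agrees with the expert
    return {
        "accuracy": correct / n_total,
        "precision": tp / model_pos,
        "recall": tp / actual_pos,
    }
```

With a toy input such as `evaluate([(True, True), (True, False), (False, True), (False, False)], 4)`, all three metrics come out to 0.5, matching the definitions above term by term.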
3. Review of the Literature
The application of Artificial Intelligence (AI) and Natural Language Processing (NLP) in the field
of clinical trials is a rapidly emerging area [1]. It is frequently discussed at scientific and
business conferences, and a considerable number of scientific articles have been published on the topic. If
properly used, AI can significantly enhance operational efficiencies, patient engagement, and the
accuracy of data interpretation. Furthermore, the capacity of AI to innovate and refine the design
of clinical trials is recognized as a potential catalyst for increasing the efficacy and efficiency of
trial methodologies [2]. The advancements in these technologies underscore their potential to
transform the landscape of clinical research by optimizing trial procedures and ensuring more
targeted and effective patient engagement strategies.
A comprehensive overview of current capabilities and limitations of AI applications in clinical
trials was provided, underscoring the transition from artificial to applied intelligence and
highlighting the importance of these technologies in enhancing trial outcomes [3]. The
transformative role of machine learning in clinical research focuses on the potential to automate
and optimize various aspects of trial execution and evidence generation, aligning with findings
that NLP can enhance the standardization and utilization of clinical data [4–10].
One of the key elements of each clinical trial is eligibility criteria. An accurate and
comprehensive understanding of inclusion and exclusion criteria is fundamental to ensuring that
the proper patients are enrolled. These criteria not only define the scope and structure of a trial
but also ensure the safety and appropriateness of participants. Misinterpretations or
miscommunications in these criteria can lead to incorrect selection of participants, regulatory
issues, and even trial failure or health consequences for the participants [11, 12].
The need for precise matching of patients to trials based on inclusion and exclusion criteria is
a complicated process. It requires a lot of manual effort and collaboration between different
organizations and electronic systems. At the same time, considering that nowadays most of the
information is tracked in electronic records, this process has many opportunities to be
automated. The growing potential of NLP systems plays a critical role in streamlining the patient
recruitment process [13].
However, a significant gap exists in the application of NLP technologies for understanding and
matching clinical trial inclusion criteria across different languages. Research into NLP tasks in
clinical settings has not specifically addressed the challenges of multilingual criteria
interpretation, indicating a gap in clinical research where language barriers can hinder
participant enrollment and engagement [14].
Recent studies have explored the capabilities of GPT-4, a large language model, in clinical
settings, demonstrating its potential in outperforming traditional methods and medical
professionals in certain tasks. These findings provide insights into its language understanding
capabilities and suggest promising opportunities for utilizing GPT-4 in the interpretation and
application of clinical trial criteria, potentially overcoming language and comprehension barriers
[15–20].
In summary, while significant advancements have been made in AI and NLP applications
within clinical research, a critical gap remains in the context of multilingual inclusion criteria
interpretation. This study aims to bridge this gap by leveraging the capabilities of GPT-4, thus
contributing to the existing body of knowledge and addressing the identified limitations. The
research underscores the ongoing evolution of AI in medicine, aiming to enhance the accuracy,
efficiency, and inclusivity of clinical trial processes [21].
4. Materials and Methods
The theoretical foundation of this research is grounded in artificial intelligence and machine
learning, particularly in the context of natural language processing (NLP) and machine
translation (MT). Even though there are specialized models tuned for medicine (e.g., PubMedBERT,
BioGPT, Med-PaLM, etc.), recent studies have shown that GPT-4 can outperform them [22].
Therefore, we chose the GPT-4 model to analyze and compare inclusion criteria across different
languages in our study. The hypothesis posits that the GPT-4 model, trained on diverse linguistic
datasets, can accurately match and assess the equivalence of clinical trial inclusion criteria in
English and Ukrainian.
The research employs semantic analysis techniques inherent to GPT-4 to interpret and
compare the meaning of text segments from the inclusion criteria. These techniques involve the
decomposition of text into smaller, manageable units (bullets), enabling a detailed, point-by-point
comparison. The theoretical significance lies in assessing the model's ability to understand and
translate medical and clinical terminology accurately, a critical factor given the nuanced nature
of clinical trial criteria.
The methodological approach encompasses several key components:
• Data Preparation: To obtain a dataset close to real-world scenarios, the following
subsets were prepared and mixed:
1. Inclusion criteria of the same clinical trial in English and Ukrainian were taken
from the corresponding sites without any changes. Even though they had a
similar meaning, the text could be several pages long and semantically
structured in a completely different way.
2. The inclusion criteria were extracted and segregated into individual bullets.
The number of bullets in the English and Ukrainian versions could differ, as
could their content.
3. Similar to the above, except that the English and Ukrainian criteria were
intentionally taken from different clinical trials.
This step ensured that the model was evaluated under varied conditions, enriching the
usability and applicability of the study.
• Model Selection: We utilized the GPT-4 model for identifying and comparing criteria
pairs. This involved configuring the model to process pairs of text bullets and assess
their semantic equivalence across languages. The model's performance metrics were
predetermined to evaluate its accuracy in identifying exact matches, partially correct
matches, and discrepancies.
• Analytical Techniques: The study applied quantitative methods to analyze the
model's output, focusing on the proportion of fully correct, partially correct, and
incorrect matches. Statistical analysis was used to assess the model's accuracy and
reliability in translating and matching criteria across languages.
• Validation Process: To ensure the validity of the research results, a subset of the GPT-
4-identified pairs was manually reviewed by an expert. This step served to cross-verify
the accuracy of the automated assessments and to identify potential areas for model
improvement.
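The bullet-segmentation part of the data-preparation step above can be sketched as follows. The splitting rule is an assumption for illustration; real ClinicalTrials.gov records may use other layouts and need trial-specific handling:

```python
import re

def split_into_bullets(raw_text):
    """Split a raw inclusion-criteria section into individual bullet strings.

    Assumes bullets start with '-', '*', a bullet character, or a number like
    '1.' or '1)' (an assumption; actual records may be formatted differently).
    """
    parts = re.split(r"\n\s*(?:[-*\u2022]|\d+[.)])\s+", "\n" + raw_text.strip())
    # Drop empty fragments and rejoin wrapped lines within a bullet.
    return [p.strip().replace("\n", " ") for p in parts if p.strip()]

criteria = split_into_bullets(
    "1. Age 18 years or older\n"
    "2. Has adequate organ function within 7 days prior to the start of study treatment"
)
# -> ["Age 18 years or older",
#     "Has adequate organ function within 7 days prior to the start of study treatment"]
```

The same function can be applied independently to the English and Ukrainian texts, yielding the two bullet sets whose counts and contents may differ, as noted above.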
The research was conducted in a sequential manner, beginning with the collection and
preparation of data, followed by the deployment of the GPT-4 model, preparation of
corresponding prompts, data processing, analysis, and validation of the results. The important
step was related to the manual assessment and interpretation of the results. This sequence was
designed to ensure a logical progression of tasks and to build a coherent dataset for analysis.
The selection of the GPT-4 model as the primary research tool was justified by its advanced
language processing capabilities and its proven effectiveness in similar linguistic tasks.
Nowadays, it is easily accessible to different companies and organizations, which provides
a significant boost for further research and development in this area. The use of statistical analysis
for evaluating the model's performance provided a quantitative basis for assessing accuracy,
reliability, and validity.
5. Experiments
The experiment started with data collection. We created a dataset containing 315 inclusion
criteria extracted from 15 distinct clinical trials. Each criterion contained up to 400 words.
English criteria were collected from publicly available records on ClinicalTrials.gov [23], a
database of privately and publicly funded clinical studies conducted around the world. The
corresponding Ukrainian translations were obtained from ClinicalTrials.dec.gov.ua, the official
Ukrainian counterpart hosting clinical trial information.
As described in the previous section, several types of data collection took place. It is worth
mentioning that even though typically inclusion criteria in the English and Ukrainian versions
should be the same, in reality, they might have many additional details related to local regulations,
measurement details, etc. For example, the English version of a criterion might look like “Has
adequate organ function within 7 days prior to the start of study treatment” while the Ukrainian
version will contain a detailed list of laboratory values to be met in order to be evaluated as
“adequate organ function”.
For each clinical trial, a separate but otherwise identical prompt (differing only in its input
parameters) was created to establish proper isolation. The prompt contained the instructions, a
numbered set of inclusion criteria in English, and a numbered set of inclusion criteria in
Ukrainian. The preprocessed inclusion criteria were fed into the OpenAI GPT-4 model
hosted on an isolated Microsoft Azure instance. The model’s temperature was set to 0 to avoid
randomness and enhance the model’s focus and predictability.
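The per-trial prompt layout described above can be sketched as follows. The instruction wording, the endpoint, and the deployment name are hypothetical reconstructions, not the authors' exact configuration:

```python
def build_prompt(english_criteria, ukrainian_criteria):
    """Assemble one isolated prompt per trial: instructions plus two numbered lists.

    The wording is a hypothetical reconstruction, not the paper's actual prompt.
    """
    en = "\n".join(f"E{i}. {c}" for i, c in enumerate(english_criteria, 1))
    uk = "\n".join(f"U{i}. {c}" for i, c in enumerate(ukrainian_criteria, 1))
    return (
        "Match each English inclusion criterion to its Ukrainian equivalent.\n"
        "Label every pair as: Full Match, Partial Match, "
        "No Ukrainian equivalent, or No English equivalent.\n\n"
        f"English criteria:\n{en}\n\n"
        f"Ukrainian criteria:\n{uk}"
    )

# A hypothetical call through the Azure-hosted OpenAI API (endpoint, key, and
# deployment name are placeholders):
# from openai import AzureOpenAI
# client = AzureOpenAI(azure_endpoint="https://<instance>.openai.azure.com",
#                      api_key="<key>", api_version="2024-02-01")
# reply = client.chat.completions.create(
#     model="gpt-4",      # Azure deployment name
#     temperature=0,      # deterministic output, as in the experiment
#     messages=[{"role": "user", "content": build_prompt(en_list, uk_list)}],
# )
```

Keeping the prompt construction pure (a plain function of the two criteria lists) makes the per-trial isolation easy to verify and the experiment straightforward to replicate.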
The model's task was to identify corresponding pairs of criteria between the English and
Ukrainian datasets and then to assess whether the pairs matched completely or exhibited any
meaningful differences. The outputs were systematically recorded, categorizing each pair into
'Full Match', 'Partial Match', ‘No Ukrainian equivalent’, ‘No English equivalent’.
All outputs were collected together in a single table. Each result was manually reviewed by
experts to validate the accuracy of the model's assessments. Then the data was summarized in
different views and the corresponding parameters were calculated by the formulas defined in the
“Problem Statement” section.
This setup ensures that the experiments can be replicated by other researchers or automated
tools, assuming they use the same datasets and GPT-4 model configuration. It provides a clear
and systematic approach to evaluating NLP models’ accuracy in cross-linguistic analysis of
clinical trial inclusion criteria.
6. Results
The summary of the results of the experiments is presented in Table 1.

Table 1
Summary of the results

Category                    GPT-4    Manual assessment
Full match                  84       69
Partial match               30       38
Significant difference      4        12
No Ukrainian equivalent     16       15
No English equivalent       63       63
Total                       197      197
These results are illustrated graphically in Fig. 1.
Figure 1: Graphical representation of the results.
We then structured the data as shown in Table 2 for accuracy, precision, and recall
calculations. These metrics were selected for their relevance in evaluating the performance of
NLP models, particularly in tasks involving text matching and semantic analysis.
Table 2
Summary of the results validation

Parameter                                                      Value
True positive matches                                          97
Total number of positive matches identified by the model       114
Total number of actual positive matches across the datasets    107
Based on that, we got the following results:
• Accuracy (Acc). The model achieved an overall accuracy rate of 91% in matching the
correct pairs of inclusion criteria between the two languages. This measure indicates
the proportion of all criteria pairs for which the model's predictions were correct to
the total number of pairs.
• Precision (P). The model demonstrated a precision rate of 85%, reflecting the
proportion of true positive matches to all matches identified by the model as correct.
• Recall (R). The recall rate obtained in the study was 91%. This metric measures the
proportion of true positive matches to actual positive matches that the model
successfully identified.
Since it is difficult to define the exact boundary between “Full match” and “Partial match”, and
since for some classes of tasks it is important only to identify whether there is a match or not, it is
also worth examining the results with “Full match” and “Partial match” combined into a single
“Match” category. In this case, the data look as follows.
Table 3
Summary of the results without the breakdown into “Full match” and “Partial match”

Category                    GPT-4    Manual assessment
Match                       114      107
Significant difference      4        12
No Ukrainian equivalent     16       15
No English equivalent       63       63
Total                       197      197
These results are illustrated graphically in Fig. 2.
Figure 2: Graphical representation of the results without the breakdown into “Full match” and
“Partial match”.
The summary of the resulting calculations is presented in Table 4.

Table 4
Summary of the results validation without the breakdown into “Full match” and “Partial match”

Parameter                                                      Value
True positive matches                                          106
Total number of positive matches identified by the model       114
Total number of actual positive matches across the datasets    107
Accuracy (Acc), precision (P), and recall (R) in this case will be the following:
• Acc = 95%
• P = 93%
• R = 99%
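The precision and recall figures follow directly from the Table 4 counts and can be checked in a couple of lines:

```python
# Counts from Table 4 (the combined "Match" category):
tp = 106          # true positive matches
model_pos = 114   # positive matches identified by the model
actual_pos = 107  # actual positive matches across the datasets

precision = tp / model_pos   # 106 / 114 ≈ 0.93
recall = tp / actual_pos     # 106 / 107 ≈ 0.99
```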
Even though the analysis of the incorrect results is outside the scope of this research, we decided
to examine the correlation between the length of a criterion in the English and Ukrainian
versions and the matching results. It is assumed that longer texts may contain more nuances, which could
either increase or decrease the model's ability to find a correct match, depending on linguistic and
semantic complexity. We calculated the length of each inclusion criterion by counting the words
in both the English and Ukrainian texts. This correlation is shown in Figure 3. Although
no correlation is visually apparent, this data can inform further discussion and research.
Figure 3: The correlation between the length of a criterion and the results.
In addition to the scientific observations, the experiments revealed four cases of unexpected,
significant differences in meaning. In the first case, the English version of the inclusion criteria
read: “Full recovery from cystectomy and enrollment within 24 weeks following cystectomy.”,
while the Ukrainian version mentioned 14 weeks. In the second example, patients with tuberculosis were specifically mentioned
were mentioned. In the second example, patients with tuberculosis were specifically mentioned
in the English version of the exclusion criteria but not in the corresponding Ukrainian version. In
the third and the fourth, the Ukrainian version included additional screening requirements for
hepatitis B and C, which were not mentioned in the English version provided.
7. Discussion
The findings from this study illustrate the advanced capabilities of the GPT-4 model in
understanding and accurately matching clinical trial inclusion criteria across languages,
specifically between English and Ukrainian. The high performance in terms of accuracy, precision,
and recall indicates a notable advancement in the application of natural language processing
technologies in clinical research. This opens a variety of opportunities for further research and
development.
The GPT-4 model's language understanding capabilities not only validate the approach but also
highlight its potential to mitigate linguistic barriers in global clinical trials. For cases where a
deep understanding of the eligibility criteria is required, the success rate of 91% in accurately
matching criteria pairs between English and Ukrainian demonstrates the model's strong potential.
The precision rate of 85% indicates a high level of specificity in the model's output, reducing
the likelihood of false positives and thereby enhancing the reliability of trial participant matching.
Furthermore, the recall rate of 91% suggests that the model effectively captures almost all of the
relevant matches, i.e. minimizing the risk of overlooking suitable trial participants due to
language variations.
For cases that do not require a very deep understanding of the eligibility criteria, the
results are even better. The accuracy of 95%, precision of 93%, and recall of 99% indicate that the
model is ready to be used in real-life scenarios, for instance, in finding discrepancies between
clinical trial documentation in different languages. Even within the scope of this research,
several such cases were identified. Another example is automated matching followed by
human validation.
Additional practical implications of this research are diverse. It can be used for automation
and validation of clinical documentation, clinical trial recruitment, patient comprehension and
engagement, adherence to regulatory compliance, and others. Also, it can be used for integrations
between digital health systems. By leveraging the capabilities of advanced NLP technologies, this
research contributes to making clinical research more efficient, inclusive, and accessible.
The promising results of this study justify further investigation into the application of artificial
intelligence in clinical research. Future studies could explore the model's applicability across a
broader range of languages and medical specializations to understand its full potential and
limitations. This can lead to more inclusive, diverse, and representative clinical research,
contributing to the advancement of global health outcomes.
Additional analysis of the incorrect matches and exploration of correlations between different
parameters and the matching results open new opportunities for further research. The complexity
of trial inclusion and exclusion criteria could impact the effectiveness of AI tools in clinical
settings. This observation could inform future AI model training processes, emphasizing the need
for diverse and comprehensive datasets that combine various clinical documents in different
languages.
Even though the study has demonstrated strong results using a general-purpose GPT-4 model,
as confirmed by other studies, it is worth keeping in mind specialized models tuned for
medicine (e.g., PubMedBERT, BioGPT, Med-PaLM 2, BioBERT, and BioMegatron). For some cases,
their usage will be more feasible and relevant. The upcoming release of GPT-5 is another
anticipated development of interest for further research.
In conclusion, it is important to keep in mind ethical considerations. Patient data, especially in
clinical trials, is highly sensitive. Also, AI systems can make mistakes which potentially might
affect humans and have severe consequences. These ethical challenges are not unique to our study
but reflect broader concerns within the AI research community, especially as AI tools become
more integrated into healthcare and clinical research. Efforts must be made to ensure that the
corresponding systems are trained on diverse datasets, regularly evaluated, and compliant.
Qualified medical personnel should always verify the results.
8. Conclusions
This research addresses the challenges of understanding and matching clinical trial inclusion
criteria across languages, specifically English and Ukrainian, utilizing the GPT-4 model. The main
scientific task was to assess whether advanced natural language processing (NLP) technologies
could enhance the accuracy and efficiency of cross-lingual analysis in clinical research settings.
The GPT-4 model has demonstrated a robust capability to understand and match clinical trial
inclusion criteria between English and Ukrainian versions, showcasing high accuracy, precision,
and recall.
The application of GPT-4 represents a significant advancement in bridging language barriers
in clinical research, contributing to more efficient, inclusive, and equitable clinical trial processes.
The study's quantitative and qualitative results provide a strong basis for the practical application
of artificial intelligence in enhancing global health research methodologies in multilingual
environments.
Scientific Novelty. The use of GPT-4 for matching English and Ukrainian clinical trial inclusion
criteria demonstrated a high degree of scientific novelty, marking the first instance of
such an application in the field. By breaking down and analyzing inclusion criteria, the study
demonstrates how deep learning models can analyze and understand complex medical
terminology and nuances across languages. Compared to previously known methods, the GPT-4
model has shown superior performance, which allows for a more accurate and inclusive
participant selection process, decreasing linguistic biases and barriers in global clinical research,
and provides other advantages.
Practical Significance. The findings offer a practical framework for applying AI and NLP
technologies to improve global clinical trial processes, particularly in patient recruitment and
documentation. The research outcomes can be used by Clinical Research Organizations (CRO),
researchers, trial coordinators, and regulatory bodies to enhance the inclusivity, efficiency, and
accuracy of clinical trials across different linguistic regions. It can also be of use to patients
struggling with severe diseases who are searching for the most recent treatments that could
potentially help and offer hope for a better and healthier life.
Recommendations for Practical Use include integrating NLP tools like GPT-4 into clinical
trial management systems to automate and optimize the screening and matching of participants.
In particular, considering the positive results of Ukrainian clinical text interpretation, Clinical
Research Organizations (CRO) should seriously consider integrations with E-health (a medical
information system for the Ministry of Healthcare of Ukraine) or local Electronic Health Records
(EHR) systems like HELSI, DrEleks, and others.
Another area of use can be related to centralized data validation and minimization of human
errors. Even during the experiments of this study, we identified four significant differences
between the English and Ukrainian inclusion criteria, most likely caused by human error.
Such mistakes can potentially cause incorrect patient selection and affect patient safety as well as
the clinical study's results.
Further research should explore the application of GPT-4 and similar models across a
broader range of languages and clinical contexts. Investigating the integration of NLP tools with
other digital health systems like Clinical Trial Management Systems (CTMS) and Electronic Health
Records (EHR) could yield additional improvements in clinical trial design and execution.
Longitudinal studies to track the impact of NLP integration on trial outcomes and patient
diversity could provide deeper insights into the long-term benefits and challenges of AI in clinical
research.
9. Acknowledgements
The authors would like to thank the anonymous reviewers for their detailed and constructive
feedback.
References
[1] F. Cascini, F. Beccia, F.A. Causio, A. Melnyk, A. Zaino, W. Ricciardi. Scoping review of the
current landscape of AI-based applications in clinical trials. Frontiers in Public Health.
10:949377 (2022). doi:10.3389/fpubh.2022.949377
[2] Bin Zhang, Lu Zhang, Q. Chen, Zhe Jin, S. Liu, S. Zhang. Harnessing artificial intelligence to
improve clinical trial design. Communications Medicine 3, 191 (2023).
doi:10.1038/s43856-023-00425-3
[3] A. F. Hernandez, C. J. Lindsell. The Future of Clinical Trials: Artificial to Augmented to
Applied Intelligence. JAMA 330(21) (2023) 2061–2063. doi:10.1001/jama.2023.23822
[4] E. H. Weissler, T. Naumann, T. Andersson et al. The role of machine learning in clinical
research: transforming the future of evidence generation. Trials 22, 537 (2021).
doi:10.1186/s13063-021-05489-x
[5] K. Kreimeyer, M. Foster, A. Pandey, N. Arya, G. Halford, S. F. Jones, R. Forshee, M. Walderhaug,
T. Botsis. Natural language processing systems for capturing and standardizing unstructured
clinical information: A systematic review. Journal of Biomedical Informatics 73 (2017) 14-29.
doi:10.1016/j.jbi.2017.07.012
[6] Y. Gao, D. Dligach, L. Christensen, S. Tesch, R. Laffin, D. Xu, T. Miller, O. Uzuner, M. M. Churpek,
M. Afshar. A scoping review of publicly available language tasks in clinical natural language
processing. Journal of the American Medical Informatics Association 29(10) (2022) 1797-1806.
doi:10.1093/jamia/ocac127
[7] E. Pons, L. M. M. Braun, M. G. M. Hunink, J. A. Kors. Natural Language Processing in
Radiology: A Systematic Review. Radiology 279(2) (2016) 329-343.
doi:10.1148/radiol.16142770
[8] T. Cai, A. A. Giannopoulos, S. Yu, T. Kelil, B. Ripley, K. K. Kumamaru, F. J. Rybicki, D. Mitsouras.
Natural Language Processing Technologies in Radiology Research and Clinical Applications.
Radiographics 36(1) (2016). doi:10.1148/rg.2016150080
[9] A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan et al. Large language models in medicine.
Nature Medicine 29 (2023) 1930–1940. doi:10.1038/s41591-023-02448-8
[10] I. J. B. Young, S. Luz, N. Lone. A systematic review of natural language processing for
classification tasks in the field of incident reporting and adverse event analysis.
International Journal of Medical Informatics 132 (2019).
doi:10.1016/j.ijmedinf.2019.103971
[11] A. Wong, J. M. Plasek, S. P. Montecalvo, L. Zhou. Natural Language Processing and Its
Implications for the Future of Medication Safety: A Narrative Review of Recent
Advances and Challenges. Pharmacotherapy 38(8) (2018) 822-841.
doi:10.1002/phar.2151
[12] L. P. Nijhawan, M. D. Janodia, B. S. Muddukrishna, K. M. Bhat, K. L. Bairy, N. Udupa,
P. B. Musmade. Informed consent: Issues and challenges. Journal of Advanced Pharmaceutical
Technology & Research 4(3) (2013). doi:10.4103/2231-4040.116779
[13] J. Kim, Y. Quintana. Review of the Performance Metrics for Natural Language Systems for
Clinical Trials Matching. Studies in Health Technology and Informatics 290 (2022) 641-644.
doi:10.3233/SHTI220156
[14] C. Fang, Y. Wu, W. Fu, J. Ling, Y. Wang, X. Liu, Y. Jiang, Y. Wu, Y. Chen, J. Zhou, Z. Zhu, Z. Yan,
P. Yu, X. Liu. How does ChatGPT-4 perform on non-English national medical licensing
examination? An evaluation in Chinese language. PLOS Digital Health 2(12) (2023).
doi:10.1371/journal.pdig.0000397
[15] S. Voloshyn, V. Vysotska, O. Markiv, I. Dyyak, I. Budz, V. Schuchmann. Sentiment Analysis
Technology of English Newspapers Quotes Based on Neural Network as Public Opinion
Influences Identification Tool, in: 2022 IEEE 17th International Conference on Computer
Sciences and Information Technologies (CSIT), Lviv, Ukraine, 2022, pp. 83-88.
doi:10.1109/CSIT56902.2022.10000627
[16] D. Ueda, S. L. Walston, T. Matsumoto et al. Evaluating GPT-4-based ChatGPT's clinical
potential on the NEJM quiz. BMC Digital Health 2, 4 (2024). doi:10.1186/s44247-023-00058-5
[17] N. Oh, G.-S. Choi, W. Y. Lee. ChatGPT goes to the operating room: evaluating GPT-4
performance and its potential in surgical education and training in the era of large language
models. Annals of Surgical Treatment and Research 104(5) (2023) 269-273.
doi:10.4174/astr.2023.104.5.269
[18] E. Waisberg, J. Ong, M. Masalkhi, S. A. Kamran, N. Zaman, P. Sarker, A. G. Lee, A. Tavakkoli.
GPT-4: a new era of artificial intelligence in medicine. Irish Journal of Medical Science
192(6) (2023) 3197-3200. doi:10.1007/s11845-023-03377-8
[19] G. A. Guerra, H. Hofmann, S. Sobhani, G. Hofmann, D. Gomez, D. Soroudi, B. S. Hopkins, J. Dallas,
D. J. Pangal, S. Cheok, V. N. Nguyen, W. J. Mack, G. Zada. GPT-4 Artificial Intelligence Model
Outperforms ChatGPT, Medical Students, and Neurosurgery Residents on Neurosurgery
Written Board-Like Questions. World Neurosurgery 179 (2023) e160-e165.
doi:10.1016/j.wneu.2023.08.042
[20] S.-W. Lee, W.-J. Choi. Utilizing ChatGPT in clinical research related to
anesthesiology: a comprehensive review of opportunities and limitations. Anesthesia and
Pain Medicine 18(3) (2023) 244-251. doi:10.17085/apm.23056
[21] M. Rosoł, J. S. Gąsior, J. Łaba, K. Korzeniewski, M. Młyńczak. Evaluation of the performance
of GPT-3.5 and GPT-4 on the Polish Medical Final Examination. Scientific Reports 13, 20512
(2023). doi:10.1038/s41598-023-46995-z
[22] H. Nori, Y. T. Lee, S. Zhang, D. Carignan, R. Edgar, N. Fusi, N. King, J. Larson, Y. Li, W. Liu,
R. Luo, S. M. McKinney, R. O. Ness, H. Poon, T. Qin, N. Usuyama, C. White, E. Horvitz. Can
Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in
Medicine. arXiv preprint (2023).
[23] A. Wang, X. Xiu, S. Liu, Q. Qian, S. Wu. Characteristics of Artificial Intelligence Clinical Trials
in the Field of Healthcare: A Cross-Sectional Study on ClinicalTrials.gov. International Journal
of Environmental Research and Public Health 19(20) (2022). doi:10.3390/ijerph192013691