<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>information⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sourav Maiti</string-name>
          <email>souravmaiti@rcsi.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Syeda Mah-e-Fatima</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qurratal Ain Fatimah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ali Hasnain</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dementia Management, Standard Clinical Guidelines (SCG)</institution>
          ,
          <addr-line>AI Chatbots, ChatGPT-3.5, ThinkAny, Healthcare</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Pharmacy and Biomedical Sciences, Royal College of Surgeons in Ireland</institution>
          ,
          <addr-line>Dublin</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University Hospital Galway</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Artificial intelligence (AI) has emerged as a promising technology in healthcare to answer questions and aid in clinical decision-making processes. This research presents a comparative analysis of three technologies: Google (search engine), ChatGPT-3.5 (AI chatbot), and ThinkAny (AI search engine), used to answer questions about dementia management, focusing on the results each generates and comparing them against a benchmark of standard clinical guidelines. The approach relies on posing questions to these technologies and systematically analyzing the generated responses. The methodology involves a series of statistical tests, including the Friedman, Wilcoxon signed-rank, and Mann-Whitney U tests, to evaluate the responses generated by each technology against the standard clinical guidelines (SCG).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Dementia, one of the most common disorders characterized by cognitive decline, affects millions of
people and poses significant challenges for patients, healthcare professionals, and formal and informal
caregivers worldwide. According to Nichols et al.[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the number of people estimated to have dementia
is expected to increase from 57 million cases globally in the year 2019 to an estimated 153 million cases
in the year 2050. The increase in the number of dementia patients poses a significant challenge for the
global healthcare system to effectively deal with the medical management issues in dementia[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Effective management of dementia involves addressing the cognitive, emotional, and physical needs
of individuals living with dementia so that they can experience improved quality of life. It also involves
providing support for their caregivers so they can provide optimal care to people with dementia. It
requires adherence, knowledge, and understanding [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] of established clinical guidelines like the one
from the European Academy of Neurology 1. These guidelines 2 provide recommendations on how to
address various management issues related to dementia, including follow-up, vascular risk factors in
dementia, and pain management in dementia [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Information retrieval systems (IRS) are tools that help in finding information within a large collection
of data[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Search engines such as Google and Bing were, and still are, popular IRS tools for
accessing and searching medical information[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Search engines help users find relevant information and additional
resources on different topics. However, the accuracy, credibility, and relevance of the information
retrieved by different search engines vary. For results obtained from a search engine, the user has to
identify accurate and relevant information from a credible and trustworthy source[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Currently,
search engines place sponsored posts at the top of certain search results, meaning they are
paid to rank content highly even though it may not be
accurate, relevant, or from a credible source. These factors raise questions about the reliability of the
information readily available to patients, healthcare professionals, and formal and informal caregivers looking for
responses to questions related to managing dementia.
      </p>
      <p>
        The importance of IRS in healthcare[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is growing significantly and, to the best of our
knowledge, there is a lack of credible research comparing the performance of different systems for
answering questions related to medical management guidelines for dementia patients, researchers,
and caregivers. This research aims to address this gap by investigating how technologies like Google,
AI Chatbot (ChatGPT-3.5) and AI search engine (ThinkAny) respond to specific dementia management
questions taken from the European Academy of Neurology guidelines[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Our work uses a Likert
scale rating system[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to evaluate the accuracy, clarity, simplicity, relevance, and usefulness of the
retrieved information from each technology for dementia management questions. We apply statistical
analysis techniques including the Friedman test[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Wilcoxon signed-rank test[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Mann-Whitney
U test[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] with Bonferroni correction, to identify differences in the performance of the technologies
across the questions. By comparing the performance of these technologies through statistical analysis,
we aim to gain valuable insights into the strengths, limitations, and effectiveness of these emerging
technologies for information retrieval on dementia management issues.
      </p>
      <p>Further sections of this paper are organized as follows. Section 2 summarizes relevant studies
exploring the accuracy and effectiveness of technologies providing medical information. Section
3 describes the research methodology and our approach, including the selection of clinical guideline
questions, the rationale for selecting the three technologies, and the data collection and statistical
analysis process. Section 4 presents the findings of the study and provides a detailed discussion of the
results generated by the different technologies, comparing their performance with the clinical guideline.
Section 5 addresses the limitations of this paper, and lastly, Section 6 presents the conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Over the last decade, the field of Artificial Intelligence (AI) and Large Language Models (LLMs) has been
contributing to revolutionizing different sectors, especially the healthcare domain [13][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. LLMs [14][15]
are a type of AI system, based on the transformer model [16], designed to understand and generate
human-like text based on inputs [17]. The introduction of advanced AI models such as LLMs and
their transformer-based architectures [16], such as GPT (Generative Pre-trained Transformer)
and BERT (Bidirectional Encoder Representations from Transformers), has fundamentally transformed
how information is processed[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Compared with traditional methods of processing sequential data
using Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), which process
information sequentially, transformers have the capability to process entire sentences at
once, focusing on the relationships between words. Transformers rely on the “self-attention” [16]
mechanism to capture the dependencies between different elements of the input, instead of the recurrence
approach, which enables parallel data processing and captures long-range dependencies in the input.
(Footnote 1: https://www.ean.org/, last accessed 05-03-2025. Footnote 2: https://www.ean.org/research/ean-guidelines, last accessed 05-03-2025.)
Additionally, transformers can handle short and long input sequences efficiently (parallel processing),
compared with RNNs, which are not very effective at capturing long-range dependencies [16].
      </p>
      <p>
        GPTs, which are LLMs, are trained on large datasets and can perform a variety of natural language
processing tasks [15][18][17][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This has led to a rapid increase in the development of chatbots and
Generative AI (GenAI) tools across various industries and sectors, including healthcare [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ][14].
AI-powered tools can provide basic medical information, answer patient questions, schedule
appointments, translate between languages, summarize appointments with doctors, and
document medical records [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][13]. LLMs have been increasingly used in various applications such as
dementia care [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], dentistry [17], medical administrative tasks, improving patient support, speeding up drug
discovery, analyzing patient data, and customer service chatbots [14][15]. The ethical use of AI in healthcare
remains a critical challenge; however, the potential benefits for patients, medical professionals, and
caregivers are vast and yet to be fully realized [14][15].
      </p>
      <p>
        Sandmann et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] evaluated the accuracy of the LLMs GPT-3.5 and GPT-4 in providing initial
diagnoses, examination suggestions, and treatment recommendations for 110 medical cases across
diverse clinical disciplines, comparing their performance with Google search results. GPT-4
outperformed GPT-3.5 and Google in most tasks, especially diagnosis and patient-examination
tasks. The authors reported cases where LLMs failed to provide accurate diagnoses and reported
better performance of GPT-3.5 and GPT-4 on general diseases compared to rare ones, leaving room
for improvement. Their findings point towards the potential of LLMs in the clinical decision-making
process, provided further development improves their accuracy, particularly for rare diseases.
LLMs struggle in cases where less information is available, such as rare diseases, and can only
generate information that they have been trained on[14][15][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Ayoub et al. [13] explore the ability of ChatGPT to provide medical knowledge and recommendations
in the healthcare setting. They use Clinical Practice Guidelines (CPGs) as a reference for generating
patient-oriented questions about various medical conditions and assess the responses from both
ChatGPT and Google search through the Patient Education Materials Assessment Tool (PEMAT-P).
The findings show that ChatGPT performs better than Google search in providing general medical
knowledge and patient education with high scores for clarity and usability, while Google search results
outperformed ChatGPT in providing accurate medical recommendations [13]. Although ChatGPT
shows promise as a supplementary source of medical knowledge, it cannot fully replace professional
healthcare knowledge [14]. The authors highlight the importance of healthcare providers understanding
both the capabilities and limitations of AI tools like ChatGPT to optimize patient education.</p>
      <p>A similar and more recent study, closely related to our work, was conducted by Hristidis et al. [19],
who analyze ChatGPT and Google in answering dementia-related and other queries
pertaining to cognitive decline. The authors report that Google provided access to a vast array of
sources, including up-to-date medical literature, but often presented less contextualized information.
The findings presented also indicate that responses from Google were more diverse but often lacked
precision, whereas responses generated from ChatGPT were structured and coherent; however, these
were susceptible to inaccuracies due to potential biases and outdated training data. While ChatGPT
demonstrated a stronger understanding of the topic in question, it compromised accuracy and credibility
when compared with the standard clinical guidelines (SCG).</p>
      <p>Table 1 presents the list of technologies considered for this work, their date accessed, version, their
developer, the year developed, and their coverage statement.</p>
      <sec id="sec-2-1">
        <title>Coverage</title>
        <p>Google Search acts as a giant library for the internet. When a question is asked
or a term is searched on any topic, it scours websites, news articles, images, and videos
for information, digs through this massive collection, and presents the
most relevant results.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        This research sought to determine whether ChatGPT-3.5, Google Search, and the AI search engine ThinkAny
provide appropriate information on specific topics concerning medical management issues in dementia.
Our methodology starts with selecting three questions on dementia management taken from the
European Academy of Neurology guideline on medical management issues in dementia [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Three questions were selected as they encompass a diverse range of medical management scenarios
encountered while providing care to people with dementia (PwD). They are a representative sample of the
available questions and cover a range of topics relevant to dementia management; the first question is related to
systematic medical follow-up in dementia, whereas the second question is related to the management
of vascular risk factors in dementia, and the third question is related to the management of pain in
dementia. These three questions were queried across all aforementioned technologies (Step 1). At the
second stage of our methodology, we used the Likert scale (Step 2) to evaluate the accuracy, clarity,
relevance, simplicity, and usefulness of the responses provided by each technology. The systematic
assessment of responses obtained from each technology was based on a straightforward 5-point Likert
scale, for all three questions (details presented in section 3.3). Because Likert scale ratings
are ordinal data (the order of responses matters) and the intervals between values may
not be equal [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which could lead to misinterpretation of results, two separate non-parametric tests,
namely the Friedman test (Step 3) and the Wilcoxon signed-rank test (Step 4), were performed as presented
in sections 3.4.1 and 3.4.2 respectively.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Questions selected from the Guideline</title>
        <p>
          For our experiments, the following three questions were selected from the European Academy of
Neurology guideline on medical management issues in dementia [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]:
Q1: Should home-living (non-institutionalized) patients with dementia be offered systematic medical
follow-up in a memory clinic setting?
Q2: Does systematic management of vascular risk factors in patients with dementia slow the progression of
dementia?
Q3: Should behavioural symptoms in patients with dementia be treated with mild analgesics?
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Technologies used</title>
        <p>To compare the responses for clinical guideline questions on medical management issues in dementia,
we selected three state-of-the-art technologies: ChatGPT-3.5, Google search engine, and ThinkAny
(AI search engine) to assess the performance of each technology on the three questions (Step 1). The
rationale behind this selection is to cover three diverse systems: first, an LLM-based AI
system; second, a search engine; and third, an AI-based search engine. ChatGPT and ThinkAny are
trained using large-scale datasets, including medical information, clinical guidelines, and healthcare
databases, among other information, whereas Google’s vast search, combined with advanced algorithms
of natural language processing capabilities, is widely recognized and used across various domains.
ChatGPT, the current state-of-the-art in natural language processing and text generation, has been
trained on vast amounts of data and has already shown potential in understanding and generating
relevant responses. ThinkAny is an advanced and new-era AI search engine that can retrieve and
aggregate high-quality content and can efficiently answer user questions. Each technology has its own
unique strengths and capabilities, and our experiment offers valuable insights into their effectiveness in
supporting medical management issues in dementia.
3.2.1. Querying Google
The three clinical guideline questions were sent to Google’s search engine one at a time without any
modifications to ensure a fair comparison between the standard response from the guideline and the
search results. The first Google search link has the highest click-through rate of all search results, so the
first result was selected in our case [13]. The information on these websites was evaluated for relevant
responses to the question.
3.2.2. Querying ChatGPT-3.5 and ThinkAny
We used the original three clinical guideline questions directly from the guideline aiming to mirror
normal usage and ensure a fair comparison with the Google search engine. We ran each
question a single time in this research and did not revise the prompts afterwards; the
results were evaluated based on this first run. It was beyond the scope of the paper to examine the incremental improvement of
prompts or how more accurate information could be obtained with references. It is widely recognized that
prompt engineering or prompt refinement can increase the performance of LLMs significantly [20].
In our prompt strategy, we have not provided any general context or added any extra details to help
direct the technologies towards an answer.</p>
        <p>All prompts were systematically executed between 25 April 2024 and 12 May 2024 through the website
https://chatgpt.com/ and https://thinkany.ai/en-UK for ChatGPT-3.5 and ThinkAny, respectively.</p>
        <p>Detailed overview of the workflow is pictorially illustrated in Figure 1.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Likert scale</title>
        <p>
          The Likert scale is widely used to measure preferences or degrees of agreement with a statement or
set of statements. We used the Likert scale to evaluate the accuracy, clarity, relevance, simplicity,
and usefulness of the response provided by each technology. The systematic assessment of responses
obtained from each technology was based on a straightforward 5-point Likert scale, ranging from
1 (Strongly Disagree) to 5 (Strongly Agree), for all three questions [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. While this scale is used as a
standard assessment scale by different clinicians and physicians, it is still a subjective instrument.
        </p>
        <p>The responses gathered were evaluated by seven domain experts working in Dementia Research. Two
of them were mid-career researchers, three were established researchers, and two were entry-level researchers.
These experts rated the three questions on the Likert scale, and the final score was calculated as the
mean of the seven individual scores.</p>
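        <p>The scoring step described above can be sketched in Python (the language used for the paper’s analysis in section 3.4). The ratings below are hypothetical placeholders, not the study’s actual data:</p>

```python
import statistics

# Hypothetical Likert ratings (1-5) from the seven domain experts for one
# technology's response to one question; values are illustrative only.
expert_ratings = [5, 4, 4, 5, 3, 4, 4]

# Final score = mean of the seven individual expert scores.
final_score = statistics.mean(expert_ratings)
print(round(final_score, 2))  # → 4.14
```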
        <p>Figure 2 illustrates the distribution of Likert scale ratings assigned to the responses to the three clinical
guideline questions on dementia for Guideline, Google, ChatGPT-3.5, and ThinkAny. Each box in the
plot represents the inter-quartile range (IQR) of ratings for the specific resource. The median is marked
by the horizontal line inside each box plot. The flat line at level 5 for Guideline shows most reviewers
rated it highly for all three questions, whereas Google received the whole spread of values from 1 to
5, with most values in the 2 to 5 range. Most scores for ChatGPT and ThinkAny by reviewers ranged
between 4 to 5, with few outliers.</p>
        <p>Figure 3 displays the average Likert scale ratings for each question for each technology. It was
observed that Google’s ratings by reviewers for Q3 were the lowest received. The guideline
received the highest ratings, with an average of 4.5 across all three questions. ThinkAny received
better ratings compared to ChatGPT for the first two questions; however, ChatGPT’s response obtained
better scores for Q3 compared to ThinkAny.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Statistical Analysis</title>
        <p>
          To test if there is a significant difference in the responses provided by the clinical guideline, Google, and
the AI tools for each question, we performed statistical analyses to compare the responses from the three
technologies for the three questions. The Likert scale ratings are ordinal data (the order of response
matters), and the intervals between values may not be equal [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which could lead to misinterpretation
of results. To deal with the ordinal data and potential violations of the normality or
homogeneity-of-variance assumptions, two non-parametric tests (the Friedman test and the Wilcoxon signed-rank test) were
performed as presented in sections 3.4.1 and 3.4.2 respectively.
3.4.1. Friedman test
The Friedman test, a non-parametric analogue of repeated measures ANOVA, was conducted to evaluate
the differences in mean ratings of the three technologies across the three clinical guideline questions [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
This test is ideal for Likert scale data as we have multiple dependent groups (Clinical Guideline, Google,
ChatGPT-3.5, ThinkAny) and a single independent variable (dementia clinical guideline question). It
gives insight into whether there were statistically significant differences in the performance of each
technology across the three questions considered.
        </p>
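        <p>The Friedman test described above can be sketched with SciPy, the library the paper uses for its statistical analysis. The reviewer ratings here are hypothetical placeholders, not the study’s data:</p>

```python
from scipy.stats import friedmanchisquare

# Hypothetical Likert ratings from seven reviewers (one value per reviewer)
# for the four related samples: Guideline, Google, ChatGPT-3.5, ThinkAny.
guideline = [5, 5, 5, 5, 5, 4, 5]
google    = [2, 3, 2, 4, 3, 2, 3]
chatgpt   = [4, 4, 5, 4, 4, 5, 4]
thinkany  = [4, 5, 4, 4, 5, 4, 5]

# The Friedman test checks for any overall difference among the groups;
# it does not say which pair differs.
stat, p_value = friedmanchisquare(guideline, google, chatgpt, thinkany)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("At least one pair of systems differs; run post-hoc tests.")
```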
        <p>
          If the Friedman test reveals a significant difference (p-value &lt; 0.05), at least one pair of systems
(Guideline vs Google, Guideline vs ChatGPT-3.5, or Guideline vs ThinkAny) has statistically different
Likert scores, which the Friedman test does not pinpoint[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. To pinpoint the technology, post-hoc
tests were conducted, and since multiple comparisons were made for each question, we applied
Bonferroni correction to adjust the p-value threshold for significance in the post-hoc tests. This was
done to reduce Type 1 errors and account for the increased chance of false positives due to making
multiple comparisons.
        </p>
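        <p>The post-hoc procedure with Bonferroni correction can be sketched as follows, again using SciPy and hypothetical placeholder ratings rather than the study’s data:</p>

```python
from scipy.stats import wilcoxon

# Hypothetical paired Likert ratings from seven reviewers; illustrative only.
guideline = [5, 5, 5, 5, 5, 4, 5]
systems = {
    "Google":      [2, 3, 2, 4, 3, 2, 3],
    "ChatGPT-3.5": [4, 4, 5, 4, 4, 5, 4],
    "ThinkAny":    [4, 5, 4, 4, 5, 3, 5],
}

alpha = 0.05
adjusted_alpha = alpha / len(systems)  # Bonferroni: 0.05 / 3 ≈ 0.0167

# Post-hoc pairwise comparisons of each system against the guideline ratings.
for name, ratings in systems.items():
    stat, p = wilcoxon(guideline, ratings)
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"Guideline vs {name}: p = {p:.4f} ({verdict})")
```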
        <p>Table 2 presents the p-values obtained from the Friedman test of the three technologies for
the three clinical guideline questions on medical management issues in dementia. The three technologies,
Google, ChatGPT-3.5, and ThinkAny, are compared with the standard clinical guideline result for each
question.</p>
        <p>
3.4.2. Wilcoxon signed-rank test
The Wilcoxon signed-rank test was applied to compare the mean ratings between the clinical guideline and
each technology individually for each clinical guideline question. This non-parametric test was chosen
as it is suitable for paired data and is useful when analyzing differences between two technologies[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
By conducting this test for each question separately, we can identify the specific technologies
whose mean ratings differ significantly from the clinical guideline. This provides us with valuable
insights into the relative performance of each technology compared to the clinical guideline on a
question-by-question basis. The Mann-Whitney U test was also considered for comparison with the other
non-parametric tests. It compares the mean ratings between the clinical guideline and each technology
individually for each clinical guideline question. Through this test, we can assess whether there are statistically
significant differences in the mean ratings of each technology and the guideline[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. It will enable us to gather
valuable insights into the overall performance of each technology. The three non-parametric tests will
provide us with a comprehensive understanding of the effectiveness of these technologies in generating
responses to clinical guideline questions on medical management issues in dementia.
        </p>
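        <p>The Mann-Whitney U test for two independent rating samples can be sketched with SciPy as below; the samples are hypothetical placeholders, not the study’s data:</p>

```python
from scipy.stats import mannwhitneyu

# Hypothetical Likert rating samples treated as independent; illustrative only.
guideline_ratings = [5, 5, 5, 5, 5, 4, 5]
google_ratings    = [2, 3, 2, 4, 3, 2, 3]

# Two-sided Mann-Whitney U test for a difference between the two
# rating distributions.
u_stat, p_value = mannwhitneyu(guideline_ratings, google_ratings,
                               alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```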
        <p>All statistical analyses were conducted using Python 3.11.1 with the appropriate Python
library (SciPy), and the significance level was set to α = 0.05. The Python code to perform data
processing, statistical analysis, and data visualization has been provided via the GitHub link:
https://github.com/Sourav-rcsi/Clinical-Guidelines</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>
        To compare the responses obtained from three technologies: Google, ChatGPT-3.5, and ThinkAny with
the standard clinical guideline, we first conducted a single comparison of one test per question. We
used the Friedman test to check if there was any overall difference [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] in Likert ratings and if there was
a statistically significant difference in the overall distribution of Likert scores across the technologies
for each question. Then we conducted separate statistical tests for each question so we could analyze
the performance of each source independently and draw meaningful conclusions. We wanted to test if
there is a significant difference in the responses provided by the standard clinical guideline, Google,
ChatGPT-3.5, and ThinkAny for each question.
      </p>
      <p>
        We conduct the non-parametric Wilcoxon signed-rank test[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and the Mann-Whitney U[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] test for
independent samples on each individual question with the overall significance level (α=0.05). Since we
are conducting multiple tests, we applied the Bonferroni correction and adjusted the significance level
by dividing the original significance level (α=0.05) by the number of comparisons being made, in this
case 3, which gives an adjusted significance level of α=0.0167 for the two tests.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Statistical Tests</title>
        <p>4.1.1. Q1 Results
For Q1, the Friedman test resulted in a p-value of 0.051, close to the original alpha value but not significant,
indicating no statistical evidence of a significant difference in the mean responses
among the three technologies. However, the borderline p-value indicates that the difference
could be worth examining further. Post-hoc analysis using the Wilcoxon signed-rank test compared
the clinical guideline individually with each technology: Google, ChatGPT-3.5, and ThinkAny.
A statistically significant difference was observed in the comparison between the clinical guideline
and the Google search results, which yielded a p-value of 0.0107, significant even after Bonferroni
correction (adjusted alpha=0.0167), indicating that Google’s response differed significantly from the
clinical guideline. The comparisons between the clinical guideline and ChatGPT-3.5, and between the clinical
guideline and ThinkAny, were initially significant but failed to remain so after Bonferroni correction. The
Mann-Whitney U test supports these findings, indicating significant differences before Bonferroni correction
between the clinical guideline and each technology except ThinkAny (p=0.0908).
4.1.2. Q2 Results
For Q2, the Friedman test shows statistically significant differences among the responses of the three
technologies (p-value = 0.0018), indicating a difference in performance across the technologies. The Wilcoxon
signed-rank test results showed that all the technologies differed significantly from the clinical guideline,
with all p-values falling below the Bonferroni-adjusted significance level. The Mann-Whitney U test
confirmed these differences with highly significant results, indicating differences in median scores
and their distributions. Pairwise comparison using the Wilcoxon signed-rank test and the Mann-Whitney
U test showed significant differences between the clinical guideline and the Google search results, with
p=0.0062 and p=0.0015 respectively, indicating that Google’s response deviated significantly from the
clinical guideline. The comparisons between the clinical guideline and ChatGPT-3.5, as well as between the
clinical guideline and ThinkAny, were initially significant but failed to remain so after Bonferroni correction.
4.1.3. Q3 Results
For Q3, the Friedman test indicated the most substantial differences among the responses of the three
technologies (p-value = 0.0004), indicating significant differences in performance across the technologies.
Based on the Wilcoxon signed-rank test, all technologies differed significantly from the clinical guideline,
and their p-values were exceptionally low, indicating strong evidence against the null hypothesis of
no difference. The Mann-Whitney U test had similar findings, indicating significant differences in
median scores and their distributions between the clinical guideline and each technology. There were
significant differences between the clinical guideline and the Google search results, with p=0.00097 and
p=0.000083 for the two tests, indicating that Google’s response deviated significantly from the
clinical guideline. Google’s performance was the worst (smallest p-value) among all the technologies
across both tests for this question.</p>
        <p>Table 4 presents the p-values obtained from the Mann-Whitney U test of the three technologies for the
three clinical guideline questions on medical management issues in dementia. The three technologies
Google, ChatGPT-3.5, and ThinkAny are compared with the standard clinical guideline result for each
question.</p>
        <sec id="sec-4-1-1">
          <title>Summary of findings</title>
          <p>Overall, the findings indicate that while there are significant differences in how each technology fares
against the clinical guideline across the various questions, the extent of these differences varies. The
Friedman tests showed varying degrees of performance among the three technologies across the three
clinical guideline questions. The Wilcoxon signed-rank test and Mann-Whitney U test showed significant
differences between Google and the clinical guideline answer for all three questions, indicating that Google's
responses diverged significantly from the clinical guideline answer. Google showed the largest deviation
from the clinical guideline, especially in Q3. The AI chatbot ChatGPT-3.5 and the AI search engine ThinkAny
also showed significant differences but had performance metrics closer to the clinical guideline,
suggesting they might be more reliable in contexts where adherence to clinical guidelines is critical.
However, for Q2 and Q3, they showed statistically significant differences across the tests, suggesting
that while these resources can provide useful information, they lack the accuracy, credibility, and
depth provided by specialized clinical guidelines. These findings highlight the importance of resource
selection in information gathering and decision-making processes, especially in situations where
accurate information is essential.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. AI tools</title>
        <p>Figure 4 and Figure 5 illustrate the comparative analysis of the Wilcoxon signed-rank test and Mann-Whitney
U test for the three clinical guideline questions across the technologies: Guideline, Google, ChatGPT-3.5,
and ThinkAny.
4.2.1. ChatGPT-3.5 response to Q1
For Q1, ChatGPT exhibits a thorough understanding of the question’s context and recommendations
even though it was not presented with a context before asking the question. The response presented
uses simple language and avoids any complex medical jargon, is coherent, well-structured, and provides
a summary at the end with clear explanations of the benefits associated with memory clinic settings
for dementia care. It also provides a detailed analysis of the advantages of memory clinic settings like
specialized care, care coordination, education, support and environment. The response recognizes
the importance of personalized treatment and a multidisciplinary team in memory clinics, which shows
the ability of ChatGPT to understand the need for tailored plans and recognize the importance of
collaboration among various healthcare professionals. The response also discusses the environment of
memory clinics and telemedicine, indicating the importance of accessible healthcare for people with
mobility issues or geographical constraints. All these factors show ChatGPT's comprehensive
understanding of dementia care; however, the references or sources of the obtained information
were not mentioned in its response.
4.2.2. ChatGPT-3.5 response to Q2
For Q2, ChatGPT demonstrates an understanding of the topic. The response discusses the significance of
addressing vascular risk factors, particularly in the case of vascular dementia, and how managing risk
factors such as hypertension, diabetes, high cholesterol, obesity, smoking, and lack of physical activity
can help mitigate cognitive decline and promote brain health. It shows that ChatGPT understands the
context of the question and provides a relevant response. ChatGPT acknowledges that dementia is
a complex condition influenced by multiple factors and advocates for a comprehensive care strategy
including medical management, cognitive stimulation, social engagement and support for caregivers.
The response is accurate, clear, coherent, relevant and useful to the question; however, again the
references or sources of the obtained information were not mentioned in its response.
4.2.3. ChatGPT-3.5 response to Q3
For Q3, ChatGPT generates a comprehensive but cautious response compared
to the other responses. The response emphasizes the assessment of pain, identification of underlying
causes, non-pharmacological interventions, risk-benefit assessment, monitoring and reassessment,
and interdisciplinary collaboration. It highlights the importance of minimizing medication use while
prioritizing individualized care and ongoing monitoring to optimize safety and effectiveness. This
reflects a good understanding of the complexities involved in managing behavioural symptoms in
dementia and the need for a patient-centered approach to treatment. Again, the references or sources
of the obtained information were not mentioned in its response.
4.2.4. ThinkAny response to Q1
For Q1, ThinkAny provides an accurate, clear, coherent, relevant, well-structured and useful response. It
provides a direct answer, in comparison to ChatGPT, which provides a slightly indirect answer. It provides
clear and direct arguments like ChatGPT, but they are supported by credible and trustworthy references;
however, some of the references mentioned in its response are out-of-date and not based on current
dementia research. The response highlights the progressive nature of dementia and the importance of
addressing the complex medical needs of patients through regular follow-ups. It emphasizes the role of
memory clinic settings in monitoring patient conditions, adjusting treatments and providing support to
both patients and caregivers. The response acknowledges the significance of community-based services
and care coordination in enabling patients to remain safely at home. ThinkAny's response effectively
addresses the importance of systematic follow-up in memory clinic settings for dementia care.
4.2.5. ThinkAny response to Q2
For Q2, ThinkAny effectively integrates information from various credible and trustworthy sources
and provides references to them. It provides a direct answer, in comparison to ChatGPT, which
provides a slightly indirect answer. The response directly presents a logical argument which is accurate,
clear, coherent, well-structured, relevant and useful. It acknowledges the absence of a cure for dementia
and presents the available interventions to manage symptoms and improve quality of life. The response
presents a well-supported assessment suggesting that systematic management of vascular risk factors
has the potential to slow the progression of dementia.
4.2.6. ThinkAny response to Q3
For Q3, ThinkAny presents a balanced, accurate, clear, coherent, relevant and useful response; however,
it does not take a cautious approach like ChatGPT. It provides a direct answer, in comparison to ChatGPT,
which provides a slightly indirect answer. It again acknowledges the absence of a cure for dementia and
uses a credible source to highlight the availability of interventions to manage the symptoms and improve
quality of life, suggesting mild analgesics could potentially be used for this purpose. It mentions the use
of mild analgesics as a recommended approach in caregiver strategies for addressing behavioral issues
in dementia patients. It also emphasizes that careful evaluation and monitoring are needed by healthcare
professionals, considering factors like the patient's overall condition, other medications and potential
side effects. ThinkAny's response provides a well-rounded assessment of the use of mild analgesics for
treating behavioral symptoms in dementia.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations</title>
      <p>There are limitations to this work. The assessment relies on subjective Likert scale ratings, which may
introduce bias and variability in the results. It focuses on specific clinical guideline questions and may
overlook broader aspects of dementia management. Only the first Google search link was used
for assessment, and the first link does not represent the content of every link. Google search results vary
based on location, search history and other criteria, so each search may return a different first
link. The LLMs are constantly updated by their providers and are trained on current data; this can
lead to different results if the questions are re-entered, which increases the variability in the responses
received and limits the reproducibility of the exact performance results. LLMs are trained on large
datasets including medical and non-medical content, which may not fully capture the medical
context and guidelines, leading to gaps in their knowledge. The inner workings of GPTs often lack
transparency, making it difficult to assess the reliability and accuracy of their responses to medical
clinical guideline questions.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>AI-powered chatbots and search engines can serve as an important resource in addressing medical
management issues in dementia patients and meet the needs of healthcare professionals, patients and
caregivers (formal and informal). Evaluation factors beyond accuracy and relevance, such as those that
capture the evolving nature of search and retrieval algorithms, are needed. The AI chatbot ChatGPT-3.5
and the AI search engine ThinkAny showed significant differences but have performance metrics closer
to the clinical guideline than Google, suggesting that while these resources can provide useful information,
they lack the accuracy, credibility and depth provided by specialized clinical guidelines. However, a
comparison of the AI tools with the Google search engine suggests that the AI tools offer more
contextualized and personalized responses, which might be more reliable in contexts where general,
but personalized, information is required. Our paper finds that researchers, medical professionals
(such as physicians and nurses), and formal and especially informal caregivers can benefit significantly
from leveraging AI chatbots and AI search engines as companion tools to generate immediate answers
to their day-to-day problems. This paper serves as a stepping stone towards understanding the role of
AI in answering guideline questions on medical management issues in dementia. It paves the way for
future research aimed at harnessing the full potential of AI tools and search engines in addressing the
complex challenges posed by dementia and other medical conditions.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
  </back>
</article>