<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Explainable Classification System for Hip Fractures: A Hybrid CBR+LLM Surrogate Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Enrique Queipo-de-Llano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marius Ciurcau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Paz-Olalla</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Belén Díaz-Agudo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan A. Recio-Garcia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Software Engineering and Artificial Intelligence, Universidad Complutense de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto de Tecnologías del Conocimiento, Universidad Complutense de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>65</fpage>
      <lpage>80</lpage>
      <abstract>
        <p>Hip fractures, mainly displaced femoral neck fractures, pose a significant diagnostic challenge due to their elusive nature on conventional radiographs. These fractures often evade detection in initial assessments, underscoring the critical necessity for refined classification systems. This paper presents a neural network model that can classify X-ray hip fractures accurately and focuses on the need for explanations to increase trust and interpretability. As a first contribution, we evaluate the suitability of several model-specific explanation methods for this concrete classification task. To enhance the limited explainability of these model-specific methods, we also present a surrogate model applying Case-Based Reasoning (CBR) to perform explanations by examples enriched with textual descriptions. This CBR surrogate uses a clustered case base with prototypical cases that include textual medical reports of the fractures. The last contribution of the paper is the integration of LLMs into the reuse stage of the CBR cycle to adapt and personalize these textual reports to the query image and target user.</p>
      </abstract>
      <kwd-group>
        <kwd>Hip fracture classification</kwd>
        <kwd>eXplainable AI</kwd>
        <kwd>Case-Based Reasoning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Interpretability and trust have become a requirement for black box models applied to real-world tasks
like diagnosis or decision-making processes. In the medical domain, interpretability ensures that medical
practitioners can validate and explain the reasoning behind AI-driven diagnoses. This transparency not
only enhances the credibility and acceptance of AI technologies within the medical community but also
facilitates collaboration between human experts and machine learning systems.</p>
      <p>Our research aims to enrich AI-driven diagnoses with different types of explanations. Firstly, we
have explored the applicability of several model-specific explanation methods, such as Grad-CAM, that
highlight crucial regions of an image that influence the network’s prediction based on the gradients of
the last convolutional layer.</p>
      <p>
        However, this approach has limited explainability, and therefore, we propose a surrogate system
based on CBR [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] to provide explanations by example. The explanation cases also contain a textual
component that associates medical reports with the most significant (or prototypical) examples. Then,
as a novel contribution to the XCBR field, we integrate large language models (LLMs) into the reuse
stage of the XCBR system to adapt these textual reports to the target user and the X-ray image being
explained. To the best of our knowledge, this approach is novel, and it presents both a challenge and
an opportunity for synergies between CBR and LLMs in the eXplainable AI (XAI) landscape. The
proposed approach could prove highly beneficial for providing textual explanations in domains with
small sets of training data (where generative approaches are not possible). Besides, a CBR methodology
is useful for capturing new textual explanations from experts and even for the training process of future
experts in similar domains.
      </p>
      <p>Section 2 describes related work on the hip fracture detection problem and its interpretability.
Section 3 describes our fracture classification system, which uses a deep learning approach trained
from a heterogeneous dataset of X-ray images. As it is a black-box system, we describe gradient-based
explanations in Section 4. In Section 5, we define a CBR system to generate natural language
explanations without an extensive corpus of textual data for training. The system first retrieves a
prototypical explanation that is extended using a generative LLM. Finally, Section 6 concludes the paper
and discusses open lines of future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The classification of hip fractures is a critical aspect of orthopedic diagnosis, guiding treatment decisions
and patient care. With the advent of artificial intelligence (AI), there has been a surge in research aimed
at automating and improving the accuracy of hip fracture classification. In this section, we review
relevant literature, including the pioneering work of Krogue et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Tanzi et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], along with
other significant contributions in the field.
      </p>
      <p>
        Justin Krogue et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] conducted groundbreaking research aimed at automating the identification
and classification of hip fractures using deep learning techniques. Their study involved the analysis of
hip and pelvic radiographs from a substantial dataset of 3026 hips gathered over 20 years across
a total of 1118 different studies. Through meticulous labeling and classification, they trained a deep
learning–based (DenseNet169) object detection model to automatically identify and classify hip fractures
into various categories, including displaced and nondisplaced femoral neck fractures, intertrochanteric
fractures, as well as previous surgical interventions such as open reduction and internal fixation or
arthroplasty.
      </p>
      <p>The results of Krogue et al.’s research were highly promising, with the developed model achieving
remarkable accuracy rates. The binary accuracy for detecting a fracture reached 93.7%, with a sensitivity
of 93.2% and a specificity of 94.2%. Moreover, the multiclass classification accuracy was reported at
90.8%. Remarkably, the model’s performance was comparable to or even surpassed that of human
experts, including fellowship-trained radiologists and orthopedists, under various conditions. As for
explainability, point-based gradient explanations were applied to the model. It is noted that the model
appears to pay attention to cortical outlines to make its classification, while the lucent fracture line
appears to receive very little attention.</p>
      <p>Furthermore, Krogue and his team demonstrated that the use of the deep learning model significantly
improved human performance in hip fracture classification. When aided by the model’s predictions,
resident physicians achieved classification accuracies approaching those of unaided fellowship-trained
attending physicians.</p>
      <p>
        The work of Tanzi et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] represents a significant advancement in femur fracture classification,
particularly in sub-fracture classification. The study utilized a ViT-based architecture to classify femur
fractures with remarkable accuracy, substantially improving baseline methods such as InceptionV3 and
a hierarchical network. Through extensive experimentation and evaluation, Tanzi et al. demonstrated
that ViT outperformed existing approaches and provided valuable insights into fracture localization
and feature extraction. The attention maps generated by ViT revealed the network’s focus on specific
anatomical regions, enhancing interpretability and clinical relevance. Moreover, the study highlighted
the synergistic effect of combining ViT predictions with specialist expertise as a Computer-Aided
Diagnosis (CAD) system, resulting in significantly improved diagnostic accuracy compared to either
approach alone. Importantly, Tanzi et al.’s work emphasizes the clinical significance of accurate
subfracture classification, addressing a critical need in medical diagnostics and attaining satisfactory results
on the matter. This pioneering use of ViT in femur fracture classification sets a new standard for future
research in this domain, with potential applications extending to more complex levels of sub-fractures
in the AO/OTA classification system.
      </p>
      <p>
        The work conducted by Twinprai et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] shows the performance of a YOLO-v4-Tiny model trained
with a carefully elaborated dataset of 900 training images. A bounding box was manually drawn onto the
images in a rectangular shape with an annotation tool. The article shows the performance comparison
between the model and several human doctors of different expertise.
      </p>
      <p>
        Previous successful hip fracture detection studies include the models of Cheng [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which used
DenseNet-121 for a sensitivity of 98% and an accuracy of 91% and also implemented Grad-CAM
explanations. Lee [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] successfully used the meta-learning deep neural network GoogLeNet (Inception
v3) to classify femoral fractures in pelvic radiographs. Adams [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] compared AlexNet and GoogLeNet
for femoral neck fracture detection with an accuracy of 88.1% and 89.4%, respectively.
      </p>
      <p>Regardless of the quality of diagnosis, trust and acceptance are vital to the adoption of medical AI
systems, and these depend on the ability to explain the system’s decisions. Our focus lies in pushing the
boundaries of application by integrating a CBR model to generate diagnostic insights from radiographic
images.</p>
      <p>
        The work conducted by Kim et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] resonates with our work. Their approach is to use prototype
representations and clustering in a dataset to perform joint inference on cluster labels, prototypes
and important features. Similarly, the studies by Schank and Leake [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Li et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] offer
valuable insights into the use of case-based reasoning (CBR) to generate explanations by retrieving and
adapting prior explanations in memory. We believe their work highlights the importance of leveraging
expert knowledge and experience to enhance the interpretability of AI systems in the field of medical
diagnostics.
      </p>
      <p>
        Our approach proposes ANN-CBR Twins [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] where a deep-learning, black-box classifier is twinned
with CBR as an interpretable model. The successful demonstrations of ANN-CBR Twins in providing
post-hoc explanations by example, enriched with the generation of textual explanations personalized
according to the user's context and expertise, make this a valuable approach for explainability. As
outlined in Ford and Keane [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], very few eXplainable AI studies consider how users' understanding of
explanations might change depending on whether they differ in their context or expertise. Expertise
is a critical facet of human decision-making in the health domain, and understanding differs between a
trainee doctor, an experienced consultant, and a patient.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Hip fracture classification system</title>
      <p>Our hip fracture classification system uses a deep learning approach trained from a heterogeneous
dataset of X-ray images. Next, we describe this dataset, its preprocessing, and the resulting classification
model.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>
          Our starting point is the AO/OTA1 classification for bone fractures [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Following their scheme, hip
fractures are labeled as: "A" for fractures in the trochanteric region, "B" for femoral neck fractures, and
"C" for femoral head fractures. Figure 1 illustrates such classification.
        </p>
        <p>
          Our primary goal was to train a classification model for hip fractures. To do this, we needed to
acquire correctly labeled data in the form of hip X-rays. Thanks to the Virgen de la Victoria Hospital
and Müller Foundation Spain, we have collected this data from three different sources:
• The YOLO Roboflow project available online at [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. This dataset consisted of both complete hip
X-rays (Bilateral) and some single hip (Unilateral) X-rays. We were able to find a total of 640
X-ray images.
• A private research collection of 246 hip fractures provided by the Virgen de la Victoria Hospital
(Málaga, Spain).
• X-ray fractures provided by the Maurice E. Müller Foundation. We were given a total of 132
images labeled according to the aforementioned classification.
        </p>
        <p>No personal data was collected; all the data was completely anonymized when received.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Preprocessing</title>
        <p>Since our collection of X-rays came from three different sources, we had to deal with the problem of
obtaining a uniform and homogeneous dataset with which to train our model. To achieve this, we
performed various methods of preprocessing on the images. The data from the Müller foundation
was the most challenging to preprocess as a portion of the data (n=62) had an inverted color palette,
meaning the image displayed a black bone on a white background. In contrast, the rest of our images
had a black background with white bones. Furthermore, most of these images came with two X-rays:
one showing the bone fracture and another showing the prosthesis or implant used to treat the patient.
This is shown in Figure 2.</p>
        <p>An additional problem with the collected images is the cropping of the femur area. As Figure 2 (right)
illustrates, most of them were general X-ray images where the femur area was only a small portion of
the whole image. Therefore, we required a method to crop the hip area.</p>
        <p>For this task, we obtained the femur crops using a custom YOLOv5 model trained on our complete
dataset. Ultimately, we implemented Python scripts to flip right femurs into left femurs, unify color
scale into gray, and resize images to 224x224 pixels, which are the dimensions required for the input
channels of our residual neural network.</p>
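These homogenization steps can be sketched as a minimal NumPy routine. This is a simplified stand-in for the actual scripts: the YOLOv5 cropping is omitted, and the nearest-neighbor resize and min-max gray-scale normalization are illustrative choices, not necessarily the exact methods used.

```python
import numpy as np

def homogenize(img, is_right_femur, size=224):
    """Sketch: mirror right femurs so all look like left femurs,
    unify the gray scale, and resize to size x size."""
    if is_right_femur:
        img = img[:, ::-1]                          # horizontal flip
    img = img.astype(np.float32)
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)  # gray scale to [0, 1]
    h, w = img.shape
    rows = np.arange(size) * h // size              # crude nearest-neighbor resize
    cols = np.arange(size) * w // size
    return img[np.ix_(rows, cols)]

xray = np.random.rand(512, 400)                     # dummy X-ray crop
out = homogenize(xray, is_right_femur=True)
print(out.shape)  # (224, 224)
```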
        <p>A diagram representing the obtention of the whole dataset can be found in Figure 3. As shown in the
diagram, we finally obtained 891 images corresponding to unbroken hips, 212 to femoral neck fractures,
and 294 to hips presenting a fracture in the trochanteric region.</p>
        <p>The collection of such a challenging dataset led us to take the following additional preprocessing
steps:
• Given that the data is generally only labeled as type A, type B, or type C, it was not possible for
us to train a classifier for the given subtypes of A, B, or C. Even if we had the complete labels, we
did not have enough data to implement a model able to predict the subtypes, so we decided to
focus on just the general type, disregarding the subtypes. Moreover, type C fractures are unusual,
which results in a three-class model, predicting no fracture, type A fractures, or type B fractures.
Going beyond this limitation has no easy solution, as it involves experts manually labeling each
X-ray into more fine-grained types and gathering more data.
• Looking at the data, it is obvious we have an imbalanced class problem: for types 1 and 2, we have
a similar amount of data, but for type 0 (unbroken), we have about three times as much data. To
solve this, we used the Albumentations library, which has been designed for image augmentations,
primarily focused on deep learning tasks. It offers a wide range of transformations, from which we
implemented uniform angle rotations between 0 and 15 degrees and random gamma, adjusting
the global brightness of the image. These transformations were applied with a probability of
50% and 10%, respectively. We perform three augmentations for classes 1 and 2, and this way,
we finally obtain a balanced dataset of about 2600 images ready to train our model, with 891
images for class 0 and about 800-900 for each of classes 1 and 2.</p>
        <p>1 AO/Orthopaedic Trauma Association. AO stands for Arbeitsgemeinschaft für Osteosynthesefragen, or the Association for the Study of Internal Fixation (ASIF), established in 1958 by a group of Swiss general and orthopedic surgeons striving to transform the contemporary treatment of fractures in Switzerland.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Classification model</title>
        <p>We performed tests on different architectures of neural networks, finally opting for a Residual Network
with 18 convolutional layers (Resnet18). This model has an 18-layer structure that incorporates residual
blocks with skip connections, addressing the vanishing gradient problem and enabling efective training
of deeper networks. Pre-trained weights available for ResNet18 further enhance its utility through
transfer learning, allowing fine-tuning on medical datasets with reduced labeled data requirements. By
automatically extracting relevant features from medical images, ResNet18 proves adept at capturing
bone structures, edges, and textures crucial for fracture identification. We have trained this model for
10 epochs, implementing cross-validation to avoid over-fitting. We split the dataset into 80% to train
and 20% to test, and we obtained satisfactory results as reported in Table 1 and the confusion matrix
shown in Figure 4. Although the score of the model is very high, we have ruled out the possibility of
over-fitting for two main reasons: firstly, the results on the test set are highly satisfactory, and
secondly, helped by experts of the University Hospital Virgen de la Victoria, we have noticed that the
explanations generated by Grad-CAM, and detailed in the following section, highlight regions close to
the location of the fracture.</p>
        <p>[Table 1 reports per-class precision, recall, F1-score, and support for the No Fracture, Femoral Neck, and Trochanteric classes, together with overall accuracy and the macro and weighted averages.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Explainability Model</title>
      <p>As a classifier, the model offers no means of interpretability, a fundamental cornerstone in medical
systems. One common approach to explainability when working with image data is applying Gradient-weighted
Class Activation Mapping (Grad-CAM) or Integrated Gradients, and this was our first focus.</p>
      <p>
        Grad-CAM analyzes gradients in the last convolutional layer to decode the importance of each
feature map for a specific class. It generates a heat map, highlighting crucial regions of an image that
influence the network’s prediction, providing interpretability without compromising accuracy. By
visualizing how models arrive at predictions, we gain insights, debug models, and enhance performance.
Several heat-map explanation methods from the libraries Xplique 2 and pytorch_grad_cam 3 have been
implemented and compared. The ones from Xplique were discarded because, under the criteria of
experts at Virgen de la Victoria Hospital, the results were not satisfactory, as those explanations did not
seem to highlight the regions where fractures were visible. To choose between Grad-CAM, Grad-CAM++,
Eigen-CAM, and Ablation-CAM, the ROAD (RemOve And Debias) score was implemented. Fundamentally,
this metric scores a heat map by obscuring its most important regions and performing a
new prediction to understand how hiding such regions affects the output. This way, if hiding the red
zone of a heat map changes the model's prediction, we can affirm this region is relevant to
the model, so that explanation will receive a high score. More information about this metric can be
found in Rong et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In Figure 5, a batch of different explanations and scores is shown for some
X-rays. A thorough revision was conducted by an expert in dedicated meetings, where they were
shown fractures and discussed which of the explanations helped most. According to the expert,
Grad-CAM seems to be the most helpful explanation method, so we finally decided to use this one.
We also tested Random-CAM, which consists of generating a random heat map, to verify that the ROAD
metric works. The information conveyed by gradient-based explanations is what part of the image our
model focuses on to generate the prediction. However, this information alone is of limited help when
providing a diagnosis from an X-ray. Other authors have also outlined this issue [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Textual explanations can
be more beneficial than integrated gradients in X-ray images due to their ability to provide contextual
understanding and educational value, and to address subjectivity and uncertainty. While integrated
gradients may highlight important regions in an image, textual explanations offer a comprehensive
understanding by describing anatomical structures, abnormalities, differential diagnoses, and treatment
options, while acknowledging limitations. Furthermore, textual explanations offer a level of customization
that can be tailored to the expertise level of the explainee. They can provide detailed insights for experts
in the field, while also offering simplified explanations for those with less experience in radiology
interpretation. This personalized approach ensures that the information provided is accessible and
relevant to the individual's level of knowledge, further enhancing comprehension and aiding in effective
decision-making.
      </p>
      <p>Here, LLMs appear as a direct and effective solution. However, current state-of-the-art generative
models, such as GPTs, are of no use in this case since they have not been trained with medical images,
and fine-tuning requires a large corpus of textual explanations associated with X-ray images. As the
manual collection of this corpus is unviable, we proposed a hybrid solution using CBR, as presented
next.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Hybrid CBR-LLM explainer</title>
      <p>Our research faces a pivotal question here: besides the gradient-based explanations we have so far, can
we effectively generate natural language explanations for this AI-driven fracture classification without an
extensive corpus of textual data for training?</p>
      <p>As we only have a limited number of textual explanations associated with prototypical fractures,
CBR arises as a suitable approach: given a new X-ray image, it retrieves the most similar prototypical
case and reuses its textual explanation. The adoption of the CBR paradigm has the additional benefit
of enabling the use of the most similar images as explanation examples. This way, our explanation
methodology is three-fold: attribution-based (Grad-CAM), textual (LLMs), and explanation-by-example.
All these explanation modalities are presented to the user through the graphical interface presented in
Figure 7.</p>
      <p>Therefore, we propose a twin surrogate architecture to explain the classification model. Figure 6
depicts the two parallel pipelines. The first (top) pipeline performs the classification process using
the Resnet model after preprocessing the images with YOLOv5 (see Section 3), and provides a visual
explanation using Grad-CAM (see Section 4). In the second pipeline, we use a CBR system to
retrieve prototypical explanations followed by an LLM that adapts the resulting text to generate textual
explanations. These processes are described in the next sections.</p>
      <sec id="sec-5-1">
        <title>5.1. Case Base</title>
        <p>The case base has been elicited by an expert who provided highly representative images and the
corresponding medical reports, which we will also call prototypical cases. These will be the seed cases
of our case base, defined as C = ⟨I, L, T⟩, where I is the X-ray image, L the classification given by
the ResNet18 model, and T is the associated textual report.</p>
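A minimal sketch of this case structure (the field names and the example values are our illustrative choices, not from the actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Case:
    """A case <I, L, T>: the X-ray image, the classification given by
    the ResNet18 model, and the associated textual report."""
    image_path: str   # I: the X-ray image (here, its path)
    label: int        # L: 0 = no fracture, 1 = femoral neck, 2 = trochanteric
    report: str       # T: the expert's medical report

# Hypothetical seed case for illustration.
seed = Case(
    image_path="prototypes/transcervical.png",
    label=1,
    report="Transcervical femoral neck fracture: the fracture traverses "
           "the femoral neck from side to side.",
)
print(seed.label)  # 1
```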
        <p>Initially, we have about ten seed cases that have been included with the help of experts: one image of
an unfractured femur, three images of subtypes of femoral neck fractures, and two images of subtypes
of trochanteric fractures with their corresponding texts. Each of these texts is a definition for the given
subtype of fracture. We also included three general explanations for femoral neck fractures and two
for trochanteric fractures. Figure 8 shows an example of a prototypical case. In this text, the part
corresponding to a general description of a femoral neck fracture is highlighted in blue, while below, in
gray, appears the text explaining the subtype of fracture that, according to the expert, is recognized in
the prototypical X-ray shown on the right.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Retrieval</title>
        <p>Due to the limited number of seed cases, we propose a retrieval strategy using a clustered organization
of the case base. Having limited textual explanations, we aim to spread such texts to a wider range of
images, expanding our base cases. We implement an algorithm to cluster images within the same type
of fracture, keeping at least one prototypical case in each resulting cluster.</p>
        <p>The idea is to group around the prototypical cases, which would serve as "centroids", providing
their explanation to the associated cases. Initially, our idea was to implement a K-Means algorithm,
but unfortunately, this would move the centroids, meaning that prototypical cases could end up in a
different cluster and some clusters might not get a prototypical case at all.</p>
        <p>The following is the report text of the prototypical case shown in Figure 8. Femoral neck: this area
is located at the top of the femur, just below the spherical part (the head of the femur) of the
ball-and-socket joint. Fractures of the femoral neck are divided into basicervical or transcervical
depending on the distance to the femoral head: basicervical fractures occur at the junction of
the head and the neck, and transcervical fractures occur in the neck itself. Since the blood supply
to the femoral head passes through the neck, this type of fracture can cause a complication because
the break often blocks the passage of blood to the femoral head, leading to avascular necrosis, which is
equivalent to a heart attack but in bone. These fractures can occur in young people in high-energy
accidents but are common among older individuals who fall from standing height due to osteoporosis.</p>
        <p>A transcervical femoral neck fracture is another specific type of fracture that occurs in the
femur bone, in the region of the femoral neck. In this case, however, the fracture traverses
the femoral neck from side to side, completely dividing it. It can have serious consequences due
to the disruption of blood flow to the femoral head.</p>
        <p>
          The clustering process is as follows. To respect the label restriction, every image of a given
label is compared only with the prototypical cases of that label, associating it with the one to which it
is most similar, based on the Structural Similarity Index Measure (SSIM) metric [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. This allows for a
separation of the images into as many subgroups as seed cases we originally had. Although each cluster does
represent a certain type of subfracture, this should be taken as an attempt at approximate grouping
with restrictions; it should not be taken as an attempt to predict types of subfractures, as searching
by SSIM is not reliable enough for this task. In this sense, even though we talk about clusters, we are
not implementing an unsupervised algorithm, and the word cluster should be read as group.
        </p>
        <p>In this stage, we can retrieve the prototype case for the cluster where the query image is classified.
Firstly, the query goes through our Neural Network classifier, providing a predicted label for it. Secondly,
once the label is obtained, the image is compared to all the cases in that same label. As a result, we
retrieve the most similar image, and, since it already belongs to a cluster, we can provide its explanation
for our query.</p>
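The two-step retrieval above can be sketched as follows. The case base, field names, and similarity function are toy illustrations: inverse mean absolute difference stands in for SSIM, which would be used in practice.

```python
import numpy as np

def retrieve(query, predicted_label, case_base, similarity):
    """Restrict the search to cases sharing the classifier's predicted
    label, then return the most similar one (whose cluster's prototypical
    explanation is reused for the query)."""
    candidates = [c for c in case_base if c["label"] == predicted_label]
    return max(candidates, key=lambda c: similarity(query, c["image"]))

# Stand-in similarity: higher is more similar (SSIM in the real system).
sim = lambda a, b: -float(np.abs(a - b).mean())

case_base = [
    {"label": 1, "image": np.zeros((8, 8)), "report": "basicervical prototype"},
    {"label": 1, "image": np.ones((8, 8)), "report": "transcervical prototype"},
    {"label": 2, "image": np.ones((8, 8)), "report": "trochanteric prototype"},
]
query = np.full((8, 8), 0.9)
best = retrieve(query, predicted_label=1, case_base=case_base, similarity=sim)
print(best["report"])  # transcervical prototype
```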
        <p>In Figure 9, the different clusters generated for each of the classes are shown, along with the image
representing the centroid of each one. Each cluster is built around a centroid that represents a certain
subtype of fracture: blue for subtypes of femoral neck fractures and red for fractures in the
trochanteric region. As mentioned in this section, the idea is that images sharing a subtype within the
same class will exhibit some similarity among them and, therefore, should belong to the same cluster or
be close to each other. This way, we manage to propagate the explanations we have to new images.</p>
        <p>As we have mentioned, for comparing images, we use the metric known as the Structural Similarity
Index Measure (SSIM), which ranges from -1 (completely different images) to 1 (identical
images). This measure is particularly interesting for our domain because, in addition to considering
parameters such as luminosity, contrast, or brightness of the images to determine their similarity, it
also takes into account the structure of the images, which is perhaps the most crucial aspect when
comparing X-ray images.</p>
        <p>Class 2
0.246</p>
        <p>Average SSIM per cluster</p>
        <p>Figure 10 reports an evaluation of the quality of the resulting clusters, measured through the average
SSIM similarities within each cluster. As we can observe, all the clusters have comparable similarity,
denoting a balanced distribution of the cases.</p>
        <p>Using the clustering approach has the additional advantage that the image corresponding to the
prototypical case of the cluster can be presented to the user as an explanation example. This way, our
explanation approach not only uses the heat maps provided by Grad-CAM and the textual explanations
generated by the LLMs but also provides an explanation case to ease the understanding of the fracture.</p>
        <p>Recall that a simple search by SSIM will not provide an accurate classification of a given subtype of
fracture. SSIM retrieves images that are structurally similar, but it may not capture the underlying
differences between fractures. Since the subtype explanations are generated based on an SSIM search,
they may not be reliable or precise, whereas the general explanations always are, as the images are taken
from the same label.</p>
        <p>Finally, Figure 11 shows an example of the retrieval stage: the most similar image to a given one is
displayed, along with the cluster where this latter image is located. The classifier’s prediction is used to
determine among which images to search for the most similar one since, as mentioned earlier, the query
is only compared with images within the same class.</p>
      </sec>
      <sec id="sec-5-3-reuse">
        <title>5.3. Reuse</title>
        <p>The reuse stage is coupled to an LLM to adapt the textual report of the retrieved case. This is useful
to prevent the system from being too repetitive, especially in the beginning, when there are few cases.
Note that the use of generative models to provide textual explanations carries the risk of generating
some invalid explanations, which is why the revision phase of the CBR system is crucial in our proposal.</p>
        <p>To implement this stage, we have tested Mistral and GPT-4 through the HuggingFace and OpenAI
APIs, respectively. The prompts include the textual report obtained by the retrieval stage and various
"knowledge levels" of the user, with which we intend to generate outputs adapted to different types of
users (medical experts or patients/students).</p>
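<p>Assembling such a prompt can be sketched as follows; the template mirrors the prompts shown in Table 2, while the function name is our own illustrative choice:</p>

```python
def build_adaptation_prompt(report: str, expertise: str) -> str:
    # Template following the prompts used with GPT-4 / Mistral: the retrieved
    # textual report is embedded in «...» and the target knowledge level
    # ("Student" or "Expert") is spliced in.
    return (
        "Given the following medical explanation: "
        f"\u00ab{report}\u00bb, generate a text of EQUAL or SHORTER length "
        f"that is suitable for a person with {expertise} knowledge."
    )
```

<p>The resulting string is then sent to the chosen model through the corresponding API.</p>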
        <p>Table 2 shows two different versions generated by GPT-4 of a case base text created by an expert
to explain a basicervical femoral neck fracture. In the second column, the original text is presented,
and in the fourth column, text versions for students and experts, respectively, generated by GPT-4 are
shown. This approach helps us avoid repetitive explanations, as GPT-4 generates different texts each
time. Furthermore, we can adapt the original text depending on the user, whether they are a student or
an expert, thereby enhancing the usability of the system.</p>
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <table>
            <thead>
              <tr>
                <th>Expertise</th>
                <th>Specific diagnosis</th>
                <th>Prompt</th>
                <th>Generated diagnosis</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Student</td>
                <td rowspan="2">A basicervical femoral neck fracture is a specific type of fracture that affects the base of the neck of the femur, at the junction with the trochanteric mass. It is a two-part fracture at the base of the femoral neck that was medial to the intertrochanteric line, exiting above the lesser trochanter but was more lateral than a classic transcervical fracture.</td>
                <td>Given the following medical explanation: «Specific diagnosis», generate a text of EQUAL or SHORTER length that is suitable for a person with Student knowledge.</td>
                <td>A basicervical femoral neck fracture is a break in the bone that occurs where the thigh bone’s neck and its wider part at the top meet. Unlike a common neck fracture that happens further inside, this type takes place closer to the side, above a bony prominence but not as far to the side as some other types.</td>
              </tr>
              <tr>
                <td>Expert</td>
                <td>Given the following medical explanation: «Specific diagnosis», generate a text of EQUAL or SHORTER length that is suitable for a person with Expert knowledge.</td>
                <td>A basicervical femoral neck fracture is a specific type of fracture that affects the base of the neck of the femur, at the junction with the trochanteric mass. It is a two-part fracture at the base of the femoral neck that was medial to the intertrochanteric line, exiting above the lesser trochanter but was more lateral than a classic transcervical fracture.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-5-3">
        <title>5.4. Revise and Retain</title>
        <p>In our approach, the revision phase is not merely a formality but a critical mechanism for refining and
enhancing the performance of our CBR system. With an initial case base comprising only a handful
of seed cases and a reuse stage reliant on LLMs, which may occasionally produce inaccurate textual
descriptions, the revision stage emerges as a pivotal component of our methodology.</p>
        <p>Our system has been designed with a user-centric approach, empowering experts and practitioners
to engage in the learning process actively. One of the key features of our system is its capability to
solicit alternative textual descriptions directly from users during the revision process. This functionality
enables users to provide nuanced insights, corrections, or additional context that our system may have
overlooked or inaccurately represented.</p>
        <p>By incorporating this interactive feedback loop, we not only facilitate user involvement but also
foster a process of continuous improvement. Every correction, refinement, or enhancement suggested
by users serves as a valuable input for the system, contributing to its iterative learning process. Over
time, as the system accumulates diverse user-contributed insights, it evolves into a more robust and
adaptive tool for diagnosing hip fractures from X-ray images.</p>
        <p>This way, the revision stage serves as a cornerstone of our CBR approach, facilitating ongoing
refinement and enhancement through active user engagement. By embracing user feedback and
leveraging the collective intelligence of the medical community, we strive to create a system that not
only meets the needs of its users but also evolves and adapts to deliver increasingly accurate and reliable
diagnostic capabilities.</p>
        <p>Due to the low number of seed cases in the current state of the case base, we allow the direct addition
of new cases to the case base, and we retain each revised case as a new case for future iterations. We
plan to extend the research to evaluate how these new cases can affect the system behavior, as well as
other approaches for the retain phase. It is important to note that the inclusion of new cases potentially
implies recalculating the clusters that organize the case base.</p>
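<p>The revise-and-retain behaviour described here can be sketched as follows; this is a minimal illustration, and the case structure and function name are hypothetical:</p>

```python
def revise_and_retain(case_base, query_case, generated_report, expert_feedback=None):
    # Revise: a correction supplied by the expert overrides the LLM-generated
    # report. Retain: the revised case is added directly to the case base,
    # as the current seed-stage system allows.
    final_report = expert_feedback if expert_feedback is not None else generated_report
    retained = dict(query_case, report=final_report)
    case_base.append(retained)
    return retained
```

<p>As noted above, each retained case may eventually require recalculating the clusters that organize the case base.</p>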
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>Hip fractures on conventional radiographs pose a significant diagnostic challenge. This paper first
presents a neural network model that can classify X-ray hip fractures accurately and focuses on the
need for explanations to increase the trust and interpretability of such a black-box classifier. We propose
a twin surrogate architecture to elucidate the classification model. This architecture comprises two
parallel pipelines. The first pipeline conducts classification and provides a visual explanation using
Grad-CAM. The second pipeline consists of a CBR system to retrieve prototypical explanations, followed
by an LLM that adapts the resulting text to generate textual explanations.</p>
      <p>We use a seed case base of prototypical X-ray images for hip fractures manually annotated with
textual explanations by an expert. The CBR system retrieves a prototypical case and uses an LLM to
generate extended textual explanations. This way, our explanation approach is three-fold:
attribution-based (Grad-CAM), textual (LLMs), and explanation-by-example. While integrated gradients serve
as a valuable tool for highlighting important regions in X-ray images, textual explanations offer a
more comprehensive and personalized approach to understanding. By providing contextual insights,
the interpretation process is more effective. Integrating both approaches can enhance the overall
interpretation process, ensuring accurate diagnosis and informed decision-making in clinical practice.</p>
      <p>As future work, as the case base grows, scaling up this CBR system poses several challenges and
considerations. First of all, storing and managing a vast number of diagnostic texts requires robust
infrastructure and efficient data-handling techniques. Therefore, it is essential to develop a method for
selecting which textual explanations provided by experts are crucial for refining the CBR system or
for combining similar cases. This selection process could involve prioritizing explanations based on
their relevance to common diagnostic scenarios, their impact on improving model performance, or their
alignment with evolving clinical practices. The use of different types of prompts and a deeper study on
how to personalize the textual explanation to the query are both a challenge and an opportunity for the
impact of synergies of CBR and LLMs in the eXplainable AI (XAI) landscape.</p>
      <p>Furthermore, with an expanding dataset of diagnostic texts, another aspect to address is the difficulty
of reorganizing clusters while maintaining an efficient and meaningful clustering structure as new cases are
added. For this reason, it would be necessary to decide when the image groups should be recalculated to
generate better results as the system is used.</p>
      <p>Another challenge related to the dataset growth lies in the process of searching for similar images.
Given an image, this CBR system has to compare it with all the others with the same label. Too much
time could be spent on this task, so a way of speeding up this process becomes necessary.</p>
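<p>One illustrative way to reduce this cost, sketched here as an assumption rather than a method prescribed above, is a coarse-to-fine search: pre-filter candidates with a cheap comparison on downscaled thumbnails and run the full SSIM comparison only on the few survivors:</p>

```python
import numpy as np

def downscale(img, factor=4):
    # Cheap thumbnail by block averaging (assumes dimensions divisible by factor).
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def prefilter_candidates(query_img, images, keep=5):
    # Rank same-label candidates by mean absolute difference on thumbnails and
    # keep only the closest few; just these survivors need the more expensive
    # SSIM comparison.
    q = downscale(query_img)
    dists = [np.abs(downscale(im) - q).mean() for im in images]
    order = np.argsort(dists)
    return [int(i) for i in order[:keep]]
```

<p>The thumbnail pass is linear in the number of cases but far cheaper per comparison, so the expensive metric runs on a small, fixed-size shortlist.</p>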
      <p>On the other hand, collaboration between domain experts and programmers will be crucial in
addressing errors generated by LLMs, which present a significant challenge. Despite their advanced
capabilities, LLMs are not immune to inaccuracies or misinterpretations, so correcting these errors is
essential for ensuring the reliability and accuracy of the generated textual explanations. To make the
most of experts’ feedback and refine the stored diagnostics, it is crucial to establish such a method
promptly once this CBR system gains widespread usage. In line with this, it is essential to share
explanations with experts via meetings and surveys. This not only confirms the accuracy of the
explanations but also ensures the effectiveness of our approach for patients. This step of the process is
designed specifically for medical expert users capable of precisely drafting medical reports.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work is supported by the PERXAI project PID2020-114596RB-C21 funded by
MCIN/AEI/10.13039/501100011033 and the Complutense University of Madrid (Group 921330).
Special thanks to the Orthopedic Surgery and Traumatology department at Hospital Universitario
Virgen de la Victoria and to the Maurice E. Müller Foundation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] L. Malburg, D. Verma (Eds.), Proceedings of the Workshops at the 31st International Conference on Case-Based Reasoning (ICCBR-WS 2023) co-located with the 31st International Conference on Case-Based Reasoning (ICCBR 2023), Aberdeen, Scotland, UK, July 17, 2023, volume 3438 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3438.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] D. B. Leake, D. McSherry, Introduction to the special issue on explanation in case-based reasoning, Artif. Intell. Rev. 24 (2005) 103-108. doi:10.1007/s10462-005-4606-8.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. D. Krogue, K. V. Cheng, K. M. Hwang, P. Toogood, E. G. Meinberg, E. J. Geiger, M. Zaid, K. C. McGill, R. Patel, J. H. Sohn, A. Wright, B. F. Darger, K. A. Padrez, E. Ozhinsky, S. Majumdar, V. Pedoia, Automatic hip fracture identification and functional subclassification with deep learning, Radiology: Artificial Intelligence 2 (2020) e190023. doi:10.1148/ryai.2020190023. PMID: 33937815.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] L. Tanzi, A. Audisio, G. Cirrincione, A. Aprato, E. Vezzetti, Vision transformer for femur fracture classification, Injury 53 (2022) 2625-2634. doi:10.1016/j.injury.2022.04.013.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] N. Twinprai, A. Boonrod, A. Boonrod, J. Chindaprasirt, W. Sirithanaphol, P. Chindaprasirt, P. Twinprai, Artificial intelligence (AI) vs. human in hip fracture detection, Heliyon 8 (2022) e11266. doi:10.1016/j.heliyon.2022.e11266.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Y. Gao, N. Y. T. Soh, N. Liu, G. Lim, D. Ting, L. T.-E. Cheng, K. M. Wong, C. Liew, H. C. Oh, J. R. Tan, N. Venkataraman, S. H. Goh, Y. Y. Yan, Application of a deep learning algorithm in the detection of hip fractures, iScience 26 (2023) 107350. doi:10.1016/j.isci.2023.107350.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] C. Lee, J. Jang, S. Lee, Y. S. Kim, H. J. Jo, Y. Kim, Classification of femur fracture in pelvic x-ray images using meta-learned deep neural network, Scientific Reports 10 (2020) 13694.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] M. Adams, W. Chen, D. Holcdorf, M. W. McCusker, P. D. Howe, F. Gaillard, Computer vs human: Deep learning versus perceptual training for the detection of neck of femur fractures, Journal of Medical Imaging and Radiation Oncology 63 (2019) 27-32. doi:10.1111/1754-9485.12828.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] B. Kim, C. Rudin, J. Shah, The Bayesian case model: A generative approach for case-based reasoning and prototype classification, 2015. arXiv:1503.01161.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] R. C. Schank, D. B. Leake, Creativity and learning in a case-based explainer, Artificial Intelligence 40 (1989) 353-385. URL: https://www.sciencedirect.com/science/article/pii/0004370289900532. doi:10.1016/0004-3702(89)90053-2.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] O. Li, H. Liu, C. Chen, C. Rudin, Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions, 2017. arXiv:1710.04806.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] E. M. Kenny, M. T. Keane, Explaining deep learning using examples: Optimal feature weighting methods for twin systems using post-hoc, explanation-by-example in XAI, Knowl. Based Syst. 233 (2021) 107530. doi:10.1016/J.KNOSYS.2021.107530.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] C. Ford, M. T. Keane, Explaining classifications to non-experts: An XAI user study of post-hoc explanations for a classifier when people lack expertise, in: J. Rousseau, B. Kapralos (Eds.), Pattern Recognition, Computer Vision, and Image Processing. ICPR 2022 International Workshops and Challenges, Montreal, QC, Canada, August 21-25, 2022, Proceedings, Part III, volume 13645 of LNCS, Springer, 2022, pp. 246-260. URL: https://doi.org/10.1007/978-3-031-37731-0_15.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] E. G. Meinberg, J. Agel, C. S. Roberts, M. D. Karam, J. F. Kellam, Fracture and dislocation classification compendium-2018, Journal of Orthopaedic Trauma 32 (2018) S1-S10.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] T. v8, Proximal femur fracture detection and classification dataset, 2023. URL: https://universe.roboflow.com/thesisyolo-v8/proximal-femur-fracture-detection-and-classification. Visited on 2024-03-23.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Leemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Borisov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          ,
          <article-title>A consistent and efficient evaluation strategy for attribution methods</article-title>
          ,
          <source>Proceedings of Machine Learning Research</source>
          <volume>162</volume>
          (
          <year>2022</year>
          )
          <fpage>18770</fpage>
          -
          <lpage>18795</lpage>
          . URL: https://proceedings.mlr.press/v162/rong22a.html.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Sekh</surname>
          </string-name>
          ,
          <article-title>Is grad-cam explainable in medical images?</article-title>
          ,
          <source>CoRR abs/2307.10506</source>
          (
          <year>2023</year>
          ). doi:10.48550/ARXIV.2307.10506. arXiv:2307.10506.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Bovik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Sheikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Simoncelli</surname>
          </string-name>
          ,
          <article-title>Image quality assessment: from error visibility to structural similarity</article-title>
          ,
          <source>IEEE Transactions on Image Processing</source>
          <volume>13</volume>
          (
          <year>2004</year>
          )
          <fpage>600</fpage>
          -
          <lpage>612</lpage>
          . doi:10.1109/TIP.2003.819861.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>