<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IRLab@IIT_BHU at MEDIQA-MAGIC 2024: Medical Question Answering using Classification model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arvind Agrawal</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sukomal Pal</string-name>
        </contrib>
      </contrib-group>
      <abstract>
<p>This paper presents our submission to the MEDIQA 2024 Multilingual and Multimodal Medical Answer Generation (M3G) shared task [1]. The paper presents two types of approaches: 1) generation-based and 2) classification-based. The generation-based model passed the title and content as text embeddings and the images as visual embeddings in a prompt to a pre-trained LLM. The classification model extracted medically relevant NER tags from the queries using pre-trained NER models and converted these tags and the images to embeddings using the CLIP text and vision encoders. These embeddings were passed through a Bi-LSTM and an MLP to obtain final representations, which were combined to form query embeddings. The query and label embeddings were used to train the model with a triplet loss. The answer label was predicted as the label embedding most similar to the query embedding under cosine similarity. The generative approach performs poorly compared to the classification approach because little training data is available. Our classification-based approach utilizes manually labeled data (160 labels) to predict the test-set answers with a deltaBLEU score of 4.829 and was ranked 2nd on the leaderboard.</p>
      </abstract>
      <kwd-group>
        <kwd>Med-VQA</kwd>
        <kwd>MAGIC</kwd>
        <kwd>deltaBLEU</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Telemedicine consultation for dermatology became very common during the pandemic to lessen the
risk of human-human contact [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Patients consulted doctors by phone, which proved a viable solution and increased
people's trust in telemedicine consultation. AI has since been integrated into telemedicine
consultations through medical VQA systems that can assist both doctors and patients. Many
medical-VQA systems have been developed using both classification- and generation-based approaches.
However, these models were built on the available medical VQA datasets, which specifically cater to radiology[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], pathology[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], orthopedics, and the
gastrointestinal datasets[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In dermatology, however, little exploration has been done because the available
datasets are very low-resource, and traditionally developed systems will not perform well on the
low-resource dataset provided by the organizers. The consumer answering
systems developed for dermatology[
systems developed for dermatology[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] focused only on text (questions, textual context) and did not
explore the vision modality (images, videos). This prevents the models from exploiting visual
features that capture fine details which are often difficult for the user
to explain through text.
      </p>
      <p>
        This paper focuses on developing a model capable of generating free-form text in response to a
given question asked by the user, specifically focusing on clinical dermatology. The proposed model
can consider the vision modality but does not strictly require it. The work described
in this paper was carried out as part of our participation in the ImageCLEF-2024-MEDIQA-MAGIC[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] shared task.
The task focuses on the problem of Multimodal And Generative TelemedICine (MAGIC) in the area of
dermatology. The challenge tackled the generation of an appropriate textual response to the query[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
asked by the user, along with the clinical context provided in the form of both text and images.
      </p>
      <p>
        In this paper, we propose both classification and generation-based approaches. The generation-based
approach is based on the work by Bazi et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We turned to the classification-based approach because
the generation-based approach could not produce good results on the low-resource data provided
by the organizers. The classification-based approach predicts answer classes corresponding
to answers that were manually grouped into 160 classes. The predicted class is then converted
into long-form text using a manually prepared label → long-text answer mapping. Our approach
ranked 2nd in the competition with a deltaBLEU score of 4.829.
      </p>
      <p>The paper is organized as follows: we present a literature review in Section 2. In Section 3, we provide
details of the provided dataset. In Section 4, we explain how we pre-processed the data to fit our needs.
In Section 5, we provide the details of our participation in the ImageCLEF-2024 MAGIC shared task. In
Section 6, we present the results and our corresponding analysis. Finally, the paper concludes
with future work in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Telemedicine consultation became a go-to option for many during the pandemic. Many studies
described the experiences of patients who had consultations without human-human contact,
and most found them satisfactory. This not only reduced the need for in-person visits but also opened
new possibilities. Many groups have since integrated AI into consumer answering systems in the medical
domain. These systems can be divided into two categories: classification-based and generation-based
approaches.</p>
      <p>
        Classification-based Medical-VQA categorizes questions and answers for efficient responses, utilizing
techniques like CNN, RNN, Bi-LSTM, and transformers to predict classes. Key contributions include
using CNNs for visual feature extraction from medical images alongside RNNs for question processing
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Hierarchical deep multimodal networks enhance efficacy in classification and response generation
through hierarchical attention mechanisms [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. MMBert uses a transformer-style architecture for
richer image and text representations [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Multimodal-Multihead Self Attention combines text and
image embeddings for classification [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Caption Aware Medical-VQA integrates image-captioning
models with BAN for superior performance [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. A new dataset focusing on chest radiography images
introduces relation graphs for improved reasoning [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Generation-based Medical Visual Question Answering (MedVQA) approaches focus on generating
precise, contextually appropriate free-form text responses using advanced deep-learning techniques. The
Q2ATransformer, though claimed to be generative, faces computational challenges as classes increase,
utilizing learnable answer-class embeddings and a Swin encoder for fine-grained features [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. CGMVQA
switches between classification and generation models based on the question [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Bazi et al.'s
method employs an encoder-decoder transformer architecture, integrating image and text features for
autoregressive answer generation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. MedfuseNet combines CNN and BERT embeddings with an
MFB algorithm for feature fusion [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Zhou, Yuan, and Mei use a joint encoder for image and text
embeddings without fusion, fine-tuning on the VQA dataset [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. van Sonsbeek et al. employ a
pretrained language encoder and CLIP visual tokens for efficient training [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>
        However, all these works are limited to radiology, pathology, and orthopedics datasets, not dermatology.
In dermatology, only simple classification tasks exist, which do not concern answer
generation as free-form text. We also found one study in which the authors[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] explored the performance
of GPT-4V in differentiating between benign lesions and melanoma. The dataset provided by the
organizers, however, covers a much larger problem domain along with the difficulty of generating
free-form text. Thus, this is a first-of-its-kind task.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Description</title>
      <p>
        The MEDIQA-M3G task organizers provided their own dataset[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to the participants, which we had to
fetch using the Reddit developer's API, as the dataset was part of a subreddit in which dermatologists
answer the questions asked there. The exact size of the dataset was not fixed and varied with the
time of fetching. The dataset was available in the English language. Each query in
the dataset contained five main fields:
1. Encounter_id: the query ID, used to score the results.
2. Image_ids: a list of IDs of the images uploaded by the user asking the question.
3. query_title_en: as the name suggests, the query title in the English language.
4. query_content_en: the query content, which can provide some extra context.
5. responses: the responses of the three medical professionals/annotators who answered the queries,
along with the annotator IDs.
      </p>
      <p>The dataset was very noisy: since the queries were asked on a subreddit, there was no particular
format for asking questions. Some contained emojis; some had irrelevant information such as "I don’t know how to
upload more than one image in Reddit". Relevant statistics of the dataset are shown in Table
1. Even after fetching the data, some queries had been deleted and some image URLs no longer existed, so the
effective count was reduced, as reported in Table 1. The counts mentioned are only for the English samples.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Data-Preprocessing</title>
      <p>
        The dataset[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], as depicted in Table 1, was very small; even combining
the training and validation data did not yield a sufficient amount. The dataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] was also quite noisy, as
it included emojis in some query titles and unwanted statements such as "I don’t know how to upload two
images in Reddit" that are not medically relevant; data preprocessing was therefore needed to extract medically
relevant information before passing it to the model. This section describes the data-preprocessing steps
we used to process the data before passing it to the main model.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Data Augmentation</title>
        <p>We needed to augment the data by generating different titles and content for
the same image. We tried a few methods, such as word-synonym substitution and back-translation, wherein we
translated to another language and back to English, but none were satisfactory. We therefore
used the TextGenie repository to augment the data. TextGenie is a GitHub repository for text-data
augmentation that facilitates the augmentation of text datasets and the creation of comparable samples.
Moreover, it manages labeled datasets by retaining their labels while generating analogous
samples. It employs diverse natural language processing techniques, including paraphrase
generation, BERT mask filling, and converting passive-voice constructions to active voice.
Presently, it is available only for the English language. Using it, we obtained at least
three augmented titles and contents (where they exist, per the condition stated in 4.2) for each query. For
validation, we augmented the training queries only; for the final test submission, we augmented the
training and validation data and combined them.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Question pre-processing</title>
        <p>The objective of this step is to extract medically relevant information from the title and content so that
it can be passed to the main model for proper learning. Each query contained a query title (t) and
a query content (c). We first concatenated t and c to form q, the only condition being that the
query title was not deleted (displayed as [deleted by user]) and the query content was neither empty
nor deleted (displayed as an empty string, [deleted], or [removed]). Then q is passed through an
emoji-remover function, as emojis carry no medical relevance, to give us Q.</p>
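<p>The filtering and emoji-removal step described above can be sketched as follows; the function name, marker set, and emoji ranges are illustrative assumptions, not the paper's actual implementation.</p>

```python
import re
from typing import Optional

# Placeholder strings Reddit displays for deleted/removed fields (Sec 4.2)
DELETED_MARKERS = {"[deleted by user]", "[deleted]", "[removed]", ""}

# Rough emoji ranges -- real emoji coverage is broader than this sketch
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]+"
)

def build_query(title: str, content: str) -> Optional[str]:
    """Concatenate title and content into q, then strip emojis to obtain Q.

    Returns None when the title or content is deleted/empty, mirroring the
    filtering condition of Sec 4.2.
    """
    if title.strip() in DELETED_MARKERS or content.strip() in DELETED_MARKERS:
        return None
    q = f"{title.strip()} {content.strip()}"
    # Remove emojis and collapse any leftover whitespace
    return re.sub(r"\s+", " ", EMOJI_RE.sub("", q)).strip()
```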
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Methodology</title>
      <p>
        This section explains the two approaches and model architectures we tried: generation-based and
classification-based. The classification-based approach is our top run submitted in the task
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <sec id="sec-5-1">
        <title>5.1. Generation-based Model</title>
        <p>
          We also tried a generation-based model and obtained results for it. We fine-tune GPT2-xl[
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] to
accept questions and image information as a prompt. This model was an implementation of [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. The
text is encoded using GPT2-xl’s [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] encoder, and the image is encoded by using the CLIP’s [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] vision
encoder. The model architecture is depicted in Figure 1.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Data-Preprocessing for Classification Model</title>
        <p>
          Due to the size and extremely noisy nature of the dataset[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] we moved to the classification approach, but
it required manually labeling the training and validation datasets and creating a label-answer mapping.
        </p>
        <sec id="sec-5-2-1">
          <title>5.2.1. Manual Labelling</title>
          <p>
            For the classification approach, we need classification labels so that the model can be trained. For
this, we did two things:
1. Classify responses to labels: manually classify the training and validation responses into answer
labels. Each query response can have multiple labels. Collectively, we formed a set of 160 labels.
2. Make a label → descriptive answer mapping: for each of the 160 unique labels, we form a
descriptive answer with the help of the responses of the train and valid queries and ChatGPT[
            <xref ref-type="bibr" rid="ref24">24</xref>
            ].
This manual labeling reduces the task complexity by turning it into a classification task. However, it
will fail when the test set expects a response the model has never seen. Additional
details of the labels are given in Appendix A.
          </p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Preparing Answer Labels for Triplet Loss</title>
          <p>The queries provided to us sometimes had more than one positive label from the set of 160 labels. We
prepare the data such that each query has exactly one positive label and one negative label randomly
selected from the list of remaining labels. Because each original query has at least three augmented
queries, we distribute its positive labels over them evenly, one at a time.
For example, suppose we have four augmented queries of an original query and three answer labels
l1, l2, l3 corresponding to it. In that case, we assign l1 to two queries, l2 to another query,
and l3 to the remaining query, along with a randomly selected negative label for each. This is how we
assign labels to each query and its augmentations.</p>
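<p>The round-robin assignment described above can be sketched as follows; function and variable names are illustrative, and the negative label is drawn from the labels not associated with the query, as in Sec 5.2.2.</p>

```python
import random

def assign_triplet_labels(augmented_queries, positive_labels, all_labels, seed=0):
    """Distribute a query's positive labels round-robin over its augmentations.

    Each augmented query receives one positive label and one negative label
    drawn at random from the labels not associated with the query.
    """
    rng = random.Random(seed)
    negative_pool = [lbl for lbl in all_labels if lbl not in positive_labels]
    triplets = []
    for i, query in enumerate(augmented_queries):
        positive = positive_labels[i % len(positive_labels)]  # round-robin
        negative = rng.choice(negative_pool)
        triplets.append((query, positive, negative))
    return triplets
```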
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Classification-based Model</title>
        <p>
          The model converts the questions obtained in 4.2 into medically relevant NER tags. These tags, joined
into a sentence, and the image are passed through the CLIP encoders to obtain embeddings in the same latent
space. The embeddings are then passed through separate multi-layer perceptrons (MLPs) to obtain
final embeddings. Separate MLPs act as a form of ensemble that benefits from
both the image and the question, and also handle training or validation examples that do not contain
an image. The final embeddings are compared with the embeddings of the actual label (obtained by
passing them through the CLIP[
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] text-encoder model). The model (as shown in Fig 2) can be divided
into five parts, each explained separately.
        </p>
        <sec id="sec-5-3-1">
          <title>5.3.1. Question tokenization</title>
          <p>
            The Q obtained in Sec 4.2 contains the actual text entered by the user in the query, but
it still includes some medically irrelevant information, as noted in Sec 4. We considered removing
it manually, but that would not be justifiable for the task, as it could introduce bias since we are
not medical professionals. So we extracted medical NER tags {t1, t2, ..., tn} from these queries with the help of
pre-trained Medical-NER models [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ] and the Clinical-AI-Apollo/Medical-NER model on Hugging Face. For
the first NER tokenizer, we picked the tokens that belonged to [‘Disease_disorder’, ‘Sign_symptom’,
‘Biological_structure’, ‘Coreference’, ‘Detailed_description’, ‘Color’, ‘Medication’, ‘Therapeutic_procedure’,
‘Shape’] token categories. For the second NER tokenizer, we picked the tokens belonging to
[‘DISEASE_DISORDER’, ‘BIOLOGICAL_STRUCTURE’, ‘SIGN_SYMPTOM’, ‘DETAILED_DESCRIPTION’,
‘MEDICATION’] token categories. After obtaining the NER tags from both the tokenizers, we removed
the duplicate tags and any stop-words if the tokenizers had picked them up. This list of NER tags forms
the actual question tags, and this is done separately for each augmented data example obtained in 4.1.
The tags provide essential medical information such as symptoms, description, color, shape, etc. Some
training examples did not have any medical NER tags belonging to any token categories, so we took all
the words of the query as the tokens for that example.
          </p>
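<p>The tag filtering, deduplication, and fallback described above can be sketched as follows; the entity format mimics a Hugging Face token-classification pipeline output, and the stop-word set is an illustrative subset, not the paper's actual list.</p>

```python
# Illustrative stop-word subset; the paper's actual list is not specified
STOPWORDS = {"a", "an", "the", "my", "on", "is"}

# Token categories kept from the first NER tokenizer (Sec 5.3.1)
ALLOWED_GROUPS = {
    "Disease_disorder", "Sign_symptom", "Biological_structure", "Coreference",
    "Detailed_description", "Color", "Medication", "Therapeutic_procedure", "Shape",
}

def extract_question_tags(entities, query_words, max_tags=20):
    """Keep tags from allowed categories, dedupe, and drop stop-words.

    Falls back to all query words when no medical NER tag is found,
    as described in Sec 5.3.1.
    """
    tags = []
    for ent in entities:
        word = ent["word"].lower().strip()
        if ent["entity_group"] in ALLOWED_GROUPS and word not in STOPWORDS and word not in tags:
            tags.append(word)
    if not tags:  # no medical NER tags: use all (non-stop-word) query words
        tags = [w for w in query_words if w.lower() not in STOPWORDS]
    return tags[:max_tags]
```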
        </sec>
        <sec id="sec-5-3-2">
          <title>5.3.2. CLIP Encoders</title>
          <p>
            The question tags obtained in 5.3.1 are combined to form a sentence with space as a delimiter; this
sentence and the image corresponding to the query are passed separately through the pre-trained CLIP[
            <xref ref-type="bibr" rid="ref23">23</xref>
            ]
model with ViT backbone to obtain embeddings for both of them separately. CLIP model with ViT
backbone was chosen because it is a multimodal encoder model that can encode both text and images in
the same latent space. As the embeddings already share a latent space, we need not pass them through
an additional MLP to project them into a common space. The answer labels are also passed through
the CLIP encoder to obtain label embeddings, which are later used to calculate triplet loss.
          </p>
        </sec>
        <sec id="sec-5-3-3">
          <title>5.3.3. Bi-LSTM and MLP layers</title>
          <p>The text embedding is passed through a Bi-LSTM layer, and the image embedding is
passed through an MLP layer. The text embedding comes from a sentence with no semantic
meaning; rather, it is a collection of medically relevant words. It is passed through a Bi-LSTM to
make the embedding more comparable to the label embeddings under the triplet loss. The image
embedding is passed through an MLP so that it becomes comparable to the text embedding produced by
the Bi-LSTM.</p>
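<p>A minimal PyTorch sketch of these two heads; the hidden size and layer shapes are assumptions beyond the 512-dimensional CLIP embeddings, and the averaging fallback for image-less queries follows Sec 5.3.4.</p>

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Bi-LSTM head for text embeddings, MLP head for image embeddings."""

    def __init__(self, dim=512, hidden=256):
        super().__init__()
        # Bi-LSTM over the text embedding, treated as a length-1 sequence;
        # bidirectional output size is 2 * hidden = dim
        self.bilstm = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2 * hidden)
        )

    def forward(self, text_emb, image_emb=None):
        text_out, _ = self.bilstm(text_emb.unsqueeze(1))  # (B, 1, 2*hidden)
        text_out = text_out.squeeze(1)
        if image_emb is None:  # query without an image: text embedding alone
            return text_out
        img_out = self.mlp(image_emb)
        return (text_out + img_out) / 2  # average text and image embeddings
```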
        </sec>
        <sec id="sec-5-3-4">
          <title>5.3.4. Triplet Loss</title>
          <p>
            We train the model by using Triplet loss[
            <xref ref-type="bibr" rid="ref26">26</xref>
            ] through cosine similarity. We obtain the final query
embedding by taking the average of the text and the image embeddings obtained from Bi-LSTM and
MLP, respectively. If the example does not have an image associated with it, we take the text embedding
as the final query embedding. This query embedding will work as an Anchor. To obtain positive
and negative embedding, we convert the positive label and negative label, as mentioned in 5.2.2, to
label embeddings by passing them through a CLIP [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] encoder. The anchor, the positive, and the
negative embeddings are then passed through the triplet loss function that calculates the triplet loss by
taking into account the cosine similarity between the pair of embeddings. We also multiply the cosine
similarity obtained by the class weight. The class weight w_c for class c is calculated as:
          </p>
          <p>w_c = N / (C · n_c) (1)
Where:
• N is the total number of samples after data augmentation,
• C is the total number of classes,
• n_c is the number of samples in class c.</p>
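<p>A minimal pure-Python sketch of the class weight and the class-weighted cosine triplet loss; the margin value and helper names are our assumptions, as the paper does not state them.</p>

```python
import math

def class_weight(n_total, n_classes, class_counts, c):
    """Balanced class weight w_c = N / (C * n_c)."""
    return n_total / (n_classes * class_counts[c])

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weighted_triplet_loss(anchor, positive, negative, w_pos, w_neg, margin=0.2):
    """Hinge triplet loss on class-weighted cosine similarities (margin assumed)."""
    sim_pos = cosine(anchor, positive) * w_pos
    sim_neg = cosine(anchor, negative) * w_neg
    return max(0.0, margin - sim_pos + sim_neg)
```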
        </sec>
        <sec id="sec-5-3-5">
          <title>5.3.5. Answer Generation</title>
          <p>To generate the answer in the validation and test phase, we first obtain the query embeddings as
explained in 5.3.4. The query embedding is used to calculate cosine similarity with each of the 160
label embeddings. The label with the highest cosine similarity is chosen as the answer label. The final
descriptive answer is obtained through the label → Descriptive Answer mapping as explained in
5.2.1.</p>
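<p>The inference step described above can be sketched as a nearest-label lookup; function and variable names are illustrative.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_answer(query_emb, label_embs, label_to_answer):
    """Pick the label whose embedding is most cosine-similar to the query,
    then map it to its manually written descriptive answer (Sec 5.2.1)."""
    best_label = max(label_embs, key=lambda lbl: cosine(query_emb, label_embs[lbl]))
    return label_to_answer[best_label]
```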
          <p>This is the model design for our top-performing run in the shared task. The second-best-performing
model did not have any Bi-LSTM or MLP layer as mentioned in 5.3.3 and directly calculated the answer
through 5.3.5.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experiments and Results</title>
      <p>This section contains information about the training setup, the experiments run, and the results obtained.</p>
      <sec id="sec-6-1">
        <title>6.1. Training setup</title>
        <p>
          We used the PyTorch framework and a pre-trained CLIP model [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] with ViT backbone as our text and
image encoder. It gives us embeddings of size 512. Because we obtain the label encodings through the
CLIP model [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] with ViT backbone, they are also of size 512. The selection of hyperparameters was
based initially on dataset analyses and later adjusted according to empirical observations. We opted for
the Adam optimization algorithm for the training phase. Training began with a linear warm-up
over 500 steps up to a learning rate of 1e-4. All other Adam-optimizer settings were kept at their
defaults. We set the maximum query-tag limit to 20 for the training, validation, and test phases.
The model was trained with a batch size of 1 and gradient accumulation across up to 5 iterations,
beyond which the model overfitted. The following section details the test results as provided
by the task organizers, offering insights into the models' effectiveness and applicability.
        </p>
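<p>A minimal PyTorch sketch of this training setup; only the optimizer choice, 500-step linear warm-up to 1e-4, batch size 1, and 5-step gradient accumulation come from the text, while the model and data below are placeholders.</p>

```python
import torch

# Placeholder model standing in for the paper's encoder heads
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Linear warm-up over 500 steps up to the base learning rate of 1e-4
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / 500)
)

ACCUM_STEPS = 5  # gradient accumulation across up to 5 iterations
optimizer.zero_grad()
for step in range(10):  # placeholder loop over batch-size-1 examples
    emb = torch.randn(1, 512)
    loss = model(emb).pow(2).mean()  # stand-in for the triplet loss
    (loss / ACCUM_STEPS).backward()  # accumulate scaled gradients
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```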
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Results and Analysis</title>
        <p>
          The results, as provided by the task organizers, are given in Tables 2 and 3. For our top run, we were ranked
second according to the delta-BLEU score [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. The 3rd-ranked run was also ours: a simple
CLIP encoder whose embeddings were compared with the label embeddings. The top-ranked run had
a delta-BLEU score of 8.6293; it fine-tuned MoonDream2, a 1.86B-parameter vision-language model. Due
to resource limitations, we could not fine-tune any large LLM. We also tried the generative approach
described in Section 5, whose results are reported in Table 3.
        </p>
        <p>Analyzing the results in Tables 2 and 3 yields several observations about the performance of
our models on the provided dataset. The classification-based models performed better than the generative
ones. Within the classification-based approach, the model with the Bi-LSTM and MLP layers performed slightly
better than the plain pre-trained CLIP model with a ViT backbone. This may suggest that the
sentence embedding of the tokenized words does not correctly represent all the tokens, since the
sentence carries no semantic meaning, and that the Bi-LSTM and MLP help the model learn text and
image embeddings that compare better against the labels.</p>
        <p>The results in Tables 2 and 3 also suggest that data augmentation helps: the run without data
augmentation in the classification-based model achieved a delta-BLEU score of 4.231, whereas with
data augmentation it achieved 4.829. As shown in Fig 3, data augmentation also helped
the generative approach, with the delta-BLEU score increasing from 1.684 to 2.525.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future works</title>
      <p>The paper described our participation in ImageCLEF-2024-MEDIQA-MAGIC, for which we adopted
classification-based Med-VQA. We began with a generation-based model, but initial results were not
promising due to the extremely noisy training data, so we quickly shifted
to a classification-based approach. The classification-based approach worked in this case but reached its
limit: we tried several models, and the score did not increase further. This was mainly due to the high
number of answer classes (160), which will only grow as the answer domain expands and will
cause computational overhead. Thus, generative Med-VQA is the way forward, as it is not limited by the
answer domain. In future work, we plan to fine-tune pre-trained multimodal generative models with
compute-efficient techniques and explore the feasibility of pre-training the model on other rich
dermatology datasets through masked training.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Manual Labelling</title>
      <p>This section lists all the labels we decided on manually, based on the training and validation
answers. There are a total of 160 labels, identified by ourselves without any professional
help or prior experience. Some labels were easy to identify, as they were mentioned straightforwardly
in the answer, but some were difficult because the answer text contained multiple possibilities
owing to the lack of information from a single image and limited query content. Where there was no
single label but multiple possibilities, we either assigned multiple labels or a single label advising
referral to a dermatologist, as suggested by the annotator in the provided answer texts. The method
fails when the test set contains a label that was not part of the manually picked labels.</p>
      <p>The complete list of labels is as follows: cyst, blind pimple, pimple, folliculitis, solar
lentigo, contact eczema, eczema, common wart, sun exposure, coarse wrinkle, eczema due to dry
skin, lip dryness, itchy scalp, keratoacanthoma, pityriasis versicolor, rosacea, tan, callus around heel,
leukoderma, hormonal acne, acne, fungal infection, ringworm, pityriasis rosea, ingrown hair, mole, skin
cancer, cherry angioma, fungal infection due to nail thickening, purpuric spot, solar keratosis, keratosis,
seborrheic keratosis, scratching, itching, birthmark, nevus, nodular melanoma, melanoma, urticaria,
bug bite, insect bite, dyshidrotic eczema, contact dermatitis, heat rash, lipoma, cat and dog fleas,
angiofibroma, ecchymosis, HTD (habit-tic deformity), dermatologist for lesion examination, dermatologist
consultation, alopecia areata, nevus sebaceous, spider veins, photodermatoses, lymphatic malformation,
comedones, healing, milia, xanthelasma, chalazia, hyperpigmentation, longitudinal melanonychia,
pyogenic granuloma, post-inflammatory hyperpigmentation, dermatitis artifacts, dermatitis,
atrophoderma, athlete’s foot, pseudofolliculitis, subungual hematoma, neutrophilic dermatoses, discoid eczema,
atopic dermatitis, acneiform eruptions, keratosis pilaris, tinea versicolor, dermatofibroma, viral rash,
angular cheilitis, flushing skin because of alcohol, compound nevus, rash, morphea, inflammatory rash,
shingles, dandruff, psoriasis, trauma, lip licker’s dermatitis, aquagenic wrinkle, cystic fibrosis, eclipse
nevi, nummular eczema, eczema due to working in water, hairy tongue, syringomas, lip biting, irritated
hair follicle, cyst under skin, spider angioma, inflammatory acne, Schamberg’s purpuric dermatosis,
vasculitis, sebaceous hyperplasia, tinea corporis, granuloma annulare, viral infection, hive, mast cells in
eczema.</p>
      <p>The most frequent labels and their frequencies (Label: Frequency) were: eczema: 8,
mole: 4, fungal infection: 4, acne: 3, dryness: 3, cyst: 2, dermatologist for lesion examination: 2,
skin cancer: 2, dermatitis: 2, atopic dermatitis: 2.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Overview of the mediqa-magic task at imageclef 2024: Multimodal and generative telemedicine in dermatology</article-title>
          ,
          <source>in: CLEF 2024 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Pogorzelska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Marcinowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chlabicz</surname>
          </string-name>
          ,
          <article-title>Understanding satisfaction and dissatisfaction of patients with telemedicine during the covid-19 pandemic: An exploratory qualitative study in primary care</article-title>
          ,
          <source>Plos one 18</source>
          (
          <year>2023</year>
          )
          <article-title>e0293089</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Giansanti</surname>
          </string-name>
          ,
          <article-title>Advancing dermatological care: A comprehensive narrative review of teledermatology and mhealth for bridging gaps and expanding opportunities beyond the covid-19 pandemic</article-title>
          ,
          <source>in: Healthcare</source>
          , volume
          <volume>11</volume>
          , MDPI,
          <year>2023</year>
          , p.
          <fpage>1911</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-M.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering</article-title>
          ,
          <source>in: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>1650</fpage>
          -
          <lpage>1654</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Pathvqa: 30000+ questions for medical visual question answering</article-title>
          ,
          <source>arXiv preprint arXiv:2003.10286</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Datla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Vqa-med: Overview of the medical visual question answering task at imageclef 2019</article-title>
          ,
          <source>in: Working Notes of CLEF</source>
          <year>2019</year>
          , volume
          <volume>2380</volume>
          <source>of CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Lugano, Switzerland,
          <year>2019</year>
          . URL: https://ceur-ws.org/Vol-2380/paper_272.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Koban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Schenck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Giunta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence in dermatology image analysis: current developments and future trends</article-title>
          ,
          <source>Journal of clinical medicine 11</source>
          (
          <year>2022</year>
          )
          <fpage>6826</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cirone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Akrout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Abid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oakley</surname>
          </string-name>
          ,
          <article-title>Assessing the utility of multimodal large language models (gpt-4 vision and large language and vision assistant) in identifying melanoma across different skin tones</article-title>
          ,
          <source>JMIR dermatology 7</source>
          (
          <year>2024</year>
          )
          <article-title>e55508</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Dermavqa: A multilingual visual question answering dataset for dermatology</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M. A.</given-names>
            <surname>Rahhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bashmal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zuair</surname>
          </string-name>
          ,
          <article-title>Vision-language model for visual question answering in medical imagery</article-title>
          ,
          <source>Bioengineering</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>380</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <article-title>Hierarchical deep multi-modal network for medical visual question answering</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>164</volume>
          (
          <year>2021</year>
          )
          <fpage>113993</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Khare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bagal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Devi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. D.</given-names>
            <surname>Priyakumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jawahar</surname>
          </string-name>
          ,
          <article-title>Mmbert: Multimodal bert pretraining for improved medical vqa</article-title>
          ,
          <source>in: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>1033</fpage>
          -
          <lpage>1036</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>V.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bose</surname>
          </string-name>
          ,
          <article-title>Multi-modal multi-head self-attention for medical vqa</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <article-title>Caption-aware medical vqa via semantic focusing and progressive cross-modality comprehension</article-title>
          ,
          <source>in: Proceedings of the 30th ACM International Conference on Multimedia</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3569</fpage>
          -
          <lpage>3577</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Interpretable medical image visual question answering via multi-modal relationship graph learning</article-title>
          ,
          <source>arXiv preprint arXiv:2302.09636</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Q2atransformer: Improving medical vqa via an answer querying decoder</article-title>
          ,
          <source>in: International Conference on Information Processing in Medical Imaging</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>445</fpage>
          -
          <lpage>456</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Cgmvqa: A new classification and generative model for medical visual question answering</article-title>
          ,
          <source>IEEE Access 8</source>
          (
          <year>2020</year>
          )
          <fpage>50626</fpage>
          -
          <lpage>50636</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Purushotham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <article-title>Medfusenet: An attention-based multimodal deep learning model for visual question answering in the medical domain</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>11</volume>
          (
          <year>2021</year>
          )
          <fpage>19826</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Syeda-Mahmood</surname>
          </string-name>
          ,
          <article-title>Medical visual question answering using joint self-supervised learning</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13069</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Van Sonsbeek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Derakhshani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Najdenkoska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Snoek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <article-title>Open-ended medical visual question answering through prefix tuning of language models</article-title>
          ,
          <source>in: International Conference on Medical Image Computing and Computer-Assisted Intervention</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>726</fpage>
          -
          <lpage>736</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog 1</source>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          , Chatgpt: Conversational agent, https://www.openai.com/chatgpt,
          <year>2024</year>
          . Accessed: 2024-05-17.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Raza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Reji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bashir</surname>
          </string-name>
          ,
          <article-title>Large-scale application of named entity recognition to biomedicine and epidemiology</article-title>
          ,
          <source>PLOS Digital Health</source>
          <volume>1</volume>
          (
          <year>2022</year>
          )
          <article-title>e0000152</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. K.</given-names>
            <surname>Saul</surname>
          </string-name>
          ,
          <article-title>Distance metric learning for large margin nearest neighbor classification</article-title>
          ,
          <source>Journal of machine learning research 10</source>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Galley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brockett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sordoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quirk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dolan</surname>
          </string-name>
          ,
          <article-title>deltableu: A discriminative metric for generation tasks with intrinsically diverse targets</article-title>
          ,
          <source>arXiv preprint arXiv:1506.06863</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>