IRLab@IIT_BHU at MEDIQA-MAGIC 2024: Medical Question Answering using a Classification Model

Arvind Agrawal1,*, Sukomal Pal2

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
arvind.agrawal.cse19@itbhu.ac.in (A. Agrawal); spal.cse@itbhu.ac.in (S. Pal)

Abstract
This paper presents our submission to the MEDIQA 2024 Multilingual and Multimodal Medical Answer Generation (M3G) shared task [1]. The paper presents two types of approaches: 1) generation-based and 2) classification-based. The generation-based model passes the query title and content as text embeddings and the images as visual embeddings in a prompt to a pre-trained LLM. The classification-based model extracts medically relevant NER tags from the queries using pre-trained NER models and converts these tags and the images to embeddings using CLIP text and vision encoders. These embeddings are passed through a Bi-LSTM and an MLP to obtain final representations, which are combined to form query embeddings. The query and label embeddings are used to train the model with a triplet loss, and the answer label is predicted as the label embedding most similar to the query embedding under cosine similarity. The generative approach performs poorly compared to the classification approach because little training data is available. Our classification-based approach uses manually labeled data (160 labels) to predict the test-set answers, achieving a deltaBLEU score of 4.829 and ranking 2nd on the leaderboard.

Keywords
Med-VQA, MAGIC, deltaBLEU

1. Introduction

Telemedicine consultation for dermatology became very common during the pandemic to lessen the risk of human-to-human contact [2],[3]. Patients consulted doctors remotely, for example by phone, which proved to be a viable solution, and people have come to trust telemedicine consultation. AI is increasingly being integrated into telemedicine consultations through medical VQA systems, which can assist both doctors and patients. Many medical VQA systems have been developed using both classification- and generation-based approaches. However, these models were built on the medical VQA datasets available, which cater specifically to radiology [4], pathology [5], orthopedics, and gastrointestinal imaging [6]. In dermatology, by contrast, little exploration has been done because the available datasets are very low-resource, and traditional systems are unlikely to perform well on the low-resource dataset provided by the organizers. The consumer answering systems developed for dermatology [7],[8] focused only on text (questions and textual context) and did not explore the vision modality (images, videos). This prevents such models from exploiting visual features that capture fine details which are often difficult for the user to describe in text.

This paper focuses on developing a model capable of generating free-form text in response to a question asked by the user, specifically in clinical dermatology. The proposed model is able to consider the vision modality but does not necessarily require it. The work described in this paper is presented as our participation in the ImageCLEF 2024 MEDIQA-MAGIC [1] shared task. The task addresses the problem of Multimodal And Generative TelemedICine (MAGIC) in the area of dermatology. The challenge is to generate an appropriate textual response to the query [9] asked by the user, given the clinical context provided in the form of both text and images. In this paper, we propose both classification- and generation-based approaches. The generation-based approach is based on the work by Bazi et al. [10].
The classification-based approach was tried because the generation-based approach could not produce good results on the low-resource data provided by the organizers. The classification-based approach was used to predict answer classes corresponding to the answers, which were manually labeled into 160 classes. The predicted class was later converted into long-form text using a manually prepared label → long-text answer mapping. Our approach performed 2nd best in the competition with a deltaBLEU score of 4.829.

The paper is organized as follows: we present a literature review in Section 2. In Section 3, we provide details of the dataset provided. In Section 4, we explain how we pre-processed the data to fit our needs. In Section 5, we provide the details of our participation in the ImageCLEF 2024 MAGIC shared task. In Section 6, we present the results and our corresponding analysis. Following this, the paper is concluded with future work in Section 7.

2. Related works

Telemedicine consultation became a go-to option for many during the pandemic. Many studies described the experiences of patients who had a consultation without human-to-human contact, and most found it satisfying. This not only decreased the importance of in-person visits but also opened many new possibilities. Many people have started integrating AI with consumer answering systems in the medical domain. These systems can be divided into two categories: classification-based and generation-based approaches.

Classification-based Medical-VQA categorizes questions and answers for efficient responses, utilizing techniques like CNN, RNN, Bi-LSTM, and transformers to predict classes. Key contributions include using CNNs for visual feature extraction from medical images alongside RNNs for question processing [11]. Hierarchical deep multimodal networks enhance efficacy in classification and response generation through hierarchical attention mechanisms [12]. MMBert uses a transformer-style architecture for richer image and text representations [13]. Multimodal multi-head self-attention combines text and image embeddings for classification [14]. Caption-aware medical VQA integrates image-captioning models with BAN for superior performance [15]. A new dataset focusing on chest radiography images introduces relation graphs for improved reasoning [16].

Generation-based Medical Visual Question Answering (MedVQA) approaches focus on generating precise, contextually appropriate free-form text responses using advanced deep-learning techniques. The Q2ATransformer, though claimed to be generative, faces computational challenges as the number of classes increases, utilizing learnable answer class embeddings and a SWIFT encoder for fine-grained features [17]. CGMVQA switches between classification and generation models based on the question [18]. Bazi et al.'s method employs an encoder-decoder transformer architecture, integrating image and text features for autoregressive answer generation [10]. MedfuseNet combines CNN and BERT embeddings with an MFB algorithm for feature fusion [19].
Zhou et al. use a joint encoder for image and text embeddings without fusion, fine-tuning on the VQA dataset [20]. Van Sonsbeek et al. employ a pre-trained language encoder and CLIP visual tokens for efficient training [21]. However, all of these works are limited to radiology, pathology, and orthopedics datasets, not dermatology. In dermatology, existing work involves simple classification tasks that do not concern free-form answer generation. We also found one study in which the authors [8] explored the performance of GPT-4V in differentiating between benign lesions and melanoma. The dataset provided by the organizers, however, covers a much larger problem domain along with the difficulty of generating free-form text. Thus, this is a first-of-its-kind task.

3. Dataset Description

The MEDIQA-M3G task organizers provided their own dataset [9] to the participants, which we had to fetch using the Reddit developer API, as the dataset was part of a subreddit in which dermatologists answer the questions asked on it. The exact size of the dataset was not fixed and varied depending on when it was fetched. The dataset was available in the English language. Each query in the dataset contained five main fields:

1. encounter_id: the query ID, used to score the results.
2. image_ids: a list of image IDs uploaded by the user asking the question.
3. query_title_en: the query title in English.
4. query_content_en: the query content, which can provide extra context.
5. responses: three medical professionals/annotators answered the queries; their responses, along with the annotator IDs, are provided here.

Table 1
MEDIQA-M3G dataset details

Dataset      Expected Examples   Fetched Examples   Examples with images and non-deleted queries
Training     435                 347                285
Validation   50                  50                 44
Test         100                 93                 78

The dataset was very noisy: as the queries were asked on a subreddit, there was no particular format for asking questions. Some contained emojis; some had irrelevant information such as "I don't know how to upload more than one image in Reddit". Relevant statistics of the dataset we received are shown in Table 1. Even after fetching the data, some queries had been deleted and some image URLs no longer existed, so the effective counts were reduced to those shown in Table 1. The counts mentioned are for the English samples only.

4. Data-Preprocessing

The dataset [9], as depicted in Table 1, was very small; even combining the training and validation data for training was not sufficient. The dataset [9] was also quite noisy, containing emojis in some query titles and unwanted statements such as "I don't know how to upload two images in Reddit" that are not medically relevant, and thus needed preprocessing to extract medically relevant information before being passed to the model. This section describes the data-preprocessing steps we applied before passing the data to the main model.

4.1. Data Augmentation

We needed to augment the data by generating different titles and contents for the same image. We tried a few methods, such as word synonyms and back-translation, wherein we translated the text to another language and back to English, but none were satisfactory. Because of this, we used the Textgenie repository to augment the data. Textgenie is a GitHub repository for text data augmentation that facilitates the augmentation of text datasets and the creation of comparable samples. It also handles labeled datasets by retaining their labels while generating analogous samples. It relies on several natural language processing techniques, including paraphrase generation, BERT mask filling, and converting passive-voice constructions to active voice, and is presently available only for English. Using it, we obtained at least three augmented titles and contents (when they exist, per the condition stated in Section 4.2) for each query. For the validation phase, we augmented only the training queries, but for the final test submission we augmented both the training and validation data and combined them.
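To make this step concrete, the following is a minimal sketch of paraphrase-style augmentation. It is illustrative only: our submission used the Textgenie repository (whose API is not reproduced here), and the paraphrase checkpoint named below is an assumed publicly available model, not part of our pipeline.

```python
# Paraphrase-style augmentation sketch (illustrative; the submission used Textgenie).
from transformers import pipeline

# Assumed paraphrase checkpoint; any seq2seq paraphraser could be substituted here.
paraphraser = pipeline("text2text-generation", model="Vamsi/T5_Paraphrase_Paws")

def augment_query(title: str, content: str, n_variants: int = 3) -> list[str]:
    """Return up to n_variants paraphrases of the concatenated title + content."""
    text = f"paraphrase: {title} {content}"
    outputs = paraphraser(
        text,
        num_return_sequences=n_variants,
        num_beams=max(4, n_variants),
        max_length=256,
    )
    return [o["generated_text"] for o in outputs]

# Example: three alternative phrasings of one training query.
variants = augment_query("Red itchy patch on my arm",
                         "It appeared two days ago and seems to be spreading.")
```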
4.2. Question pre-processing

The objective of this step is to extract medically relevant information from the title and content so that it can be passed to the main model for proper learning. Each query contained a query title (Q_t) and a query content (Q_c). We first concatenated Q_t and Q_c to form q, the only conditions being that the query title was not deleted (displayed as [deleted by user]) and the query content was neither empty nor deleted (displayed as an empty string, [deleted], or [removed]). Then q is passed through an emoji-removal function, since emojis have no medical relevance, to give us Q.

5. Methodology

This section explains the two approaches we tried and their model architectures: a generation-based and a classification-based approach. The classification-based approach is our top run submitted to the task [1].

5.1. Generation-based Model

We first tried a generation-based model and obtained results for it. We fine-tune GPT2-XL [22] to accept the question and image information as a prompt. This model is an implementation of [21]. The text is encoded using GPT2-XL's [22] encoder, and the image is encoded using CLIP's [23] vision encoder. The model architecture is shown in Figure 1.

Figure 1: Model architecture for the generation-based approach.

5.2. Data-Preprocessing for Classification Model

Due to the small size and extremely noisy nature of the dataset [9], we moved to the classification approach, but it required manually labeling the training and validation data and creating a label-answer mapping.

5.2.1. Manual Labelling

For the classification approach, we need classification labels so the model can be trained. For this, we did two things:

1. Classify responses to labels: we manually classified the training and validation responses into answer labels. Each query response can have multiple labels. Collectively, we formed a set of 160 labels.
2. Make a label → descriptive answer mapping: for each of the 160 unique labels, we wrote a descriptive answer with the help of the responses to the training and validation queries and ChatGPT [24].

This manual labeling reduces the task complexity by turning it into a classification task. However, it fails when the test set expects a response corresponding to a label the model has never seen. Additional details of the labels are given in Appendix A.

Figure 2: Model architecture for the classification-based approach.

5.2.2. Preparing Answer Labels for Triplet Loss

The queries provided to us sometimes had more than one positive label from the set of 160 labels. We prepare the data such that each query has only one positive label and one negative label randomly selected from the remaining labels. Because each original query has at least three augmented copies, we assign a single positive label to each copy so that the positive labels are evenly distributed, one at a time. For example, suppose an original query has four augmented copies (five queries in total) and three answer labels L1, L2, and L3. In that case, we assign L1 to two queries, L2 to another two queries, and L3 to the remaining query, along with a randomly selected negative label for each. In this way, labels are assigned to each query and its augmentations, as sketched below.
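The following is a minimal sketch of this assignment step; the helper and variable names are ours for illustration, not taken from the submitted code.

```python
# Round-robin positive labels over a query and its augmentations (Sec. 5.2.2 sketch).
import random
from itertools import cycle

# A few of the 160 manually defined labels, for illustration.
ALL_LABELS = ["eczema", "mole", "acne", "fungal infection", "dryness",
              "cyst", "contact dermatitis", "psoriasis"]

def build_triplet_examples(query_variants, positive_labels, all_labels=ALL_LABELS):
    """Distribute the positive labels evenly over the original query and its
    augmented copies, pairing each with one randomly drawn negative label."""
    label_cycle = cycle(positive_labels)
    examples = []
    for query in query_variants:               # original query + augmented copies
        pos = next(label_cycle)                # one positive label per query variant
        neg = random.choice([l for l in all_labels if l not in positive_labels])
        examples.append({"query": query, "positive": pos, "negative": neg})
    return examples

# Example: five query variants (original + 4 augmentations) and three gold labels.
variants = [f"query variant {i}" for i in range(5)]
triplets = build_triplet_examples(variants, ["eczema", "dryness", "contact dermatitis"])
```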
5.3. Classification-based Model

The model converts the questions obtained in Section 4.2 into medically relevant NER tags. These tags are joined into a sentence and, along with the image, passed through the CLIP encoders to obtain embeddings in the same latent space. The text embedding is then passed through a Bi-LSTM and the image embedding through a Multi-Layer Perceptron (MLP) to obtain the final embeddings. Keeping the two branches separate acts as an ensemble that benefits from both the image and the question, and also accommodates training or validation examples that do not contain an image. The final embeddings are compared with the embeddings of the actual labels (obtained by passing them through the CLIP [23] text encoder). The model (shown in Fig. 2) can be divided into five parts, each explained separately below.

5.3.1. Question tokenization

The Q obtained in Section 4.2 contains the actual text entered by the user in the query, but it still includes some medically irrelevant information, as noted in Section 4. We considered removing this manually, but that would not be justifiable for the task, as it could introduce bias given that we are not medical professionals. Instead, we extracted medical NER tags {t_1, t_2, ..., t_{l_t}} from the queries with the help of pre-trained medical NER models [25] and the Clinical-AI-Apollo/Medical-NER model on Hugging Face. For the first NER tokenizer, we kept the tokens belonging to the ['Disease_disorder', 'Sign_symptom', 'Biological_structure', 'Coreference', 'Detailed_description', 'Color', 'Medication', 'Therapeutic_procedure', 'Shape'] token categories. For the second NER tokenizer, we kept the tokens belonging to the ['DISEASE_DISORDER', 'BIOLOGICAL_STRUCTURE', 'SIGN_SYMPTOM', 'DETAILED_DESCRIPTION', 'MEDICATION'] token categories. After obtaining the NER tags from both tokenizers, we removed duplicate tags and any stop-words the tokenizers had picked up. This list of NER tags forms the actual question tags, and this is done separately for each augmented data example obtained in Section 4.1. The tags provide essential medical information such as symptoms, descriptions, color, shape, etc. Some training examples did not have any medical NER tags belonging to the chosen token categories, so for those examples we took all the words of the query as the tokens.

5.3.2. CLIP Encoders

The question tags obtained in Section 5.3.1 are combined into a sentence with a space as the delimiter; this sentence and the image corresponding to the query are passed separately through the pre-trained CLIP [23] model with a ViT backbone to obtain an embedding for each. The CLIP model with a ViT backbone was chosen because it is a multimodal encoder that maps both text and images into the same latent space. Since the embeddings already share a latent space, we do not need an additional MLP to project them into a common space. The answer labels are also passed through the CLIP encoder to obtain label embeddings, which are later used to calculate the triplet loss.
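A minimal sketch of the tag extraction and encoding steps is given below; the model checkpoints follow the ones named above, but the code itself is illustrative rather than our exact implementation.

```python
# Sketch of NER tag extraction (Sec. 5.3.1) and CLIP encoding (Sec. 5.3.2).
import torch
from PIL import Image
from transformers import pipeline, CLIPModel, CLIPProcessor

# Token categories kept for the Clinical-AI-Apollo/Medical-NER tagger (second tokenizer).
KEEP = {"DISEASE_DISORDER", "BIOLOGICAL_STRUCTURE", "SIGN_SYMPTOM",
        "DETAILED_DESCRIPTION", "MEDICATION"}

ner = pipeline("token-classification",
               model="Clinical-AI-Apollo/Medical-NER",
               aggregation_strategy="simple")

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # 512-d embeddings
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_tags(query: str, max_tags: int = 20) -> list[str]:
    """Keep entity spans from the allowed categories, deduplicated and capped at max_tags."""
    tags = []
    for ent in ner(query):
        word = ent["word"].strip().lower()
        if ent["entity_group"] in KEEP and word not in tags:
            tags.append(word)
    return (tags or query.split())[:max_tags]   # fall back to all query words if no tags found

def encode(query: str, image_path=None):
    """Return the CLIP embedding of the tag 'sentence' and, if present, the image embedding."""
    tag_sentence = " ".join(extract_tags(query))
    text_in = proc(text=[tag_sentence], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        text_emb = clip.get_text_features(**text_in)                     # shape (1, 512)
        image_emb = None
        if image_path is not None:
            img_in = proc(images=Image.open(image_path), return_tensors="pt")
            image_emb = clip.get_image_features(**img_in)                # shape (1, 512)
    return text_emb, image_emb
```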
5.3.3. Bi-LSTM and MLP layers

The text embedding is passed through a Bi-LSTM layer, and the image embedding is passed through an MLP layer. The text embedding comes from a sentence with no semantic structure; it is simply a collection of medically relevant words. Passing it through a Bi-LSTM makes the embedding more comparable to the label embeddings when using the triplet loss. The image embedding is passed through an MLP so that it becomes comparable to the text embedding obtained from the Bi-LSTM.

5.3.4. Triplet Loss

We train the model using the triplet loss [26] based on cosine similarity. The final query embedding is the average of the text and image embeddings obtained from the Bi-LSTM and the MLP, respectively. If an example has no associated image, the text embedding alone is used as the query embedding. This query embedding acts as the anchor. To obtain the positive and negative embeddings, we convert the positive and negative labels described in Section 5.2.2 into label embeddings by passing them through the CLIP [23] encoder. The anchor, positive, and negative embeddings are then passed to the triplet loss function, which computes the loss from the cosine similarities between the pairs of embeddings. We also multiply the cosine similarity by the class weight. The class weight for class i is calculated as:

w_i = n / (k · n_i)    (1)

where:
• n is the total number of samples after data augmentation,
• k is the total number of classes,
• n_i is the number of samples in class i.

5.3.5. Answer Generation

To generate the answer in the validation and test phases, we first obtain the query embedding as explained in Section 5.3.4. The query embedding is used to compute the cosine similarity with each of the 160 label embeddings, and the label with the highest cosine similarity is chosen as the answer label. The final descriptive answer is obtained through the label → descriptive answer mapping explained in Section 5.2.1. This is the model design for our top-performing run in the shared task. Our second-best-performing model did not have the Bi-LSTM or MLP layers described in Section 5.3.3 and used the CLIP embeddings directly for the retrieval step described here.

6. Experiments and Results

This section contains information about the training setup, the experiments run, and the results obtained.

6.1. Training setup

We used the PyTorch framework and a pre-trained CLIP model [23] with a ViT backbone as our text and image encoder; it produces embeddings of size 512. Because the label encodings are also obtained through the CLIP model [23] with a ViT backbone, they are likewise of size 512. The hyperparameters were initially selected based on analyses of the dataset and later adjusted according to empirical observations. We used the Adam optimizer for training. Training began with a linear warm-up over 500 steps, followed by a learning rate of 1e-4; all other Adam settings were kept at their defaults. We set the maximum query-tag limit to 20 for the training, validation, and test phases. The model was trained with a batch size of 1 and gradient accumulation across up to 5 iterations, because beyond that the model was overfitting.
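To make the objective concrete, the following sketch shows one way the head, the class-weighted cosine-similarity triplet loss of Equation (1), and the retrieval step of Section 5.3.5 fit together. It is a simplified reading of the description rather than the exact submitted code; in particular, the way the class weight scales the positive cosine similarity is one plausible formulation, and the margin and layer sizes are illustrative choices.

```python
# Sketch of the classification model's head, training loss, and inference (Secs. 5.3.3-5.3.5).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryEncoderHead(nn.Module):
    """Bi-LSTM over the CLIP text embedding and an MLP over the CLIP image embedding."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.img_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, text_emb, image_emb=None):
        # text_emb: (B, 512) CLIP embedding of the NER-tag "sentence"
        text_out, _ = self.bilstm(text_emb.unsqueeze(1))    # (B, 1, 512)
        query = text_out.squeeze(1)
        if image_emb is not None:                           # average the branches if an image exists
            query = (query + self.img_mlp(image_emb)) / 2
        return query                                        # final query embedding (the anchor)

def weighted_triplet_loss(anchor, positive, negative, class_weight, margin=0.2):
    """Cosine-similarity triplet loss with the positive similarity scaled by w_i = n / (k * n_i)."""
    pos_sim = class_weight * F.cosine_similarity(anchor, positive)
    neg_sim = F.cosine_similarity(anchor, negative)
    return F.relu(neg_sim - pos_sim + margin).mean()

@torch.no_grad()
def predict_label(query_emb, label_embs, labels):
    """Inference: return the label whose CLIP embedding is most similar to the query embedding."""
    sims = F.cosine_similarity(query_emb, label_embs)       # one similarity per candidate label
    return labels[int(sims.argmax())]
```

Here `class_weight` corresponds to the weight w_i of the positive label's class; the predicted label is then mapped to its descriptive answer as in Section 5.2.1.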
The following section details the test results as provided by the task organizers, providing insights into the effectiveness and applicability of the approaches.

6.2. Results and Analysis

The results, as provided by the task organizers, are given in Tables 2 and 3. For our top run, we were ranked second according to the deltaBLEU score [27]. The third-ranked run was also ours; it was a simple CLIP encoder whose embeddings were compared directly with the label embeddings. The top-ranked run had a deltaBLEU score of 8.6293; it fine-tuned MoonDream2, a 1.86 B parameter vision-language model. Due to resource limitations, we could not fine-tune any large LLM. We also tried the generative approach described in Section 5, and its results are given in Table 3.

Table 2
Test run scores for classification-based approaches

Type of Model                Features Used                                     Delta-BLEU Score   BERT Score   Test-Run Rank
Classification Based Model   Bi-LSTM, MLP + Triplet Loss + Data Augmentation   4.829              0.839        2
                             Only CLIP Model                                   4.819              0.838        3
                             Bi-LSTM, MLP + Triplet Loss                       4.231              0.838        8

Table 3
Test run scores for generation-based approaches

Type of Model            Features Used               Delta-BLEU Score   BERT Score   Test-Run Rank
Generation Based Model   With Data Augmentation      2.525              0.829        12
                         Without Data Augmentation   1.683              0.840        16

Analyzing the results in Tables 2 and 3 yields several observations about the performance of our models on the provided dataset. The classification-based models performed better than the generative ones. Within the classification-based approach, the model with the Bi-LSTM and MLP layers performed slightly better than the pre-trained CLIP model with a ViT backbone alone. This may suggest that the sentence embedding of the tokenized words does not fully represent all the tokens, since the "sentence" carries no semantic meaning; it may also suggest that the Bi-LSTM and MLP help the model learn text and image embeddings that are easier to compare with the labels.

Figure 3: Effect of data augmentation.
Figure 4: Comparison of scores: (a) BERT score, (b) BLEU score.

The results in Tables 2 and 3 also suggest that data augmentation helps: the classification-based run without data augmentation achieved a deltaBLEU score of 4.231, whereas with data augmentation it achieved 4.829. As shown in Fig. 3, data augmentation also helped the generative approach, with the deltaBLEU score increasing from 1.683 to 2.525.

7. Conclusion and Future works

This paper described our participation in ImageCLEF 2024 MEDIQA-MAGIC, for which we ultimately adopted classification-based Med-VQA. We began with the generation-based model, but the initial results were not promising due to the extremely noisy nature of the training data, so we quickly shifted to a classification-based approach. The classification-based approach worked in this case but reached its limit: we tried several models and the score did not increase further. This is mainly due to the high number of answer classes (160), which will only grow as the answer domain expands and will cause computational overhead. Thus, generative Med-VQA is the way forward, as it is not limited by the domain of answers. In future work, we can fine-tune pre-trained multimodal generative models with compute-efficient techniques and explore the feasibility of pre-training the model on other rich dermatology datasets through masked training.

References

[1] W. Yim, A. Ben Abacha, Y. Fu, Z. Sun, M. Yetisgen, F. Xia, Overview of the mediqa-magic task at imageclef 2024: Multimodal and generative telemedicine in dermatology, in: CLEF 2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024.
[2] K. Pogorzelska, L. Marcinowicz, S. Chlabicz, Understanding satisfaction and dissatisfaction of patients with telemedicine during the covid-19 pandemic: An exploratory qualitative study in primary care, Plos one 18 (2023) e0293089.
[3] D. Giansanti, Advancing dermatological care: A comprehensive narrative review of teledermatology and mhealth for bridging gaps and expanding opportunities beyond the covid-19 pandemic, in: Healthcare, volume 11, MDPI, 2023, p. 1911.
[4] B. Liu, L.-M. Zhan, L. Xu, L. Ma, Y. Yang, X.-M. Wu, Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering, in: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), IEEE, 2021, pp. 1650–1654.
[5] X. He, Y. Zhang, L. Mou, E. Xing, P. Xie, Pathvqa: 30000+ questions for medical visual question answering, arXiv preprint arXiv:2003.10286 (2020).
[6] A. Ben Abacha, S. A. Hasan, V. V. Datla, J. Liu, D. Demner-Fushman, H. Müller, Vqa-med: Overview of the medical visual question answering task at imageclef 2019, in: Working Notes of CLEF 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland, 2019. URL: https://ceur-ws.org/Vol-2380/paper_272.pdf.
[7] Z. Li, K. C. Koban, T. L. Schenck, R. E. Giunta, Q. Li, Y. Sun, Artificial intelligence in dermatology image analysis: current developments and future trends, Journal of clinical medicine 11 (2022) 6826.
[8] K. Cirone, M. Akrout, L. Abid, A. Oakley, Assessing the utility of multimodal large language models (gpt-4 vision and large language and vision assistant) in identifying melanoma across different skin tones, JMIR dermatology 7 (2024) e55508.
[9] W. Yim, Y. Fu, Z. Sun, A. Ben Abacha, M. Yetisgen, F. Xia, Dermavqa: A multilingual visual question answering dataset for dermatology, CoRR (2024).
[10] Y. Bazi, M. M. A. Rahhal, L. Bashmal, M. Zuair, Vision–language model for visual question answering in medical imagery, Bioengineering 10 (2023) 380.
[11] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[12] D. Gupta, S. Suman, A. Ekbal, Hierarchical deep multi-modal network for medical visual question answering, Expert Systems with Applications 164 (2021) 113993.
[13] Y. Khare, V. Bagal, M. Mathew, A. Devi, U. D. Priyakumar, C. Jawahar, Mmbert: Multimodal bert pretraining for improved medical vqa, in: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), IEEE, 2021, pp. 1033–1036.
[14] V. Joshi, P. Mitra, S. Bose, Multi-modal multi-head self-attention for medical vqa, Multimedia Tools and Applications (2023) 1–24.
[15] F. Cong, S. Xu, L. Guo, Y. Tian, Caption-aware medical vqa via semantic focusing and progressive cross-modality comprehension, in: Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 3569–3577.
[16] X. Hu, L. Gu, K. Kobayashi, Q. An, Q. Chen, Z. Lu, C. Su, T. Harada, Y. Zhu, Interpretable medical image visual question answering via multi-modal relationship graph learning, arXiv preprint arXiv:2302.09636 (2023).
[17] Y. Liu, Z. Wang, D. Xu, L. Zhou, Q2atransformer: Improving medical vqa via an answer querying decoder, in: International Conference on Information Processing in Medical Imaging, Springer, 2023, pp. 445–456.
[18] F. Ren, Y. Zhou, Cgmvqa: A new classification and generative model for medical visual question answering, IEEE Access 8 (2020) 50626–50636.
[19] D. Sharma, S. Purushotham, C. K. Reddy, Medfusenet: An attention-based multimodal deep learning model for visual question answering in the medical domain, Scientific Reports 11 (2021) 19826.
[20] Y. Zhou, J. Mei, Y. Yu, T. Syeda-Mahmood, Medical visual question answering using joint self-supervised learning, arXiv preprint arXiv:2302.13069 (2023).
[21] T. Van Sonsbeek, M. M. Derakhshani, I. Najdenkoska, C. G. Snoek, M. Worring, Open-ended medical visual question answering through prefix tuning of language models, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2023, pp. 726–736.
[22] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[23] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR, 2021, pp. 8748–8763.
[24] OpenAI, Chatgpt: Conversational agent, https://www.openai.com/chatgpt, 2024. Accessed: 2024-05-17.
[25] S. Raza, D. J. Reji, F. Shajan, S. R. Bashir, Large-scale application of named entity recognition to biomedicine and epidemiology, PLOS Digital Health 1 (2022) e0000152.
[26] K. Q. Weinberger, L. K. Saul, Distance metric learning for large margin nearest neighbor classification, Journal of machine learning research 10 (2009).
[27] M. Galley, C. Brockett, A. Sordoni, Y. Ji, M. Auli, C. Quirk, M. Mitchell, J. Gao, B. Dolan, deltableu: A discriminative metric for generation tasks with intrinsically diverse targets, arXiv preprint arXiv:1506.06863 (2015).

A. Manual Labelling

This section lists all the labels we decided on manually based on the training and validation answers. There are 160 labels in total, identified by ourselves without any professional help or prior experience. Some labels were easy to identify because they were mentioned directly in the answer, but others were difficult because the answer text contained multiple possibilities owing to the limited information available from a single image and sparse query content. Where there was no single label but multiple possibilities, we either assigned multiple labels or a single label advising the user to consult a dermatologist, as suggested by the annotator in the provided answer texts. The method fails when the test set requires a label that was not part of the manually picked set.
The complete list of labels is as follows: cyst, blind pimple, pimple, folliculitis, solar lentigo, contact eczema, eczema, common wart, sun exposure, coarse wrinkle, eczema due to dry skin, lip dryness, itchy scalp, keratoacanthoma, pityriasis versicolor, rosacea, tan, callus around heel, leukoderma, hormonal acne, acne, fungal infection, ringworm, pityriasis rosea, ingrown hair, mole, skin cancer, cherry angioma, fungal infection due to nail thickening, purpuric spot, solar keratosis, keratosis, seborrheic keratosis, scratching, itching, birthmark, nevus, nodular melanoma, melanoma, urticaria, bug bite, insect bite, dyshidrotic eczema, contact dermatitis, heat rash, lipoma, cat and dog fleas, angiofibroma, ecchymosis, HTD (habit-tic deformity), dermatologist for lesion examination, dermatologist consultation, alopecia areata, nevus sebaceous, spider veins, photodermatoses, lymphatic malformation, comedones, healing, milia, xanthelasma, chalazia, hyperpigmentation, longitudinal melanonychia, pyogenic granuloma, post inflammatory hyperpigmentation, dermatitis artifacts, dermatitis, atrophoderma, athlete's foot, pseudofolliculitis, subungual hematoma, neutrophilic dermatoses, discoid eczema, atopic dermatitis, acneiform eruptions, keratosis pilaris, tinea versicolor, dermatofibroma, viral rash, angular cheilitis, flushing skin because of alcohol, compound nevus, rash, morphea, inflammatory rash, shingles, dandruff, psoriasis, trauma, lip licker's dermatitis, aquagenic wrinkle, cystic fibrosis, eclipse nevi, nummular eczema, eczema due to working in water, hairy tongue, syringomas, lip biting, irritated hair follicle, cyst under skin, spider angioma, inflammatory acne, Schamberg's purpuric dermatosis, vasculitis, sebaceous hyperplasia, tinea corporis, granuloma annulare, viral infection, hive, mast cells in body, herpes, herpetic whitlow, comedonal acne, angioma, drug reaction, syphilis, infection on skin, trichostasis spinulosa, erosive pustular dermatosis, retention hyperkeratosis, inflammation of the nail fold, dermatitis herpetiformis, sebaceous cyst, observe, skin tag, nail trauma, dryness, molluscum, friction, blood collection, tick bites, irritated mole, ophthalmologist consultation, hand sweating, hidradenitis suppurativa, corn, cyst due to mucus, herpes simplex, seborrheic dermatitis, abscess, allergy due to sun, bruise, keratolysis exfoliativa, idiopathic guttate hypomelanosis, infection at hair follicle, eye dark circles, erythema, diabetes, atrophic scars, periorificial dermatitis, normal, HIV, scar, skin peeling, telangiectasia, genetic predisposition, irritated skin, chicken pox, furuncle.

Table 4
Training labels details

Label                                  Frequency
eczema                                 50
mole                                   18
fungal infection                       17
acne                                   17
dryness                                17
cyst                                   14
dermatologist for lesion examination   12
skin cancer                            12
dermatitis                             12
atopic dermatitis                      11

Table 5
Validation labels details

Label                                 Frequency
eczema                                8
insect bite                           4
fungal infection                      4
mole                                  3
dryness                               3
post inflammatory hyperpigmentation   2
itching                               2
psoriasis                             2
bruise                                2
bug bite                              2

Tables 4 and 5 present the most frequent labels encountered in the training and validation data, respectively.