=Paper=
{{Paper
|id=Vol-3180/paper-106
|storemode=property
|title=IUST_NLPLAB at ImageCLEFmedical Caption Tasks 2022
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-106.pdf
|volume=Vol-3180
|authors=Malihe Hajihosseini,Yasaman Lotfollahi,Melika Nobakhtian,Mohammad Mahdi Javid,Fateme Omidi,Sauleh Eetemadi
|dblpUrl=https://dblp.org/rec/conf/clef/HajihosseiniLNJ22
}}
==IUST_NLPLAB at ImageCLEFmedical Caption Tasks 2022==
Malihe Hajihosseini¹, Yasaman Lotfollahi¹, Melika Nobakhtian¹, Mohammad Mahdi Javid¹, Fateme Omidi¹ and Sauleh Eetemadi²

¹ Student at the School of Computer Engineering, Iran University of Science and Technology, Tehran, Islamic Republic of Iran.
² Assistant Professor of Computer Science, School of Computer Engineering, Iran University of Science and Technology, Tehran, Islamic Republic of Iran.

Abstract

We present the models implemented by the IUST_NLPLAB group for the ImageCLEFmedical Caption Task 2022. This task contains two subtasks: Concept Detection and Caption Prediction. In the first subtask, the model should extract the medical concepts contained in radiology images; these concepts can be used for context-based image and information retrieval. In the second subtask, the model predicts a caption for a medical image, which can improve the diagnosis and treatment of diseases by saving time and money and by helping physicians. We used Retrieval Learning, Ensemble Learning, Multi-Label Classification and Deep Learning techniques to rank 1st in the caption prediction subtask, with a margin of 16 BLEU points over the second-ranked group. We also ranked 8th in the concept detection subtask, with a 5 percent gap in F1 score from the top-ranked group.

Keywords

Medical Image Captioning, Concept Detection, Caption Prediction, Deep Learning, Multi-label Classification

1. Introduction

ImageCLEF[1] is part of CLEF (the Conference and Labs of the Evaluation Forum). ImageCLEF was launched in 2003 and added a medical task in 2004. Although it started with four participants, by 2020 it attracted more than one hundred and ten participants from all around the world. ImageCLEF includes various tracks that retrieve and classify visual information using textual and visual data and their combinations. In recent years, ImageCLEF has used the AIcrowd platform (https://www.aicrowd.com/, last accessed 2022-05-27) to publish datasets and receive submissions. In 2022, one person from each group had to register with AIcrowd and could then access the dataset and submit results on the specified dates. Each group could register up to 10 successful submissions for each task; five unsuccessful submissions per group and task were also allowed. In ImageCLEFmedical 2022, two tasks were proposed: Image Captioning and Tuberculosis CT analysis. We selected the Image Captioning task from the ImageCLEFmedical section to participate in the competition. The ImageCLEFmedical Image Captioning task in 2022 contained two subtasks: Concept Detection and Caption Prediction.
These tasks have many uses, but their most important one is to help physicians make accurate diagnoses and to provide automatic descriptive reports of medical images, which saves physicians' time. Each group could participate in one or both subtasks. In this paper, we present the methods that our group, IUST_NLPLAB, from the Iran University of Science and Technology (http://www.iust.ac.ir/en), School of Computer Engineering (http://ce-inter.iust.ac.ir/), Natural Language Processing Laboratory (https://nlplab.iust.ac.ir), used in both subtasks. This is our first time participating in the ImageCLEF competition. We participated in both subtasks and registered ten successful submissions in each of the concept detection and caption prediction subtasks [2]. We won first place in the caption prediction task with a margin of 16 BLEU points over the second group. In the concept detection task, we placed eighth with a gap of about five percent in F1 measure from the first-ranked group. In the following sections, we describe the datasets used, the models developed, and the results we achieved in detail.

2. Task description

This year the ImageCLEF evaluation campaign hosted the 6th edition of the medical image caption task. Unlike some of the previous editions, which contained only the caption prediction task (e.g., 2016 [3]) or only the concept detection task (e.g., 2019 [4]), the 6th edition contained both subtasks, as described below.

2.1. Concept Detection

The goal of this subtask is to train a model, based on the training data provided, to extract UMLS (Unified Medical Language System®)[5] Concept Unique Identifiers (CUIs) from medical images. This helps to better understand the medical concepts contained in medical images and can be used in other tasks such as caption generation. Table 1 lists the 15 most frequent concepts in the training data. The 2022 dataset includes 8374 medical concepts, a significant increase compared to 2021.

Table 1: Most frequent concepts in the training data

| UMLS CUI | UMLS Meaning | Frequency |
| --- | --- | --- |
| C0040405 | X-Ray Computed Tomography | 25989 |
| C1306645 | Plain x-ray | 24389 |
| C0024485 | Magnetic Resonance Imaging | 14622 |
| C0041618 | Ultrasonography | 11147 |
| C0817096 | Chest | 7720 |
| C0002978 | angiogram | 6027 |
| C0000726 | Abdomen | 5772 |
| C0037303 | Bone structure of cranium | 5144 |
| C0221198 | Lesion | 3845 |
| C0205131 | Axial | 3187 |
| C0030797 | Pelvis | 3176 |
| C0023216 | Lower Extremity | 2739 |
| C0238767 | Bilateral | 2722 |
| C0577559 | Mass of body structure | 2341 |
| C0205129 | Sagittal | 2012 |

2.2. Caption Prediction

The goal of caption prediction is to train a model, based on the training data provided, to predict a suitable caption for medical images. It is essential for the model to correctly diagnose and extract sufficient information from medical images to be able to predict the appropriate caption. This subtask is inherently more complex, since it requires combining image processing and natural language processing techniques to generate captions for medical images.

3. Data

The dataset introduced for ImageCLEFmedical Caption 2022 is a subset of the Radiology Objects in COntext (ROCO)[6] dataset. In this version of the dataset, imaging modality information is not included. As in previous versions, the dataset originates from biomedical articles of the PMC OpenAccess[7] subset. This dataset is used for both subtasks: Concept Detection and Caption Prediction.
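Frequency statistics like those in Table 1 can be derived directly from the per-image concept annotations distributed with the data. The sketch below illustrates the idea only; the file name, separator, and column names are assumptions, not the actual distribution format.

```python
# Hypothetical sketch: count CUI frequencies from the "concepts per train image" file.
# File name, separator and column names are assumptions about the distributed format.
from collections import Counter
import pandas as pd

train_concepts = pd.read_csv("concepts_train.csv", sep="\t")   # columns assumed: ID, cuis
counter = Counter()
for cuis in train_concepts["cuis"]:
    counter.update(cuis.split(";"))          # CUIs of one image, assumed ';'-separated

for cui, freq in counter.most_common(15):    # top-15, as listed in Table 1
    print(cui, freq)
```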
The published dataset consists of train, validation, and test images. Also, five Excel files were attached, containing the names of the concepts, the concepts per training image, the concepts per validation image, the caption per training image, and the caption per validation image. The dataset includes 83275 radiology images as the training set, 7645 radiology images as the validation set, and 7601 radiology images as the test set. Figure 1 compares the data sizes of the last four years of the Medical Image Captioning task at the ImageCLEF[8, 9, 10] evaluation campaign. In 2021, the amount of data decreased significantly compared to previous years because only radiology images described by medical experts were used[10]. Table 2 shows some training examples with their corresponding concepts and captions.

Figure 1: Comparison of ImageCLEFmedical Caption data in the last four years. As shown in the chart, the amount of data in 2022 has grown significantly compared to 2021.

3.1. Image Concepts

For each image in this task, a set of concepts specified by Unified Medical Language System® (UMLS)[5] Concept Unique Identifiers (CUIs) is provided. The number of concepts differs from image to image. In the training set, 3718 images have only one concept, while the maximum number of concepts for an image is 50. On average, five concepts are specified per image.

3.2. Image Captions

A caption is provided for each image in the training and validation sets. The organizers mention that the captions were pre-processed in the following four steps:

• Numbers and words containing numbers were removed.
• All punctuation was removed.
• Lemmatization was applied using spaCy.
• Captions were converted to lower-case.

The length of the captions in the training set varies. According to our analysis, 194 training images have one-word captions, while the maximum caption length is 391 words. The most common caption length is ten words, with 3771 images having captions of this length. The average caption length is 19 words. Figure 2 shows the most frequent words in the training-set captions and their frequencies, with and without stop words. We also calculated the TTR (type-token ratio) for the captions. The TTR is obtained by dividing the number of unique words by the size of the text and is a simple measure of lexical diversity[11]. Including stop words, the TTR of this dataset is 0.022; excluding stop words, it is 0.031.

Table 2: Sample images from the training set along with their concepts and captions [12, 13, 14, 15].

| Concepts | Caption |
| --- | --- |
| C1306645 (Plain x-ray); C0817096 (Chest); C0225759 (Lung field) | chest radiograph show multiple tiny nodule white arrow in both lung field. |
| C0024485 (Magnetic Resonance Imaging); C0006104 (Brain); C0740279 (Atrophy of cerebellum) | mri brain show cerebellar atrophy. |
| C1306645 (Plain x-ray); C1140618 (Upper Extremity); C0018563 (Hand); C0205082 (Severe (severity modifier)); C5194734 (Tubular bones); C0041600 (Bone structure of ulna); C1441672 (Observed); C0025526 (Metacarpal bone); C0699952 (Fused) | hand of a patient with acrodysostosbe and multihormonal resbetance severe and generalized brachydactyly through very short and broad tubular bone include ulna can be observe metacarpals iiv be proximally pointed and coneshape proximal phalangeal epiphysbe be prematurely fuse the general appearance of the hand be bulky and stocky courtesy of prof dr jess argente. |
| C0024485 (Magnetic Resonance Imaging); C0037949 (Vertebral column); C0522510 (With intensity); C3853028 (Thoracic Cord); C0054967 (CD6 antigen) | mri spine show hyperintensity in the thoracic cord till level |

Figure 2: Ten most frequent words in the training set. (a) With stop words: given the widespread use of stop words in text, it is natural for them to occupy the top of the chart; however, a few non-stop words like "show" also have high counts, which seems natural considering the use of this word to describe images. (b) Without stop words: most of the remaining words are ones widely used for describing images or in medical writing.

4. Methods

We first present the image preprocessing techniques used for both subtasks. Next, we introduce the models developed for concept detection, followed by the caption prediction models.

4.1. Image Pre-processing

We used various techniques to improve the quality of the medical images. Two of the most important are as follows.

• Histogram Equalization: Histogram Equalization is an image processing method that uses a contrast enhancement technique[16]. In this method, the image histogram is flattened as much as possible and the probability distribution is mapped to a uniform probability distribution. However, this is not always the best way to improve image quality, and in some cases it may not produce a good output because the average brightness of the output image differs significantly from that of the input image.

• Contrast Limited Adaptive Histogram Equalization (CLAHE): CLAHE[17] is also a type of Histogram Equalization in which the contrast amplification is limited. In a typical Histogram Equalization, noise increases in near-constant regions. To solve this problem and improve feature extraction, we use CLAHE, which equalizes the brightness and contrast of the images. This technique divides the image into sections and applies histogram equalization to each section; then the contrast amplification limit, also known as the clip limit, is applied. We use a clip limit of 2 in our models.

Cropping and flipping were also used for data augmentation. For models that do not use data augmentation techniques, the CLAHE method is used to improve image quality. Figure 3 shows an original image from the dataset alongside the results of histogram equalization and CLAHE applied to it.

Figure 3: (a) Normal, (b) Histogram Equalization, (c) CLAHE. The first column shows the original image from the dataset; the second and third columns show the image after the Histogram Equalization and CLAHE techniques are applied. As can be seen, the image quality has improved in some areas (image CC BY, Xiang et al. (2020) [18]).

4.2. Concept Detection

Concept detection is a classification problem. We examine the two main approaches we adopted to solve it.

4.2.1. Information Retrieval Approach

We studied and implemented the methods presented by Jacutprakart et al. [19], who received the second-best F1 score in the concept detection challenge in 2021. Their best approach was to extract image features from a CNN (Convolutional Neural Network) and use k-NN (k-Nearest-Neighbor) to extract the concepts. That is, for a test image, the closest k training images are found, and the concepts of those closest images are then used to assign concepts to the test image.
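A minimal sketch of this retrieval idea, assuming an ImageNet-pretrained DenseNet121 feature extractor and cosine similarity with k = 1 (the configuration discussed next); the image path lists and the `train_concepts` mapping are placeholders rather than our actual pipeline:

```python
# Sketch of 1-NN concept retrieval: embed images with a pretrained CNN, find the most
# similar training image by cosine similarity, and copy its concepts to the test image.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.applications.densenet import preprocess_input

encoder = DenseNet121(include_top=False, weights="imagenet", pooling="avg")

def embed(paths, batch_size=32):
    """Return one global-average-pooled feature vector per image path."""
    feats = []
    for i in range(0, len(paths), batch_size):
        batch = np.stack([
            tf.keras.utils.img_to_array(
                tf.keras.utils.load_img(p, target_size=(224, 224)))
            for p in paths[i:i + batch_size]])
        feats.append(encoder.predict(preprocess_input(batch), verbose=0))
    return np.vstack(feats)

train_emb = embed(train_paths)                     # train_paths / test_paths: placeholder lists
test_emb = embed(test_paths)

# Cosine similarity = dot product of L2-normalised embeddings.
train_emb /= np.linalg.norm(train_emb, axis=1, keepdims=True)
test_emb /= np.linalg.norm(test_emb, axis=1, keepdims=True)
nearest = (test_emb @ train_emb.T).argmax(axis=1)  # index of the single nearest neighbour

predicted = {test_paths[i]: train_concepts[train_paths[j]]   # copy the neighbour's CUIs
             for i, j in enumerate(nearest)}
```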
They obtained the best results by using cosine similarity to calculate the distance between two images, DenseNet121[20] to extract features, and setting k to 1. We attempted to implement this approach. However, because this year's dataset is much larger than the previous year's, we had trouble getting results from k-NN. We tried to train a similar model on a much smaller subset of the dataset, but the results were underwhelming. Using the output to extract labels, similar to a multi-label classification task, produced significantly better results. Thus we stopped following this approach and moved on to a 1-NN ensemble.

In the 1-NN ensemble, we used a retrieval approach based on the AUEB NLP group model of 2021[21], which achieved first place in the concept detection subtask. First, we employed three different kinds of CNN encoders, a ResNet-50[22], a DenseNet-201[20], and an EfficientNet-B0[23], which made up our primary model. All encoders were pre-trained on ImageNet[24]. These encoders were fine-tuned on the training set for five epochs and the best weights were saved according to the loss values on the training set. In the next step, we trained five models for each encoder using the validation set and obtained 15 encoders; each of these encoders was trained for three epochs. We then used the trained encoders to obtain image embeddings of the training examples. Image embeddings were taken from the last average pooling layer of each encoder. To detect concepts in test images, we also compute image embeddings of the test images. After computing the similarity between training and test embeddings, we find the most similar training image for each test image. In the end, 15 training images are chosen, one per encoder. To assign concepts to a test image, we use a majority-voting mechanism: among the 15 chosen images, concepts that appear more than N times are assigned to the test image. After trying different values of N and evaluating the results with accuracy and F1 score, the best value for N turned out to be 8. Adding data augmentation to this model also improved its performance. Finally, after computing the similarity between image embeddings with different methods, cosine similarity gave better results than the other methods. Table 3 shows a number of validation set images with the ground truth and predicted concepts, along with the F1 score obtained by this approach.

4.2.2. Multi-Label Classification Method

In this method, we used the image concepts as labels and built a multi-label classification model. We used CNNs with ImageNet[24] pre-trained weights, removed their last layer, and added a classification layer. The output layer has 8374 units and uses sigmoid as the activation function. The final model was then fine-tuned on the target dataset. We experimented with different pre-trained models and different configurations. Models were compiled with Adam[25] as the optimizer. Results on the validation and test sets were generated after every five epochs. We tried different thresholds on the output layer's activation to classify the image concepts and used the F1 score to find the best one. Table 4 shows the details of the implemented MLC (Multi-Label Classification) methods. Figure 4 shows the architecture of the MLC-based concept detection model.
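A minimal Keras sketch of this MLC set-up (a simplification, not our exact training code): an ImageNet-pretrained backbone with its top removed, one sigmoid output per concept, Adam with binary cross-entropy, and a probability threshold applied at prediction time.

```python
# Multi-label concept classifier: one sigmoid unit per CUI, thresholded at prediction time.
import numpy as np
import tensorflow as tf

NUM_CONCEPTS = 8374                                   # distinct CUIs in the 2022 training data

base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                       pooling="avg", input_shape=(224, 224, 3))
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.5),                     # regularization used in models like v2.4
    tf.keras.layers.Dense(NUM_CONCEPTS, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="binary_crossentropy")

# model.fit(train_images, train_multi_hot, ...)       # multi-hot targets, one bit per CUI

def predict_concepts(images, cui_list, threshold=0.13):
    """Return, per image, the CUIs whose sigmoid probability exceeds the threshold."""
    probs = model.predict(images, verbose=0)
    return [[cui_list[j] for j in np.where(p >= threshold)[0]] for p in probs]
```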
Table 3: Examples of concept detection on validation images, with ground-truth and predicted concepts and the corresponding F1 scores [26, 27, 28, 29].

| Ground Truth | Prediction | F1 score |
| --- | --- | --- |
| C0024485 (Magnetic Resonance Imaging); C0346308 (Pituitary macroadenoma); C0205129 (Sagittal) | C0024485 (Magnetic Resonance Imaging); C0346308 (Pituitary macroadenoma); C0205129 (Sagittal) | 1.0 |
| C0041618 (Ultrasonography); C0016823 (Structure of fundus of eye); C0439828 (Variable (uniformity)); C0205396 (Identified); C0013938 (Embryo transfer (procedure)) | C0041618 (Ultrasonography); C0016823 (Structure of fundus of eye) | 0.571 |
| C1306645 (Plain x-ray); C0442808 (Increasing); C0032227 (Pleural effusion disorder); C0008034 (Chest Tubes) | C1306645 (Plain x-ray); C0817096 (Chest); C0032326 (Pneumothorax) | 0.285 |
| C1707489 (Connectivity); C0205202 (Corrected) | C0221198 (Lesion); C0024485 (Magnetic Resonance Imaging) | 0.0 |

Figure 4: Architecture of the MLC-based concept detection model.

Table 4: Description of our concept detection models

| Name | Base model | Regularization | Learning rate | Data augmentation |
| --- | --- | --- | --- | --- |
| v2.1 | ResNet50 | None | 0.001 | None |
| v2.2 | ResNet50 | None | 0.001 | CLAHE, equalizeHist, hflip, original |
| v2.3 | ResNet101 | None | 0.001 | None |
| v2.4 | ResNet50 | Dropout(0.5) | 0.001 | None |
| v2.5 | DenseNet121 | None | 0.001 | None |
| v3.1 | InceptionV3 | None | 0.009 | None |
| v3.2 | InceptionV3 | None | 0.009 | None |
| v3.3 | InceptionV3 | None | 0.009 | CLAHE, equalizeHist, random crop |
| v3.4 | InceptionV3 | L2 | 0.009 | equalizeHist, random crop |
| v3.5 | InceptionV3 | Dropout(0.5) | 0.009 | CLAHE, random crop |

4.3. Caption Prediction

For the caption prediction subtask, we studied the method implemented in [30], which achieved first place in last year's caption prediction challenge. That team's best approach was a multi-label classification model in which each word is considered a label: a classification model is trained to predict the words that will later make up the caption for the given image. Similar to the multi-label classification method used in the concept detection subtask, we used a CNN pre-trained on ImageNet[24] and fine-tuned it on the subtask training set to extract image features. For fine-tuning, the last layer of the CNN was removed, and a dropout layer, an activation layer, and a dense layer were added. We tried different CNN models and different configurations.

To generate a caption for an image, the model predicts the caption's constituent words. The probability of each word is calculated in the output layer using a sigmoid activation function, and the top N words with the highest probability are chosen. N is a hyper-parameter that defines the length of the captions. Different values of N in the range of 15 to 27 were tested on the validation set, and the best N for each model was chosen using the BLEU score [31]. Two methods were used to turn the generated words into full captions (the first of these is sketched in the code below):

1. Words are ordered from highest to lowest probability.
2. Words are ordered based on their statistics in the training set: each word is assigned to its most common position in the caption.

Overall, we focused on predicting the correct words rather than finding their correct order. Neither method was able to find the right order of the words, so the final prediction may not be grammatically correct; however, this approach predicted words well, which led to a high BLEU score. Table 5 shows the details of the classification models implemented for the caption generation subtask.
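A minimal sketch of the word-selection step (method 1: order by probability); `word_model` is assumed to be a trained MLC model with one sigmoid output per vocabulary word, and `vocab` the corresponding list of stemmed words:

```python
# Build captions by taking the N most probable words per image, highest probability first.
import numpy as np

def predict_captions(image_batch, word_model, vocab, n_words=26):
    probs = word_model.predict(image_batch, verbose=0)      # shape: (batch, |vocab|)
    captions = []
    for p in probs:
        top = np.argsort(p)[::-1][:n_words]                  # indices of the top-N words
        captions.append(" ".join(vocab[j] for j in top))
    return captions

# n_words is the caption-length hyper-parameter: values from 15 to 27 were tried on the
# validation set and the best one per model was chosen by BLEU (26 for our best run).
```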
Figure 5 shows the architecture of the MLC-based caption prediction models. Table 6 shows some examples of validation images with their ground-truth captions and the captions predicted by our best submission; their BLEU and ROUGE scores are also given in the table. Because we used stemming while creating our vocabulary, some words, like "image", appear stemmed in the final caption.

Figure 5: Architecture of the MLC-based caption prediction model.

Table 5: Description of our caption prediction models

| Name | CNN | Data Augmentation | Activation | Freeze CNN | Learning rate |
| --- | --- | --- | --- | --- | --- |
| v1.1 | ResNet50 | None | PReLU | No | 5e-4 |
| v1.2 | ResNet50 | CLAHE | PReLU | No | 5e-4 |
| v1.3 | ResNet50 | None | PReLU | Yes | 5e-4 |
| v1.4 | ResNet50 | None | ReLU | No | 5e-4 |
| v1.5 | ResNet50 | None | PReLU | Yes | 5e-4, decay every 2500 steps |

Table 6: Examples of caption prediction with different scores. In the first row the model achieved a good score, while in the second row it could not predict well [32, 33, 34, 35].

| Ground Truth | Prediction | BLEU | ROUGE |
| --- | --- | --- | --- |
| axial ct image of the neck with intravenous contrast at the level of the parotid gland show asymmetric left parotid gland enlargement with replacement by a soft tissue mass white arrow | arrow show ct axial imag tomographi enhanc scan comput left right mass muscl lesion enlarg gland red neck view contrast tissu soft demonstr white nerv tumor | 0.806 | 0.468 |
| horizontal section show bony deficit at implant site | axial show imag ct comput scan right tomographi lesion left patient arrow bone view cortic measur treatment fractur month later head area cbct plate section margin | 0.526 | 0.121 |
| chest xray show bilateral pneumonia | chest xray show left patient leav right arrow lung pleural effus hemithorax admiss radiograph lobe mediastin mass tip day mediastinum postop elev upper enlarg tube imag | 0.140 | 0.193 |
| result of roi extraction pixel | arrow show right imag panoram left maxillari radiograph lesion coron bone patient bilater impact mandibular molar later view side white cortic case area fractur sinus first | 0.0 | 0.0 |

5. Results

In this section, we review the results of the models implemented for the two subtasks, concept detection and caption prediction. In the concept detection subtask, the F1 score was used to evaluate the models and the ranking was based on this metric. A secondary F1 score is also reported, which is calculated using only a subset of manually validated concepts. In the caption prediction subtask, BLEU[31], ROUGE[36], METEOR[37], CIDEr[38], SPICE[39] and BERTScore[40] were used to evaluate the models. The ranking was based on the BLEU score; ROUGE scores were also reported during the competition, and the other metrics were reported after the challenge.

Before presenting the results of our models, we review the results obtained in this task over the last six years. Table 7 shows information about the size of the dataset, the number of concepts, and the results of the top three groups in each subtask[2, 10, 9, 8, 41, 42, 43, 44]. Note that the purpose of this table is to summarize statistical information about the task in recent years; because the datasets differ between years, it is not correct to compare the results of different years with each other.

Table 7: Information about the datasets and the results obtained in ImageCLEFmedical Caption over the last six years. The dataset section gives the number of training, validation, and test images, and the number of concepts per year.
The concept detection and caption prediction sections give the results of the top three groups in each subtask. As mentioned in the text, the purpose of this table is to show statistical information about the ImageCLEFmedical Caption task in recent years. The datasets of the years marked with * are different, so comparing the results of those years with each other does not provide accurate information[2, 10, 9, 8, 41, 42, 43, 44].

| Year | Train | Valid | Test | Concepts | Concept Detection F1 (1st / 2nd / 3rd) | Caption Prediction BLEU (1st / 2nd / 3rd) |
| --- | --- | --- | --- | --- | --- | --- |
| 2022* | 83275 | 7645 | 7601 | 8374 | 0.451 / 0.450 / 0.447 | 0.482 / 0.322 / 0.311 |
| 2021 | 2756 | 500 | 444 | 1586 | 0.505 / 0.468 / 0.419 | 0.509 / 0.461 / 0.431 |
| 2020* | 65753 | 15970 | 3534 | 3047 | 0.394 / 0.392 / 0.380 | - |
| 2019* | 56629 | 14157 | 10000 | 5528 | 0.282 / 0.265 / 0.223 | - |
| 2018 | 222305 | - | 10000 | 111156 | 0.110 / 0.009 / 0.050 | 0.250 / 0.179 / 0.172 |
| 2017 | 164614 | 10000 | 10000 | 20464 | 0.171 / 0.164 / 0.143 | 0.563 / 0.321 / 0.260 |

5.1. Concept Detection

We chose our best models based on the F1 score on the validation set, and the results with the highest scores were submitted. One of our submissions used the information retrieval method (submission v1) and the other nine submissions followed the multi-label classification approach. Table 4 lists the different MLC models. Table 8 shows our scores on the test set and the number of epochs and the threshold for each submitted model.

Table 8: IUST_NLPLAB concept detection submission details and test results. ES stands for "Extra submission"; these submissions were sent after the competition's deadline.

| Run ID | Name | Epochs | Threshold | F1 score | Secondary score |
| --- | --- | --- | --- | --- | --- |
| 181667 | v1 | - | - | 0.394 | 0.750 |
| 181948 | v2.1 | 16 | 0.1 | 0.281 | 0.355 |
| 182279 | v2.2 | 16 | 0.1 | 0.252 | 0.352 |
| 182280 | v2.3 | 12 | 0.1 | 0.255 | 0.352 |
| 182291 | v2.4 | 48 | 0.1 | 0.387 | 0.611 |
| 182292 | v2.5 | 4 | 0.12 | 0.242 | 0.332 |
| 182293 | v2.5 | 8 | 0.1 | 0.244 | 0.318 |
| 182302 | v2.3 | 48 | 0.1 | 0.243 | 0.305 |
| 182304 | v2.4 | 48 | 0.12 | 0.394 | 0.656 |
| 182307 | v2.4 | 48 | 0.13 | 0.398 | 0.673 |
| ES1 | v2.4 | 60 | 0.4 | 0.348 | 0.730 |
| ES2 | v2.4 | 96 | 0.25 | 0.411 | 0.785 |
| ES3 | v3.1 | 20 | 0.3 | 0.240 | 0.356 |
| ES4 | v3.2 | 20 | 0.4 | 0.397 | 0.668 |
| ES5 | v3.3 | 40 | 0.4 | 0.385 | 0.623 |
| ES6 | v3.4 | 40 | 0.4 | 0.302 | 0.634 |
| ES7 | v3.5 | 20 | 0.4 | 0.419 | 0.721 |

The best result among our ten official submissions had an F1 score of 0.398 and achieved 0.673 for the secondary F1. The secondary F1 score was calculated using only a subset of manually validated concepts (anatomy and image modality). This submission ranked 8th among all submitted group results. The system used ResNet50 as its base model with dropout[45] and no data augmentation. The information retrieval model had scores close to the best MLC methods, but in the end the MLC models ranked higher. Interestingly, the information retrieval model earned a higher secondary F1 than the MLC models. Although our information retrieval model produced good results among our different submissions, we submitted only one model of this type, because the information retrieval model was more complex than our MLC models and needed more time and resources to train. For example, the MLC models required approximately three hours to train on server 3, while the information retrieval model took seven days to train on server 2. We tried to improve this model with different methods, but we were unable to obtain final results due to time and resource constraints. We therefore decided to train simpler models with fewer parameters in both subtasks; this enabled a faster turn-around time and allowed us to iterate on more model-improvement ideas. After trying different models and configurations, we focused on different settings of v2.4 before the deadline, which produced the best result.
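Model and threshold selection above relies on a per-image F1 between the predicted and ground-truth concept sets. The helper below is consistent with the per-image values shown in Table 3, although the official evaluation script may differ in details such as how it averages over the whole test set:

```python
# Set-based F1 between predicted and ground-truth CUIs for a single image.
def image_f1(predicted, ground_truth):
    pred, gold = set(predicted), set(ground_truth)
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Second row of Table 3: two of the five ground-truth concepts were predicted -> 0.571.
print(round(image_f1({"C0041618", "C0016823"},
                     {"C0041618", "C0016823", "C0439828", "C0205396", "C0013938"}), 3))
```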
The last v2 submissions were based on this version (v2.4). The submission limit allowed us to submit only 10 runs, but we still had some results that had not been evaluated. After the submission deadline we asked the ImageCLEF organizers to evaluate a few more models for us, which they generously accepted. These extra models showed improvements in the concept detection results, both in F1 score and in secondary F1. Submission ES7, which used InceptionV3[46] as its base model and dropout for regularization, achieved a 0.419 F1 score, which was our best score. This system also used data augmentation techniques, including CLAHE and random crop. The best secondary F1 belongs to submission ES2; this system is like our best official model, but it was trained for 96 epochs with the threshold set to 0.25.

Table 9: Caption prediction submissions' details

| Run ID | Name | Epochs | N | Sorting method |
| --- | --- | --- | --- | --- |
| 181670 | v1.1 | 20 | 20 | 1 |
| 181951 | v1.2 | 10 | 27 | 1 |
| 182249 | v1.1 | 10 | 26 | 1 |
| 182250 | v1.3 | 10 | 26 | 1 |
| 182275 | v1.4 | 5 | 26 | 1 |
| 182290 | v1.5 | 10 | 25 | 2 |
| 182314 | v1.5 | 10 | 17 | 1 |
| 182315 | v1.4 | 5 | 26 | 2 |
| 182319 | v1.1 | 15 | 26 | 1 |
| 182327 | v1.5 | 15 | 15 | 1 |

5.2. Caption Prediction

Our MLC approach predicted captions well: all ten of our submissions had BLEU scores higher than 0.430 and ranked 1 to 10 on the released leaderboard. Different settings, such as a different activation function, could improve the BLEU score. Overall, choosing the parameter N, which determines the length of the predicted captions, was a trade-off between the BLEU score and the ROUGE score: a higher N resulted in a higher BLEU score and a lower ROUGE score, and vice versa. As the primary score in this competition was BLEU, we focused on settings that would give us the highest BLEU score. As mentioned in the Methods section, we used two different methods to sort the predicted words. In the first method, words were sorted from highest to lowest probability. For the second method, we studied the training set and noted the positions in which each word appears; we then tried sorting the generated words using this information, but it did not improve the scores. The first method gave better results, so we used it in most submissions. The best result was achieved by setting N to 26 and using ReLU as the activation function; in this submission, words were ordered from highest to lowest probability. Table 9 shows the details of the submitted runs and Table 10 shows our submission scores on the test set.

6. AIOps

Running a large number of experiments in a short amount of time with limited resources requires meticulous planning and operations. Given our limitations in time and resources, we believe our operations strategy played a significant role in our success. This section elaborates on what did and did not work for training models faster with fewer resources (GPU, CPU, RAM, and disk space).

6.1. Memory Optimization

The first challenge our team faced was running out of memory. One of the bottlenecks in artificial intelligence projects is large datasets, which do not fit into memory.
Table 10: Caption prediction submissions' test results

| Run ID | BLEU | ROUGE | METEOR | CIDEr | SPICE | BERTScore |
| --- | --- | --- | --- | --- | --- | --- |
| 181670 | 0.457 | 0.140 | 0.082 | 0.045 | 0.013 | 0.570 |
| 181951 | 0.474 | 0.138 | 0.092 | 0.026 | 0.006 | 0.554 |
| 182249 | 0.480 | 0.138 | 0.090 | 0.027 | 0.005 | 0.553 |
| 182250 | 0.482 | 0.139 | 0.089 | 0.026 | 0.005 | 0.557 |
| 182275 | 0.483 | 0.142 | 0.092 | 0.030 | 0.007 | 0.561 |
| 182290 | 0.481 | 0.142 | 0.091 | 0.031 | 0.014 | 0.567 |
| 182314 | 0.462 | 0.158 | 0.085 | 0.062 | 0.010 | 0.578 |
| 182315 | 0.480 | 0.142 | 0.093 | 0.030 | 0.013 | 0.570 |
| 182319 | 0.469 | 0.136 | 0.089 | 0.029 | 0.006 | 0.551 |
| 182327 | 0.440 | 0.162 | 0.083 | 0.071 | 0.013 | 0.574 |

Data generators can help here: they generate values lazily (on demand). It is not efficient, and sometimes not feasible, to load all the data into memory at once. Another benefit is that the model does not have to wait until all the data has been processed before using it. Generators save their internal state without holding the entire dataset in memory; when new data is requested, they continue from the previously saved state by providing the next batch of data, a small portion of the larger dataset, to the requestor. There are multiple ways to achieve this. We used Sequence[47] from the Keras API[48] because it is safer in multiprocessing environments. According to the Keras documentation[47]: "This structure guarantees that the network will only train once on each sample per epoch, which is not the case with generators." There are some pitfalls when using a generator. For example, using a mutable global variable in a data generator, which is called multiple times, can result in unexpected behavior, and some anti-patterns can invert the purpose of generators and cause incremental memory growth.

6.2. Faster Execution

To achieve faster code execution, we first need to determine why our code is time-consuming and which sections have the greatest impact on it. We used the TensorBoard Profiling tool[49] to analyze our model. Profiling is the study of hardware resource consumption, based on information gathered during the program's execution, to identify which parts of the program need optimization and how we can speed up the overall program while minimizing resources. After running the Profiler, TensorBoard[49] offers a visual representation of the gathered information, which consists of:

1. Recommendations for the next steps in model improvement. These suggestions range from determining whether the model is input-bound, to how much time is spent on kernel launch, to what percentage of the operations performed are 16- or 32-bit.
2. TensorFlow Stats, used to figure out which GPU operations take the longest; the most time-consuming parts should then be improved to see substantial changes in execution time.

To achieve higher throughput and better GPU utilization, we can raise the batch size sufficiently without exhausting resources and running out of memory (OOM). To prevent the model's accuracy from decreasing, we should scale the model by tuning hyperparameters. Parallel execution and multi-threading can also be used, depending on whether the tasks are CPU-bound or I/O-bound.

6.2.1. I/O-Bound

1. Pre-fetching and caching the data, at the cost of higher memory usage, will enhance the model's throughput: the input pipeline prepares the data for the next phase before the data is requested.
2. For I/O operations, multi-threading is highly recommended. It involves adding a new thread to an existing process, with memory shared among the threads. Because of the shared memory, we need to use locks to control access to shared data and prevent race conditions.
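As a concrete illustration of the Sequence-based generators discussed in Section 6.1, a minimal sketch is shown below; the image paths, label matrix, and preprocessing are placeholders rather than our actual input pipeline:

```python
# Minimal Keras Sequence: only one batch of images is held in memory at a time.
import math
import numpy as np
import tensorflow as tf

class ImageBatchSequence(tf.keras.utils.Sequence):
    def __init__(self, paths, labels, batch_size=32, target_size=(224, 224)):
        self.paths, self.labels = paths, labels
        self.batch_size, self.target_size = batch_size, target_size

    def __len__(self):
        return math.ceil(len(self.paths) / self.batch_size)

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        images = np.stack([
            tf.keras.utils.img_to_array(
                tf.keras.utils.load_img(p, target_size=self.target_size)) / 255.0
            for p in self.paths[sl]])
        return images, self.labels[sl]

# model.fit(ImageBatchSequence(train_paths, train_labels), epochs=5)
```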
6.2.2. CPU-Bound

Multiprocessing is used for intensive, CPU-limited tasks to achieve full CPU utilization. Each process has its own address space, and this approach is only applicable when we have multiple CPU cores. Inter-process communication (IPC) with pipes and queues is also possible. Since multiprocessing comes with full CPU utilization, it is ideal for jobs with the least amount of data but the most operations. One might think that "the more processes or threads we have, the faster it is", but this is not entirely accurate: the OS has to manage all of these processes, and when there are too many, scheduling overhead can reduce the overall speed.

6.3. Disk Usage Best Practices

Models are trained regularly, and to keep projects reproducible we should version our data and models so that they can be easily shared, compared, and repeatedly reconstructed in our experiments. These tasks become more manageable with a version control system.

• When choosing version control (VC), it should support both on-premises and remote cloud storage services (Azure, S3, GC).
• When facing a lack of storage, try to use symlinks instead of duplicate files.

We can also compress files that are not currently needed, freeing up some space while still keeping the files.

6.4. Hardware

Most of our development was done on a system with a GTX 1080 Ti GPU and another system with an RTX 2060 GPU, both provided by the computer engineering department. We also had limited access to an A100 GPU for the final runs, provided by the Simorgh Cloud.

Table 11: Information about the hardware used.

| Server ID | GPU Model | GPU Memory | Memory Type | CPU VCores | RAM | Disk Space | Duration (Days) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | GeForce RTX 2060 Super | 8 GB | GDDR6 | 6 | 24 GB | 200 GB | 10 |
| 2 | GeForce GTX 1080 Ti | 11 GB | GDDR5X | 6 | 32 GB | 200 GB | 20 |
| 3 | A100 | 40 GB | HBM2e | 8 | 96 GB | 200 GB | 7 |

7. Conclusion

This paper describes the participation of IUST_NLPLAB from the Iran University of Science and Technology in the ImageCLEF Caption 2022 task. In the Concept Detection subtask, we ranked 8th among 11 participating teams, using MLC and information retrieval approaches; our MLC methods with dropout achieved the better overall score. In the Caption Prediction subtask, all of our 10 submissions ranked higher than the other groups' submissions and we achieved first place. We performed multi-label classification in this subtask as well, using a classification model to predict the constituent words of a caption and then two different methods to assemble the caption. We hope to participate in this competition again in the future and achieve better results.

Acknowledgments

This work has been supported by the Simorgh Supercomputer - Amirkabir University of Technology under Contract No. ISI-DCE-DOD-Cloud-900808-1700. We also thank the School of Computer Engineering of Iran University of Science and Technology and Iran's National Elites Foundation (https://en.bmn.ir/) for supporting our participation in this competition.

References

[1] B. Ionescu, H. Müller, R. Peteri, J. Rückert, A. Ben Abacha, A. G. S. de Herrera, C. M. Friedrich, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, S. Kozlovski, Y. D. Cid, V. Kovalev, L.-D. Ştefan, M. G. Constantin, M. Dogariu, A. Popescu, J. Deshayes-Chossart,
Clark, Overview of the ImageCLEF 2022: Multimedia retrieval in medical, social media and nature applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 13th Interna- tional Conference of the CLEF Association (CLEF 2022), LNCS Lecture Notes in Computer Science, Springer, Bologna, Italy, 2022. [2] J. Rückert, A. Ben Abacha, A. García Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi- Yaghir, H. Schäfer, H. Müller, C. M. Friedrich, Overview of ImageCLEFmedical 2022 – caption prediction and concept detection, in: CLEF2022 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022. [3] A. García Seco de Herrera, R. Schaer, S. Bromuri, H. Müller, Overview of the ImageCLEF 2016 medical task, in: Working Notes of CLEF 2016 (Cross Language Evaluation Forum), 2016. [4] B. Ionescu, H. Müller, R. Péteri, Y. D. Cid, V. Liauchuk, V. Kovalev, D. Klimuk, A. Tarasau, A. B. Abacha, S. A. Hasan, V. Datla, J. Liu, D. Demner-Fushman, D.-T. Dang-Nguyen, L. Pi- ras, M. Riegler, M.-T. Tran, M. Lux, C. Gurrin, O. Pelka, C. M. Friedrich, A. G. S. de Herrera, N. Garcia, E. Kavallieratou, C. R. del Blanco, C. C. Rodríguez, N. Vasillopoulos, K. Karam- pidis, J. Chamberlain, A. Clark, A. Campello, ImageCLEF 2019: Multimedia retrieval in medicine, lifelogging, security and nature, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019), LNCS Lecture Notes in Computer Science, Springer, Lugano, Switzerland, 2019. [5] O. Bodenreider, The unified medical language system (umls): integrating biomedical terminology, Nucleic acids research 32 (2004) D267–D270. [6] O. Pelka, S. Koitka, J. Rückert, F. Nensa, C. M. Friedrich, Radiology objects in context (ROCO): a multimodal image dataset, in: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, Springer, 2018, pp. 180–189. [7] R. J. Roberts, Pubmed central: The genbank of the published literature, 2001. [8] O. Pelka, C. M. Friedrich, A. Seco De Herrera, H. Müller, Overview of the ImageCLEFmed 2019 concept detection task, in: CEUR Workshop Proceedings, volume 2380, CEUR Workshop Proceedings, 2019. [9] O. Pelka, C. M. Friedrich, A. García Seco de Herrera, H. Müller, Overview of the Image- CLEFmed 2020 concept prediction task, in: Proceedings of the CLEF 2020-Conference and labs of the evaluation forum, CONFERENCE, 22-25 September 2020, 2020. [10] O. Pelka, A. B. Abacha, A. G. S. de Herrera, J. Jacutprakart, C. M. Friedrich, H. Müller, Overview of the ImageCLEFmed 2021 concept & caption prediction task, in: CLEF2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS. org, Bucharest, Romania, 2021. [11] K. Kettunen, Can type-token ratio be used to show morphological complexity of languages?, Journal of Quantitative Linguistics 21 (2014) 223–245. [12] H. Jo, J. Baek, Case of pulmonary benign metastasizing leiomyoma from synchronous uterine leiomyoma in a postmenopausal woman, Gynecologic oncology reports 26 (2018) 33–36. [13] A. Seshachalam, S. Cyriac, N. Reddy, S. T. Gnana, Ataxia telangiectasia: Family manage- ment, Indian Journal of Human Genetics 16 (2010) 39. [14] A. Pereda, I. Garin, M. Garcia-Barcina, B. Gener, E. Beristain, A. M. Ibañez, G. Perez de Nanclares, Brachydactyly e: isolated or as a feature of a syndrome, Orphanet journal of rare diseases 8 (2013) 1–14. [15] S. R. Sudulagunta, M. B. Sodalagunta, H. Khorram, M. Sepehrar, J. 
Gonivada, Z. Noroozpour, N. Prasad, Autoimmune thyroiditis associated with neuromyelitis optica (nmo), GMS German Medical Science 13 (2015). [16] O. Patel, Y. P. Maravi, S. Sharma, A comparative study of histogram equalization based image enhancement techniques for brightness preservation and contrast enhancement, arXiv preprint arXiv:1311.4033 (2013). [17] K. Zuiderveld, Contrast limited adaptive histogram equalization, Graphics gems (1994) 474–485. [18] C. Xiang, L. Huang, L. Xia, Mobile chest x-ray manifestations of 54 deceased patients with coronavirus disease 2019: Retrospective study, Medicine 99 (2020). [19] J. Jacutprakart, F. P. Andrade, R. Cuan, A. A. Compean, G. Papanastasiou, A. G. S. de Her- rera, Nlip-essex-itesm at imageclefcaption 2021 task: deep learning-based information retrieval and multi-label classification towards improving medical image understanding, in: CLEF2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS. org, 2021. [20] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolu- tional networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708. [21] F. Charalampakos, V. Karatzas, V. Kougia, J. Pavlopoulos, I. Androutsopoulos, Aueb nlp group at imageclefmed caption tasks 2021, in: CLEF2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS. org, Bucharest, Romania, 2021. [22] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Pro- ceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [23] M. Tan, Q. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, in: International conference on machine learning, PMLR, 2019, pp. 6105–6114. [24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE conference on computer vision and pattern recognition, Ieee, 2009, pp. 248–255. [25] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014). [26] H. K. Rai, G. John, M. Anton, Atypical presentation of panhypopituitarism, Cureus 12 (2020). [27] R. Davar, S. M. Poormoosavi, F. Mohseni, S. Janati, Effect of embryo transfer depth on ivf/icsi outcomes: A randomized clinical trial, International Journal of Reproductive BioMedicine 18 (2020) 723. [28] R. B. Jazia, J. Ayachi, F. Chatbouri, A. Kacem, A. Faidi, D. B. Braiek, A. Maatallah, Unusual case of spontaneous hemopneumothorax in a tunisian pulmonology department: a case report, The Pan African Medical Journal 38 (2021). [29] M. T. van Kesteren, P. Rignanese, P. G. Gianferrara, L. Krabbendam, M. Meeter, Congru- ency and reactivation aid memory integration through reinstatement of prior knowledge, Scientific Reports 10 (2020) 1–13. [30] V. Castro, P. Pino, D. Parra, H. Lobel, Puc chile team at caption prediction: Resnet visual encoding and caption classification with parametric relu, in: CLEF2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS. org, Bucharest, Romania, 2021. [31] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318. [32] M. R. Povlow, M. Streiff, S. Madireddi, C. Jaramillo, A primary parotid mucosa-associated lymphoid tissue non-hodgkin lymphoma in a patient with sjogren syndrome, Cureus 13 (2021). [33] I. Trisnawati, R. El Khair, D. A. Puspitarani, A. R. 
Fauzi, et al., Prolonged nucleic acid conversion and false-negative rt-pcr results in patients with covid-19: A case series, Annals of Medicine and Surgery 59 (2020) 224–228. [34] L. Wimmer, P. Petrakakis, K. El-Mahdy, S. Herrmann, D. Nolte, Implant-prosthetic rehabil- itation of patients with severe horizontal bone deficit on mini-implants with two-piece design—retrospective analysis after a mean follow-up of 5 years, International Journal of Implant Dentistry 7 (2021) 1–14. [35] E. Choi, D. Kim, J.-Y. Lee, H.-K. Park, Artificial intelligence in detecting temporomandibular joint osteoarthritis on orthopantomogram, Scientific Reports 11 (2021) 1–7. [36] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81. [37] M. Denkowski, A. Lavie, Meteor universal: Language specific translation evaluation for any target language, in: Proceedings of the ninth workshop on statistical machine translation, 2014, pp. 376–380. [38] R. Vedantam, C. Lawrence Zitnick, D. Parikh, Cider: Consensus-based image description evaluation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575. [39] P. Anderson, B. Fernando, M. Johnson, S. Gould, Spice: Semantic propositional image caption evaluation, in: European conference on computer vision, Springer, 2016, pp. 382–398. [40] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, arXiv preprint arXiv:1904.09675 (2019). [41] A. García Seco de Herrera, C. Eickhof, V. Andrearczyk, H. Müller, Overview of the ImageCLEF 2018 caption prediction tasks, in: Working Notes of CLEF 2018-Conference and Labs of the Evaluation Forum (CLEF 2018), Avignon, France, September 10-14, 2018., volume 2125, CEUR Workshop Proceedings, 2018. [42] C. Eickhoff, I. Schwall, A. Garcia Seco De Herrera, H. Müller, Overview of ImageCLEFcap- tion 2017–image caption prediction and concept detection for biomedical images, CLEF 2017 working Notes 1866 (2017). [43] K. Dimitris, K. Ergina, Concept detection on medical images using deep residual learning network, Working Notes CLEF (2017). [44] Y. Zhang, X. Wang, Z. Guo, J. Li, Imagesem at imageclef 2018 caption task: Image retrieval and transfer learning, in: CLEF CEUR Workshop, Avignon, France, 2018. [45] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The journal of machine learning research 15 (2014) 1929–1958. [46] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826. [47] Tensorflow, Sequence class from keras api, https://www.tensorflow.org/api_docs/python/ tf/keras/utils/Sequence, 2022. Last Accessed: 2022-05-27. [48] K. API, Keras api documentation, https://keras.io/, 2022. Last Accessed: 2022-05-27. [49] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Joze- fowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Van- houcke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. 
URL: https://www.tensorflow.org/, last Accessed: 2022-05-27, Software available from tensorflow.org.