<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Predicting Captions and Detecting Concepts for Medical Images: Contributions of the DBS-HHU Team to ImageCLEFmedical Caption 2024</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Heiko</forename><surname>Kauschke</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Heinrich-Heine-Universität Düsseldorf</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Universitätsstraße</orgName>
								<address>
									<postCode>40225</postCode>
									<settlement>Düsseldorf</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kirill</forename><surname>Bogomasov</surname></persName>
							<email>bogomasov@hhu.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Heinrich-Heine-Universität Düsseldorf</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Universitätsstraße</orgName>
								<address>
									<postCode>40225</postCode>
									<settlement>Düsseldorf</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stefan</forename><surname>Conrad</surname></persName>
							<email>stefan.conrad@hhu.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Heinrich-Heine-Universität Düsseldorf</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Universitätsstraße</orgName>
								<address>
									<postCode>40225</postCode>
									<settlement>Düsseldorf</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Predicting Captions and Detecting Concepts for Medical Images: Contributions of the DBS-HHU Team to ImageCLEFmedical Caption 2024</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">AD46F86BDE860CC107FE300A9DA8B34C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:00+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Multi-Label-Classification</term>
					<term>Image Captioning</term>
					<term>Deep Learning</term>
					<term>CNN Ensemble</term>
					<term>Hierarchical Model</term>
					<term>GIT</term>
					<term>Image-CLEFmedical 2024 Caption</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes the work of the team DBS-HHU in the ImageCLEFmedical Caption 2024 in both sub-tasks, Concept Detection and Caption Prediction. The goal of the Concept Detection sub-task is to extract the correct UMLS terms from medical images, while Caption Prediction aims to generate descriptions for them. For both sub-tasks, images from the Radiology Objects in COntext Version 2 dataset are used. We preprocessed these images by removing their white borders and upscaling small images to improve the performance of our models. For Concept Detection we used two different architectures, the first being an ensemble of four different Convolutional Neural Networks (CNNs) and the second being a hierarchical model consisting of two CNNs. All models in this sub-task are compared using the 𝐹1-score. For Caption Prediction we experimented with two different versions of the GIT architecture. These were compared to other models using the BERTScore as the primary and ROUGE as the secondary metric. Our ensemble took first place in Concept Detection with an 𝐹1-score of 0.6374, while our GIT model placed tenth in Caption Prediction.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Analyzing and summarizing information derived from medical images, such as those produced in radiology, is a complex and time-intensive task requiring specialized expertise. This process often creates a bottleneck in clinical diagnosis workflows and therefore requires special attention.</p><p>As a result, there is a significant demand for automated methods that can translate visual data into concise textual descriptions. Improved knowledge of image features leads to better organized radiology scans, thereby enhancing the efficiency of radiologists in their interpretative work. Challenging tasks and unresolved issues in the field of visual analysis and interpretation often hold significant societal value and are rightfully of great interest to society, research, and industry. Medical imaging in particular is both demanding and valuable to interpret because of its informational content. This search for answers and solutions to challenging questions in image material is where ImageCLEF comes in. ImageCLEF is the multimedia retrieval lab of CLEF (Conference and Labs of the Evaluation Forum). Since 2004, ImageCLEFmedical has consisted of various tasks. ImageCLEF 2024 <ref type="bibr" target="#b0">[1]</ref> included, among other tasks, the ImageCLEFmedical 2024 Caption <ref type="bibr" target="#b1">[2]</ref> task, which took place for the eighth time. On the one hand, the fact that no satisfactory solution has been found in eight years (otherwise the challenge would be considered finished) suggests the complexity of the task. On the other hand, it indicates the significant interest of the research community in the problem, which piqued our interest. The task itself is split into two sub-tasks: Concept Detection and Caption Prediction. The first sub-task can be considered a multi-label classification problem. Each image is associated with at least one manually annotated Unified Medical Language System (UMLS) concept, which we will refer to as a concept or label throughout the subsequent discussion. These need to be detected and further applied for information retrieval or image analysis purposes. The second sub-task can be viewed as an image captioning problem. Each image has a caption, and the model is tasked with generating a comparable description of the image's content.</p><p>Below, we detail our observations, considerations, and experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Data</head><p>The annotated dataset ROCOv2 <ref type="bibr" target="#b2">[3]</ref> was provided by the ImageCLEFmedical organizers and used for both sub-tasks. For an example, see Table <ref type="table" target="#tab_0">1</ref>. The number of unique concepts was reduced from 2125 to 1945 in the training dataset and from 1945 to 1751 in the validation dataset. These were mainly concepts that were used very rarely. This was done by the organizers based on suggestions from last year's participants. Considering the distribution of the number of labels within the images of the training dataset, the majority of images can be assigned up to five labels, with the absolute majority having exactly two labels assigned (see Fig. <ref type="figure">2</ref>).</p><p>The frequency of the concepts varies greatly. When combining the validation and test sets, the most frequently occurring concept is 'C0040405'/'X-Ray Computed Tomography', used 27,852 times. Conversely, the least used concepts include 'C1962945'/'Radiographic imaging procedure', 'C1690005'/'MRI venography', 'C0243032'/'Magnetic Resonance Angiography', 'C0412650'/'Computed tomography of the cervical spine', 'C0011906'/'Differential Diagnosis', and 'C0202657'/'CT follow-up', each appearing only once (refer to Fig. <ref type="figure">3</ref>).</p><p>As for the caption prediction task, nothing changed in the way the captions were handled compared to last year. The captions have already been preprocessed, resulting in the absence of any links within them. The captions exhibit significant variation in length: the median caption length for the training set is 17 tokens, with the largest caption containing 633 tokens and the smallest containing only one token (refer to Fig. <ref type="figure" target="#fig_2">4</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>In this section, we will initially explore the preprocessing steps applied to the dataset, followed by an explanation of the various approaches utilized for the different tasks. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Preprocessing</head><p>After examining the dataset, we observed that many images feature a white border. Consequently, we decided to trim the white borders of all images in the dataset, as we did not anticipate that our networks could extract significant information from them. Additionally, during our data analysis, we noted that there are 1251 images with dimensions smaller than 300 × 300 pixels. Most images are considerably larger, with a mean of 646.69 × 593.50 and a median of 657 × 563 pixels. Experience has shown that an imbalance in sizes can negatively impact the performance of a deep learning architecture, so it was crucial to tackle this issue. This prompted us to consider leveraging a pre-trained network specialized in upscaling medical images. For this purpose, we utilized a feedback adaptive weighted dense network (FAWDN) <ref type="bibr" target="#b3">[4]</ref> <ref type="foot" target="#foot_0">1</ref>. The architecture is visualized in Figure <ref type="figure" target="#fig_3">5</ref>. Since FAWDN utilizes a strict feedback mechanism, its implementation is based on recurrent neural networks (RNNs), which means that the network consists of sub-networks equal in number to the time steps used. This feedback mechanism is used to produce better high-resolution images in each time step by correcting the errors from the preceding one. It also requires information to flow from the output to the input of the network. The network consists of input, hidden, and output units, whose parameters are shared across time steps. The hidden state receives the output of the previous hidden state and the current input state to enable a flow of information. A loss function is applied at every time step so that the hidden states contain information about the output image. An output image is created by adding the result of the output unit to a bilinearly upsampled version of the input image.</p><p>Ultimately, the image generated in the last time step is chosen as the final reconstructed high-resolution image. Another interesting aspect of the architecture is the design of the hidden unit. By applying FAWDN to the provided data, we created a new dataset in which small images were upscaled to twice their size. Particularly small images, with a size of 150 × 150 pixels and below, were upscaled to three times their size. This ensures that no classical upscaling methods are needed when using a random crop size of 224 × 224 for training. All of our models in both sub-tasks were trained on the new dataset. One concept detection run was uploaded which used the original dataset.</p></div>
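As a concrete illustration, the two preprocessing decisions above, border trimming and size-dependent upscaling factors, can be sketched as follows. The helper names and the whiteness threshold are our own assumptions rather than the authors' exact implementation, and the actual upscaling is performed by FAWDN, not by the factor function alone:

```python
import numpy as np
from PIL import Image

def trim_white_border(img: Image.Image, thresh: int = 245) -> Image.Image:
    """Crop away near-white margins; `thresh` is an assumed whiteness cutoff."""
    gray = np.asarray(img.convert("L"))
    content = gray < thresh                   # True where pixels are darker than the border
    rows = np.flatnonzero(content.any(axis=1))
    cols = np.flatnonzero(content.any(axis=0))
    if rows.size == 0 or cols.size == 0:      # completely white image: leave unchanged
        return img
    return img.crop((cols[0], rows[0], cols[-1] + 1, rows[-1] + 1))

def upscale_factor(width: int, height: int) -> int:
    """Enlargement factor per the rules in the text (exact size criterion assumed)."""
    if width <= 150 and height <= 150:
        return 3                              # very small images: triple the size
    if width < 300 and height < 300:
        return 2                              # small images: double the size
    return 1                                  # already large enough for a 224x224 crop
```

After trimming and upscaling, a random 224 × 224 crop always fits inside the image without any classical interpolation step.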
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Concept Detection</head><p>In recent years, Convolutional Neural Networks (CNNs) have achieved outstanding results in multi-label classification problems. However, to get the most out of these models, ensembles are commonly created. Several studies have demonstrated the benefits of ensemble methods for improving performance on computer vision tasks. By combining predictions from multiple models, variance can be reduced, generalization increased, and overall accuracy improved. In medical image analysis, ensemble learning helps to address the variability in annotations (caused by the inconsistency of annotators) and observer interpretations, and to build more robust diagnostic predictions <ref type="bibr" target="#b4">[5]</ref>. Finally, ensemble learning can also improve generalization across different datasets, which is particularly important in computer vision challenges, since the data commonly originates from different sources. The benefits of model ensembles also apply to this challenge, which is supported by the fact that last year's winning team relied on an ensemble model <ref type="bibr" target="#b5">[6]</ref>. Another way to leverage the strengths of multiple models is to build a complex model in a hierarchical way. In particular, this is often beneficial when working with imbalanced, distributed data.</p><p>In the following, we describe our two approaches: the CNN ensemble and the hierarchical model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">CNN Ensemble</head><p>The ensemble we built consisted of four different CNNs: ResNet152 <ref type="bibr" target="#b4">[5]</ref>, EfficientNetB0 <ref type="bibr" target="#b5">[6]</ref>, DenseNet201 <ref type="bibr" target="#b6">[7]</ref>, and Wide ResNet-101-2 <ref type="bibr" target="#b7">[8]</ref>. All models utilized pre-trained weights from ImageNet and were followed by different feed-forward neural networks (FFNNs) composed of fully connected layers, dropout layers, and ReLU layers. We re-trained each model separately with either binary cross-entropy or multi-label soft margin loss. During training, we normalized the images with the channel-wise mean and standard deviation of the used dataset and applied a random crop of size 224 × 224, a random horizontal flip with 50% probability, and random rotations of up to 10 ∘ as transformation steps. An Adam optimizer with an initial learning rate of 1 × 10 −4 was used, with the rate reduced when the loss reached a plateau. During training we monitored the F1-score on the validation set, so that after training we could select the model with the best metric score. Training was capped at 50 epochs, with early stopping employed to save computational time if the validation metric did not change by more than 5 × 10 −3 for ten consecutive epochs. For the final prediction, we used the union of the concepts predicted by the individual models, meaning every predicted concept was included; concepts predicted by more than one model were included only once. The architecture that demonstrated the best performance, achieving first place, is schematically depicted in Figure <ref type="figure" target="#fig_5">6</ref>.</p></div>
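The union-based fusion of the four models' outputs can be sketched in a few lines (a sketch of the fusion step only; the function name is ours):

```python
def ensemble_union(per_model_predictions):
    """Fuse concept predictions from several models by set union, so each
    concept appears at most once even if several models predicted it."""
    combined = set()
    for predictions in per_model_predictions:
        combined |= set(predictions)
    return sorted(combined)
```

For instance, if two models predict overlapping UMLS concepts, the shared concept appears only once in the fused output.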
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2.">Hierarchical Model</head><p>In this approach, we aimed to improve the design of last year's CNN+FFNN-based Multi-task Classifier from the AUEB NLP Group <ref type="bibr" target="#b5">[6]</ref> by expanding the architecture. We also hypothesized that utilizing the hierarchical relationship between concepts could lead to better results. To enhance last year's design, we used two separate backbones instead of two task-specific classification heads, as illustrated in Figure <ref type="figure" target="#fig_6">7</ref>. The backbones used are ResNet152 and the FFNN constructed as described in the previous subsection. One network is responsible for predicting the image modalities and the other for the remaining concepts, with the connection that the output of the modality network is concatenated into the network for the remaining concepts before its FFNN. The modality model is trained with cross-entropy loss and the other model with multi-label soft margin loss. To utilize all available images, we introduced an 'empty' label during training, because some images had no modality concept while others had a modality but no further concepts. However, we also experimented without the empty labels and discarded the images that would have been labeled as empty. The training parameters remained the same as in the previous approach. By implementing these modifications, we sought to leverage the strengths of both CNN and FFNN architectures as well as the hierarchical relationships between different concepts to improve the overall performance of the multi-task classifier.</p></div>
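A minimal PyTorch sketch of this wiring is given below. The backbones here are placeholder linear stacks rather than the actual ResNet152 and FFNN, and the layer sizes are illustrative; only the concatenation of the modality logits into the second head reflects the described design:

```python
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    """Two-backbone sketch: modality logits are concatenated with the
    features of the second backbone before its classification head."""
    def __init__(self, n_modalities: int, n_concepts: int, feat_dim: int = 512):
        super().__init__()
        # Placeholder backbones standing in for ResNet152 + FFNN.
        self.modality_backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.modality_head = nn.Linear(feat_dim, n_modalities)
        self.concept_backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.concept_head = nn.Linear(feat_dim + n_modalities, n_concepts)

    def forward(self, x):
        modality_logits = self.modality_head(self.modality_backbone(x))
        feats = self.concept_backbone(x)
        # The hierarchical connection: feed modality predictions to the concept head.
        concept_logits = self.concept_head(torch.cat([feats, modality_logits], dim=1))
        return modality_logits, concept_logits
```

Note that gradients from the concept loss do reach the modality head through the concatenation, but not the concept CNN through the modality branch, which matches the limitation discussed later.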
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Caption Prediction</head><p>The field of image captioning is currently dominated by transformer architectures, which are known to have exceptional capabilities when it comes to handling language tasks. Last year's competition also underlined the strength and variety of these kinds of architectures. This prompted us to experiment with a new transformer architecture to examine how it would perform in the medical context and whether it would yield any significant new results. The network we chose is called the Generative Image-to-text Transformer (GIT) <ref type="bibr" target="#b6">[7]</ref>. Its architecture is designed to handle both image/video captioning and visual question-answering tasks. Despite its versatile applications, GIT is fundamentally composed of an image encoder and a text decoder, as illustrated in Figure <ref type="figure" target="#fig_7">8</ref>. At a high level, GIT processes an image with the image encoder, transforming it into a 2D feature map that is then flattened into a list of features. An additional linear layer and a layernorm layer <ref type="bibr" target="#b7">[8]</ref> project these image features so that they can be used as input for the text decoder. Pre-training involves first using a contrastive task to pre-train the image encoder, followed by a generation task to pre-train both the image encoder and the text decoder. The choice of image encoder depends on the specific model variant. The original GIT model uses a Florence/CoSwin image encoder <ref type="bibr" target="#b8">[9]</ref>. We experimented with the GIT-base and GIT-large variants: GIT-base employs a CLIP/ViT-B/16 encoder <ref type="bibr" target="#b9">[10]</ref>, while GIT-large uses a CLIP/ViT-L/14 encoder <ref type="bibr" target="#b9">[10]</ref>. Another difference between these variants is the datasets used for pre-training. GIT-base is pre-trained on 10 million image-text pairs (4 million images), sourced from a combination of the COCO <ref type="bibr" target="#b10">[11]</ref>, SBU <ref type="bibr" target="#b11">[12]</ref>, CC3M <ref type="bibr" target="#b12">[13]</ref>, and VG <ref type="bibr" target="#b13">[14]</ref> datasets. GIT-large is pre-trained on 20 million image-text pairs (14 million images), comprising the 10 million image-text pairs from GIT-base supplemented with the CC12M <ref type="bibr" target="#b14">[15]</ref> dataset. The text decoder is consistent across all variants and consists of a transformer module with multiple transformer blocks, each containing a self-attention layer and a feed-forward layer. First, the text is tokenized and embedded with the same number of dimensions as the image features; then a positional embedding is added, followed by a layernorm layer. To finalize the input for the text decoder, the image features are concatenated with the text embeddings, with a BOS token between them. Starting from the BOS token, the decoder then predicts the next token auto-regressively until it emits the EOS token or reaches the maximum number of steps. The sequence-to-sequence attention mask is configured in such a way that a text token depends only on its preceding tokens and all image tokens, while image tokens can attend to each other. We fine-tuned the two variants in the same way, using an initial learning rate of 5 × 10 −5 for 50 epochs. We used AdamW as the optimizer with standard parameters and trained with 16-bit (mixed) precision instead of 32-bit training. Because of the model's size, we could not evaluate it during training, which is why we used the checkpoint obtained after the last epoch.</p><p>In our experiments, we aimed to leverage GIT's capabilities to generate meaningful and accurate medical image captions, hypothesizing that the transformer-based approach would enhance performance over traditional methods. The results of these experiments could provide insights into the applicability of advanced transformer architectures in the specialized field of medical image captioning, potentially setting a new benchmark for future research and applications.</p></div>
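The described sequence-to-sequence attention mask can be illustrated as follows (a sketch under our reading of the GIT design; `True` marks positions a query token may attend to):

```python
import torch

def git_attention_mask(n_image: int, n_text: int) -> torch.Tensor:
    """Image tokens attend to all image tokens; text tokens attend to all
    image tokens plus themselves and earlier text tokens (causal).
    Image tokens never attend to text tokens."""
    n = n_image + n_text
    mask = torch.zeros(n, n, dtype=torch.bool)      # text columns stay False for image rows
    mask[:, :n_image] = True                        # every token sees all image tokens
    mask[n_image:, n_image:] = torch.tril(          # causal mask over the text part
        torch.ones(n_text, n_text, dtype=torch.bool))
    return mask
```

Visualizing a small mask (e.g. two image tokens, three text tokens) makes the block structure apparent: a full block for image-to-image attention, a full block for text-to-image attention, and a lower-triangular block for text-to-text attention.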
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Evaluation</head><p>In this section, we present the results of our submissions and explain the metrics used for each sub-task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Concept Detection</head><p>For this task, the 𝐹 1 -score between the predictions and the ground truth is used as the primary evaluation metric. It is calculated by averaging the 𝐹 1 -scores over all images. The score for an image is calculated by creating multi-hot encoded vectors for the prediction and the ground truth and computing the harmonic mean of precision and recall. As a secondary metric, the 𝐹 1 -score is calculated against a ground truth set of manually validated concepts. We submitted three different versions of our ensemble model and two different versions of our hierarchical model. Our best model, which also won this year's challenge, is an ensemble trained on our preprocessed dataset using a multi-label soft margin loss (ID 603). Following this, the next best was our ensemble trained on the preprocessed dataset with binary cross-entropy (BCE) loss (ID 625), and then the ensemble trained on the original dataset with BCE loss (ID 604). Our proposed hierarchical models did not perform well. This is probably due to the way the information from the modality part of the model is fed into the model for the remaining classes. We initially suspected that the large number of empty labels led the model (ID 610) to primarily classify images as empty, but our run without the empty labels (ID 616) performed worse. The results can be seen in Table <ref type="table" target="#tab_2">2</ref>, with an additional comparison to our validation results.</p></div>
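The per-image score can be written out explicitly. This is a sketch: the official evaluator works on multi-hot vectors, which is equivalent to the set formulation below, and we assume a score of 1.0 when both prediction and ground truth are empty:

```python
def concept_f1(predicted, actual) -> float:
    """Harmonic mean of precision and recall on one image's concept sets."""
    pred, gold = set(predicted), set(actual)
    if not pred and not gold:
        return 1.0                       # assumed convention for empty/empty
    tp = len(pred & gold)                # true positives: concepts in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def mean_f1(predictions, ground_truth) -> float:
    """Challenge score: average the per-image F1 over all images."""
    scores = [concept_f1(p, g) for p, g in zip(predictions, ground_truth)]
    return sum(scores) / len(scores)
```

For example, predicting {A, B} against ground truth {B, C} yields precision 0.5, recall 0.5, and thus an F1 of 0.5 for that image.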
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Caption Prediction</head><p>The primary evaluation metric for this task is the BERTScore. As preprocessing for the evaluation, all captions were lowercased, had their punctuation removed, and had their numbers replaced by the token 'number', so that the focus of the evaluation lies on the linguistic content. The metric uses the contextualized word embeddings of the Microsoft/deberta-xlarge-mnli model. The BERTScore for a single sentence is calculated by matching each token in the candidate sentence to the most similar token in the reference sentence in terms of cosine similarity, and vice versa, to compute recall and precision, which are then combined into the 𝐹 1 score. The final score is the sum of all sentence scores divided by the number of captions. Since the BERTScore is more focused on imitating human judgment, the ROUGE score was used as a secondary metric. This metric is computed by counting which n-grams of one sentence can be found in the other and vice versa. This combination of a more human-oriented and a classical metric should give a good comparison between models. Beyond the primary and secondary metrics, additional metrics were calculated for further comparison, as seen in Table <ref type="table">3</ref>. We submitted two models for this task: a fine-tuned version of the GIT-base model and a fine-tuned version of the GIT-large model. Our best run, the GIT-large model, achieved tenth place. The performance difference between the GIT-large and GIT-base models is negligible, as indicated by a BERTScore difference of only 1 × 10 −10 .</p></div>
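The greedy matching step of BERTScore can be illustrated on toy embeddings. Real BERTScore uses contextual embeddings from the Microsoft/deberta-xlarge-mnli model; the unit-norm assumption below stands in for cosine normalization:

```python
import numpy as np

def greedy_bertscore_f1(candidate: np.ndarray, reference: np.ndarray) -> float:
    """candidate, reference: (num_tokens, dim) arrays of unit-norm token embeddings.
    Precision matches each candidate token to its most similar reference token,
    recall the reverse; the F1 combines the two."""
    sim = candidate @ reference.T            # cosine similarity matrix (unit vectors)
    precision = sim.max(axis=1).mean()       # best reference match per candidate token
    recall = sim.max(axis=0).mean()          # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)
```

With identical token embeddings on both sides, the score is exactly 1.0; dropping tokens from the candidate lowers recall while leaving precision untouched, which is the behavior the corpus-level average then aggregates over all captions.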
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>To conclude this paper, we summarize the insights gained from our experiments and their results, and propose ideas for possible future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Discussion</head><p>Starting with the concept detection sub-task: even though our ensemble approach performed very well, it requires a considerable amount of resources, since four different networks need to be trained. It also slows down the evaluation process, since an image must pass through all four networks. While very effective, it remains a time-intensive approach. Our hierarchical model did not perform well, likely due to a suboptimal network design. The modality information is only available to the FFNN and is not back-propagated to the CNN, which is why the model does not learn a connection between the modalities and their related concepts. Nevertheless, we remain convinced that an approach in this direction has the potential to achieve good results. This conviction stems from the fact that a model utilizing the concept hierarchy works with more information than just the images, which should confer an advantage.</p><p>As noted in the previous section, our models for the caption prediction sub-task performed virtually the same. Since both were trained for 50 epochs, both models may be equally overfitted. Both models were pre-trained on a large amount of non-medical data, which may cause them to have problems adapting to the comparatively small development dataset. GIT's strength seems to lie in its versatile use cases and not in its ability to perform highly specialised tasks like medical image captioning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Future Work</head><p>In the previous discussion, we highlighted the potential of hierarchical models for concept detection. A different method of transferring the information from the modality network into the network for the remaining classes could substantially improve the model. Another idea would be to split the model up further and use a sub-network for every modality: a modality network predicts the image modality, and this prediction determines to which network the image is passed next, so that the remaining concepts can be predicted.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Boxplots of the pixel width and height of the train dataset</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :Figure 3 :</head><label>23</label><figDesc>Figure 2: Distribution of the number of labels per image in the training dataset</figDesc><graphic coords="4,155.91,65.61,283.47,213.64" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Distribution of the number of tokens per caption in the training dataset</figDesc><graphic coords="5,155.91,65.61,283.47,223.46" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Architecture of the FAWDN[4]</figDesc><graphic coords="5,72.00,328.49,451.28,218.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Schema of the ensemble architecture</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Schema of the hierarchical architecture</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Schema of the Generative Image-to-text Transformer architecture, derived from the original [7]</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Example of an image with corresponding CUIs and caption from the ImageCLEFmedical 2024 caption task dataset.</figDesc><table><row><cell>Image</cell><cell>Concepts</cell><cell>Caption</cell></row><row><cell></cell><cell>• C0040405 (X-Ray Computed</cell><cell></cell></row><row><cell></cell><cell>Tomography) • C0332558 (Calcified nodule)</cell><cell>Sagittal view of the calcified nasal packing.</cell></row><row><cell></cell><cell>• C0028429 (Nose)</cell><cell></cell></row><row><cell>CC BY [Kelesidis et al. (2010)]</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Evaluation results: DBS-HHU Concept Detection Task Affiliation ID 𝐹 1 -score (Dev) 𝐹 1 -score (Test) 𝐹 1 -score manual</figDesc><table><row><cell></cell><cell cols="2">DBS-HHU 603</cell><cell>0.5969</cell><cell></cell><cell>0.6375</cell><cell></cell><cell>0.9534</cell><cell></cell><cell></cell></row><row><cell></cell><cell cols="2">DBS-HHU 625</cell><cell>0.5928</cell><cell></cell><cell>0.6309</cell><cell></cell><cell>0.9488</cell><cell></cell><cell></cell></row><row><cell></cell><cell cols="2">DBS-HHU 604</cell><cell>0.5938</cell><cell></cell><cell>0.6269</cell><cell></cell><cell>0.9461</cell><cell></cell><cell></cell></row><row><cell></cell><cell cols="2">DBS-HHU 610</cell><cell>0.3300</cell><cell></cell><cell>0.3417</cell><cell></cell><cell>0.4477</cell><cell></cell><cell></cell></row><row><cell></cell><cell cols="2">DBS-HHU 616</cell><cell>0.2332</cell><cell></cell><cell>0.3413</cell><cell></cell><cell>0.4340</cell><cell></cell><cell></cell></row><row><cell cols="2">Table 3</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="5">DBS-HHU: Best run on the Caption Prediction Task</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>Team</cell><cell cols="9">BERTScore (Dev) BERTScore (Test) ROUGE BLEU-1 BLEURT METEOR CIDEr CLIPScore RefCLIPScore</cell></row><row><cell>DBS-HHU</cell><cell>0.5917</cell><cell>0.5769</cell><cell>0.1531</cell><cell>0.1493</cell><cell>0.2710</cell><cell>0.0559</cell><cell>0.0644</cell><cell>0.7842</cell><cell>0.7750</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">code available at https://github.com/Lihui-Chen/FAWDN, last visited: 24.05.2024</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of ImageCLEF 2024: Multimedia retrieval in medical applications</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-M</forename><surname>Drăgulinescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bloch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brüngel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Idrissi-Yaghir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">S</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M G</forename><surname>Pakull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Damm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bracke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-G</forename><surname>Andrei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Prokopchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Karpenka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radzhabov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kovalev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Macaire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schwab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lecouteux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Esperança-Rodier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yetisgen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Hicks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Riegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Thambawita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Storås</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Halvorsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Heinrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kiesel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024)</title>
		<title level="s">Springer Lecture Notes in Computer Science LNCS</title>
		<meeting><address><addrLine>Grenoble, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of ImageCLEFmedical 2024 - Caption Prediction and Concept Detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ben Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Seco De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bloch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brüngel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Idrissi-Yaghir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bracke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Damm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M G</forename><surname>Pakull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">S</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF2024 Working Notes, CEUR Workshop Proceedings</title>
				<meeting><address><addrLine>Grenoble, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bloch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brüngel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Idrissi-Yaghir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">S</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Koitka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Pelka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Horn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Nensa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
		<idno type="DOI">10.1038/s41597-024-03496-6</idno>
		<ptr target="https://arxiv.org/abs/2405.10004v1" />
		<title level="m">ROCOv2: Radiology Objects in COntext version 2, an updated multimodal image dataset</title>
				<imprint>
			<publisher>Scientific Data</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A trusted medical image super-resolution method based on feedback adaptive weighted dense network</title>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Jeon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Anisetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.artmed.2020.101857</idno>
		<ptr target="https://doi.org/10.1016/j.artmed.2020.101857" />
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence in Medicine</title>
		<imprint>
			<biblScope unit="volume">106</biblScope>
			<biblScope unit="page">101857</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Analyzing inter-reader variability affecting deep ensemble learning for covid-19 detection in chest radiographs</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rajaraman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sornapudi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">O</forename><surname>Alderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">R</forename><surname>Folio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Antani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PloS one</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page">e0242301</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">AUEB NLP group at ImageCLEFmedical Caption</title>
		<author>
			<persName><forename type="first">P</forename><surname>Kaliosis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Moschovis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Charalampakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Androutsopoulos</surname></persName>
		</author>
		<idno>CEUR-WS.org</idno>
		<ptr target="https://ceur-ws.org/Vol-3497/paper-126.pdf" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">M</forename><surname>Aliannejadi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Vlachos</surname></persName>
		</editor>
		<meeting><address><addrLine>Thessaloniki, Greece</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023-09-18">September 18th to 21st, 2023</date>
			<biblScope unit="volume">3497</biblScope>
			<biblScope unit="page" from="1524" to="1548" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">GIT: A generative image-to-text transformer for vision and language</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Gan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=b4tMhpN0JC" />
	</analytic>
	<monogr>
		<title level="j">Transactions on Machine Learning Research</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Ba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1607.06450</idno>
		<title level="m">Layer normalization</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-L</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Codella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2111.11432</idno>
		<title level="m">Florence: A new foundation model for computer vision</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="8748" to="8763" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Microsoft COCO: Common objects in context</title>
		<author>
			<persName><forename type="first">T.-Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Maire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Belongie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hays</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Perona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ramanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollár</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Zitnick</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-10602-1_48</idno>
	</analytic>
	<monogr>
		<title level="m">European Conference on Computer Vision</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="740" to="755" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Im2text: Describing images using 1 million captioned photographs</title>
		<author>
			<persName><forename type="first">V</forename><surname>Ordonez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kulkarni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Berg</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Shawe-Taylor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Zemel</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Bartlett</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Pereira</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Weinberger</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="volume">24</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning</title>
		<author>
			<persName><forename type="first">P</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Goodman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Soricut</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:51876975" />
	</analytic>
	<monogr>
		<title level="m">Annual Meeting of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Visual genome: Connecting language and vision using crowdsourced dense image annotations</title>
		<author>
			<persName><forename type="first">R</forename><surname>Krishna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Groth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kravitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kalantidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Shamma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Bernstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11263-016-0981-7</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Vision</title>
		<imprint>
			<biblScope unit="volume">123</biblScope>
			<biblScope unit="page" from="32" to="73" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts</title>
		<author>
			<persName><forename type="first">S</forename><surname>Changpinyo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Soricut</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:231951742" />
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="3557" to="3567" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
