<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">AEHRC CSIRO at ImageCLEFmed Caption 2021</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Aaron</forename><surname>Nicolson</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Australian e-Health Research Centre</orgName>
								<orgName type="institution">Commonwealth Scientific and Industrial Research Organisation</orgName>
								<address>
									<addrLine>Herston 4006</addrLine>
									<region>Queensland</region>
									<country key="AU">Australia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jason</forename><surname>Dowling</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Australian e-Health Research Centre</orgName>
								<orgName type="institution">Commonwealth Scientific and Industrial Research Organisation</orgName>
								<address>
									<addrLine>Herston 4006</addrLine>
									<region>Queensland</region>
									<country key="AU">Australia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Bevan</forename><surname>Koopman</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Australian e-Health Research Centre</orgName>
								<orgName type="institution">Commonwealth Scientific and Industrial Research Organisation</orgName>
								<address>
									<addrLine>Herston 4006</addrLine>
									<region>Queensland</region>
									<country key="AU">Australia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">AEHRC CSIRO at ImageCLEFmed Caption 2021</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">9ACB55D958AD074C0418A6A9DC3A28D9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:43+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Medical Image Captioning</term>
					<term>Diagnostic Captioning</term>
					<term>Medical Images</term>
					<term>Image Captioning</term>
					<term>Multi-modal</term>
					<term>Sequence-to-sequence</term>
					<term>Vision Transformer</term>
					<term>PubMedBERT</term>
					<term>Multi-label Classification</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We describe our participation in the ImageCLEFmed Caption task of 2021. The task required participants to automatically compose coherent captions for a set of medical images. To this end, we employed a sequence-to-sequence model for caption generation, where its encoder and decoder were initialised with pre-trained Transformer checkpoints. In addition, we investigated the use of Self-Critical Sequence Training (SCST), which offered a marginal improvement, and pre-training on five external medical image datasets. Overall, our approach was kept intentionally general so that it might be applied to tasks other than medical image captioning. AEHRC CSIRO placed third amongst the participating teams in terms of BLEU score, with a score 0.078 lower than that of the first-placed participant. Our best-performing submission had the simplest configuration: it did not use SCST or pre-training on any of the external datasets. An overview of ImageCLEFmed Caption 2021 is available at: https://www.imageclef.org/2021/medical/caption.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>For ImageCLEFmed Caption 2021, teams were tasked with developing systems that could automatically produce coherent captions for the entirety of a medical image <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. To succeed, a system must identify not only medical concepts but also their interplay. A system that can achieve this could improve the efficiency of radiologists' interpretation. An example of an image and its ground truth caption for ImageCLEFmed Caption 2021 is provided in Figure <ref type="figure" target="#fig_1">1</ref>. Typical medical image captioning approaches make use of either a sequence-to-sequence (seq2seq) model to generate a caption for a medical image <ref type="bibr" target="#b2">[3]</ref>, or image retrieval, where it is assumed that similar images have similar captions <ref type="bibr" target="#b3">[4]</ref>. For our submissions, only a seq2seq model was considered.</p><p>While medical data has many unique characteristics, general-purpose Natural Language Processing (NLP) and Computer Vision (CV) methods have proven effective in many domain-specific medical tasks. In NLP, for example, general-domain self-supervised pre-training strategies, such as the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives used to produce Bidirectional Encoder Representations from Transformers (BERT) <ref type="bibr" target="#b4">[5]</ref>, have been successfully adapted to medical text. One instance is PubMedBERT <ref type="bibr" target="#b5">[6]</ref>, a Transformer encoder <ref type="bibr" target="#b6">[7]</ref> pre-trained on PubMed articles using domain-specific self-supervised pre-training strategies. Another example is the use of Transfer Learning (TL) to significantly improve medical image classification accuracy on small datasets <ref type="bibr" target="#b7">[8]</ref>. 
Here, a portion of a Convolutional Neural Network (CNN) trained on ImageNet 2012 <ref type="bibr" target="#b8">[9]</ref> (a general-domain image classification dataset) is fine-tuned on the small amount of data available for the medical image classification task.</p><p>A number of more recent NLP and CV machine learning techniques have not been investigated on medical data. One such approach for sequence generation is the use of pre-trained Transformer checkpoints to initialise both the encoder and the decoder of a seq2seq model <ref type="bibr" target="#b9">[10]</ref>. Another is the Vision Transformer (ViT), a pre-trained Transformer checkpoint for image classification that takes 16×16 patches of the image as input <ref type="bibr" target="#b10">[11]</ref>. Building on this, a pre-trained ViT checkpoint was paired with a Transformer decoder to form a seq2seq model for image captioning <ref type="bibr" target="#b11">[12]</ref>.</p><p>Motivated by previous adaptations of general-domain NLP and CV machine learning techniques to medical data, and by the slew of recent techniques that have not yet been investigated on medical data, we investigate a seq2seq model for medical image captioning that employs a pre-trained ViT checkpoint as the encoder and the pre-trained PubMedBERT checkpoint as the decoder. We also experiment with various pre-training and fine-tuning strategies, such as additionally pre-training the encoder on a multi-label medical image classification task, as well as fine-tuning the seq2seq model with Self-Critical Sequence Training (SCST) <ref type="bibr" target="#b12">[13]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Task Description</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Dataset</head><p>The focus for ImageCLEFmed Caption 2021 was to use real medical images and have participants develop automated systems to predict natural language captions; evaluation was performed by comparing the predicted captions to the annotations provided by medical doctors (i.e. the ground truth captions). Each example from the dataset consisted of a medical image and its associated ground truth caption, as shown in Figure <ref type="figure" target="#fig_1">1</ref>. The training, validation, and test sets comprise 2,756, 500, and 444 examples, respectively. Henceforth, we refer to the ImageCLEFmed Caption 2021 dataset as the task's dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Metrics</head><p>Each caption (predicted and ground truth) was pre-processed as follows: the caption was first converted to lower-case; all punctuation was then removed and the caption was tokenized into its individual words; stopwords were removed using NLTK's English stopword list (NLTK v3.2.2); finally, stemming was applied using NLTK's Snowball stemmer (NLTK v3.2.2). The score for a caption was then calculated as the average of its BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores <ref type="bibr" target="#b13">[14]</ref>. Note that each caption was always treated as a single sentence, even if it contained several sentences, and no smoothing function was used. All scores were summed and averaged over the number of captions, giving the final score. One downside of using BLEU for medical image caption evaluation is that it is a word-overlap measure and may not capture clinical correctness, as noted in <ref type="bibr" target="#b3">[4]</ref>.</p></div>
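The evaluation pipeline above can be sketched in Python. To keep the sketch self-contained, a small inline stopword list stands in for NLTK's English list and stemming is omitted; `preprocess`, `bleu_n`, and `caption_score` are our illustrative names, not the official evaluation code.

```python
import math
import string
from collections import Counter

STOPWORDS = {"the", "of", "a", "is", "and", "in"}  # stand-in for NLTK's English list

def preprocess(caption):
    """Lower-case, strip punctuation, tokenize, and drop stopwords (stemming omitted)."""
    caption = caption.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in caption.split() if w not in STOPWORDS]

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_n(candidate, reference, n):
    """BLEU-n: brevity penalty times the geometric mean of clipped 1..n-gram precisions."""
    precisions = []
    for k in range(1, n + 1):
        cand, ref = ngrams(candidate, k), ngrams(reference, k)
        total = sum(cand.values())
        if total == 0:
            return 0.0
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / total)
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / n)

def caption_score(predicted, ground_truth):
    """Average of BLEU-1 to BLEU-4, each caption treated as a single sentence."""
    cand, ref = preprocess(predicted), preprocess(ground_truth)
    return sum(bleu_n(cand, ref, n) for n in (1, 2, 3, 4)) / 4
```

In the actual evaluation, the per-caption scores are then averaged over all captions to give the final score.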
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>The nine submissions for the AEHRC CSIRO team are described in Figure <ref type="figure">2</ref>. In this section, we describe the model architecture, the pre-training and fine-tuning strategy, as well as the external pre-training datasets for each submission. Figure <ref type="figure">2</ref> helps the reader identify the stages of pre-training and fine-tuning for the encoder, decoder, and the seq2seq model that were used for each submission, along with the epoch chosen for each stage.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Model</head><p>The same model architecture was used for each submission. An overview of the model is shown in Figure <ref type="figure" target="#fig_2">3</ref>. In terms of architecture, the encoder is identical to ViT <ref type="bibr">[11,</ref> ViT-Base, Table <ref type="table" target="#tab_1">1</ref>] and the decoder is identical to PubMedBERT <ref type="bibr" target="#b5">[6]</ref>. Both ViT and PubMedBERT use 12 hidden layers, each with a hidden size of 768, an intermediate size of 3,072 [7, see 𝑑_{ff} in Section 3.3], and 12 scaled dot-product attention heads. Next, we describe the medical image pre-processing, followed by the encoder and decoder.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Medical image pre-processing:</head><p>A given medical image 𝑋 ∈ ℝ^{𝐶×𝑊×𝐻} (where 𝐶, 𝑊, and 𝐻 denote the number of channels, the width, and the height of the medical image, respectively) is first resized using bilinear interpolation so that its smallest side is 416 pixels. Next, the resized image is cropped to a size of ℝ^{3×384×384} (the size required by ViT), with the crop location random during training and centred during testing. The cropped image is then split into a set of non-overlapping patches, each of size ℝ^{3×16×16} (i.e., 576 non-overlapping patches). Each patch is then flattened into a one-dimensional array of size ℝ^{768}. A colour depth of 8 bits was used for the images (images with a higher colour depth were downsampled to 8 bits).</p></div>
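The cropping and patch-extraction steps above can be sketched with NumPy (the bilinear resize is omitted, as it would require an image library; `centre_crop` and `patchify` are our illustrative names):

```python
import numpy as np

def centre_crop(image, size=384):
    """Centre-crop a (C, W, H) image to (C, size, size); at test time the crop is centred."""
    _, w, h = image.shape
    top, left = (w - size) // 2, (h - size) // 2
    return image[:, top:top + size, left:left + size]

def patchify(image, patch=16):
    """Split a (3, 384, 384) image into 576 non-overlapping 16x16 patches,
    each flattened into a 768-dimensional array (3 * 16 * 16)."""
    c, w, h = image.shape
    grid_w, grid_h = w // patch, h // patch
    patches = image.reshape(c, grid_w, patch, grid_h, patch)
    patches = patches.transpose(1, 3, 0, 2, 4)  # -> (grid_w, grid_h, C, patch, patch)
    return patches.reshape(grid_w * grid_h, c * patch * patch)

x = np.random.rand(3, 416, 440)   # a resized image whose smallest side is 416
cropped = centre_crop(x)          # (3, 384, 384)
patches = patchify(cropped)       # (576, 768), ready for the patch projection
```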
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Encoder:</head><p>The set of inputs given to the encoder consists of the projection of each patch and the [CLS] embedding. The patch projections are formed by passing each flattened patch through a learnable projection matrix: 𝑊_{patch}^{encoder} ∈ ℝ^{768×768}. The [CLS] embedding is learnt using matrix 𝑊_{class}^{encoder} ∈ ℝ^{1×768} (where [CLS] is the classification token whose corresponding output is fed to a classification head during pre-training). The corresponding output for the [CLS] embedding forms an aggregate representation over all patches. Before the set of inputs is given to the first ViT hidden layer, a position embedding is added to each element of the set. There are 577 position embeddings for the encoder, with position "0" reserved for the [CLS] embedding and positions "1" to "576" reserved for the patch projections (which provide information about the location of each patch within the medical image). Each position embedding is stored in a learnable matrix: 𝑊_{position}^{encoder} ∈ ℝ^{577×768}. The weights for the encoder (including its embeddings and the patch projection) are initialised using one of the pre-trained ViT checkpoints from <ref type="bibr" target="#b10">[11]</ref>. This checkpoint was pre-trained on ImageNet21k <ref type="bibr" target="#b14">[15]</ref> and then subsequently on ImageNet 2012 <ref type="bibr" target="#b8">[9]</ref>. We also investigate additionally pre-training this checkpoint on a multi-label medical image classification task; we denote the result the Medical Image Transformer (MIT). The multi-label medical image classification task comprises four datasets, as described in Section 3.2.1. The pre-trained checkpoint of either ViT or MIT is used to initialise the encoder for the submissions in Figure <ref type="figure">2</ref>, where a tick in the "MIT" column indicates that MIT was used over ViT. 
Moreover, submission identifiers using the pre-trained ViT checkpoint are labelled in Figure <ref type="figure">2</ref> starting with "vit", while those using a pre-trained MIT checkpoint are labelled starting with "mit".</p></div>
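As a rough sketch of how the encoder's input sequence is assembled (with randomly initialised stand-ins for the learnt matrices; a real ViT loads pre-trained values for all three):

```python
import numpy as np

rng = np.random.default_rng(0)

# Learnable parameters (randomly initialised here; ViT loads pre-trained values)
W_patch = rng.normal(size=(768, 768))     # patch projection matrix
W_class = rng.normal(size=(1, 768))       # [CLS] embedding
W_position = rng.normal(size=(577, 768))  # position 0 for [CLS], 1-576 for patches

def encoder_inputs(flat_patches):
    """Project the 576 flattened patches, prepend [CLS], add the position embeddings."""
    projected = flat_patches @ W_patch             # (576, 768)
    tokens = np.concatenate([W_class, projected])  # (577, 768)
    return tokens + W_position                     # one position embedding per element

x = rng.normal(size=(576, 768))  # flattened patches from the pre-processing step
inputs = encoder_inputs(x)       # (577, 768), fed to the first ViT hidden layer
```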
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Decoder:</head><p>The weights of the decoder (along with its embeddings) are initialised using the pre-trained PubMedBERT checkpoint <ref type="bibr" target="#b5">[6]</ref>. We classify PubMedBERT as a Medical Report Transformer (MRT), a Transformer that has been pre-trained on medical text (in this case, medical literature) and is suitable for generating medical reports. The output of the last hidden layer of the encoder is fed to each decoder hidden layer via a randomly initialised multi-head cross-attention module, which is inserted between the masked multi-head self-attention module and the Feedforward Neural Network (FNN) module of each layer [7, Section 3.1, Decoder]. PubMedBERT has a vocabulary size of 30,522, comprising subword units. When feeding a subword unit to the decoder, it is first converted to its corresponding token index, and then subsequently into a token embedding. Each token embedding is stored in learnable matrix 𝑊_{token}^{decoder} ∈ ℝ^{30,522×768}. Next, a position and a segment embedding are added to the token embedding. The position embedding indicates the location of the subword within the caption. A maximum of 512 positions are used for PubMedBERT, with each position embedding stored in learnable matrix 𝑊_{position}^{decoder} ∈ ℝ^{512×768}. As only one caption is generated per medical image (even though PubMedBERT is pre-trained using two segments), only the embedding for segment "0" is used. Each segment embedding is stored in learnable matrix 𝑊_{segment}^{decoder} ∈ ℝ^{2×768}. When generating a caption, the token [BOS] (beginning of sentence) is first fed to the decoder to output the first subword of the caption. Caption generation finishes once the decoder generates the [EOS] token. Each submission used PubMedBERT, as shown by the submission identifiers in Figure <ref type="figure">2</ref> (i.e. 
"vit2mrt" and "mit2mrt"), where "mrt" indicates that the decoder is an MRT (in this case, PubMedBERT). During testing, the maximum number of subwords that the decoder could generate was set to 128. Beam search was also used, with a beam size of eight. Additionally, any n-gram of size three was allowed to occur at most once.</p></div>
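These decoding settings map naturally onto the keyword arguments of Hugging Face transformers' `generate` method; the mapping below is our assumption, as the paper does not specify its implementation in these terms:

```python
# Test-time decoding settings, expressed as Hugging Face transformers-style
# `generate` keyword arguments (our mapping, not the original implementation):
generation_kwargs = {
    "max_length": 128,          # at most 128 subwords per caption
    "num_beams": 8,             # beam search with a beam size of eight
    "no_repeat_ngram_size": 3,  # any trigram may occur at most once
}
```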
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Pre-training and fine-tuning</head><p>Next, we describe the pre-training and fine-tuning strategies for the submissions. Here, fine-tuning refers to training on the task's dataset. Pre-training refers to training on other, external datasets we selected; this was done before the fine-tuning stage. Teacher Forcing (TF) with categorical cross-entropy loss was used to fine-tune each seq2seq model on the task's dataset <ref type="bibr" target="#b15">[16]</ref>. We also investigate additionally fine-tuning the seq2seq models (which have already been fine-tuned using TF) with Self-Critical Sequence Training (SCST) <ref type="bibr" target="#b12">[13]</ref>. Submissions that were fine-tuned with TF and then SCST have a tick in the "SCST" column of Figure <ref type="figure">2</ref>.</p><p>For pre-training the seq2seq models, we used the Radiology Objects in COntext (ROCO) medical image captioning dataset (described in Section 3.2.1) with TF, before fine-tuning on the task's dataset. A tick in the "ROCO" column of Figure <ref type="figure">2</ref> signifies that this stage of pre-training was conducted for a submission.</p><p>The AdamW optimiser <ref type="bibr" target="#b16">[17]</ref> was used for gradient descent optimisation during pre-training and fine-tuning. A learning rate of 1e-7 was used for fine-tuning on the task's dataset with SCST. A learning rate of 5e-5 and a linear warm-up of 10,000 training steps from a learning rate of zero were used when pre-training the MITs, when pre-training on ROCO with TF, and when fine-tuning on the task's dataset with TF. All other hyperparameters for AdamW were set to their defaults. For the pre-training strategy in <ref type="bibr" target="#b10">[11]</ref>, L2 regularisation (with a term of 0.9) helped to prevent overfitting the ViT during pre-training. 
Motivated by this, we investigated L2 regularisation for pre-training only (i.e., for pre-training the MITs and when pre-training on ROCO using TF). L2 regularisation terms were investigated with two of the submissions, as shown by the legend in Figure <ref type="figure">2</ref>.</p><p>A mini-batch size of 64 was used to pre-train each MIT, to pre-train on ROCO with TF, and to fine-tune on the task's dataset with TF. A mini-batch size of eight was used for fine-tuning on the task's dataset with SCST. For epoch selection at each stage of pre-training and fine-tuning in Figure <ref type="figure">2</ref>, early stopping with a patience of five was used. The validation micro F1 score was the monitored metric for early stopping with each MIT, and the validation BLEU score (BLEU is described in Section 2.2) was the monitored metric for early stopping with the seq2seq models. For submission vit2mrt-0.1.1_5_e131, the early stopping criterion was not enforced until epoch 50, as the BLEU score did not increase from zero until after this epoch. When fine-tuning on the task's dataset using SCST, the maximum number of subwords that the decoder could generate was set to 32 due to memory restrictions. Moreover, greedy search was used (i.e., a beam size of 1) when generating the baseline for SCST <ref type="bibr" target="#b12">[13]</ref>.</p></div>
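The linear warm-up schedule and the SCST objective described above can be sketched as follows. The constant rate after warm-up and the exact form of the loss are our assumptions (the text specifies only the warm-up, and SCST uses the reward of the greedily decoded caption as the baseline for a sampled caption, e.g. their BLEU scores):

```python
import numpy as np

def learning_rate(step, peak=5e-5, warmup=10_000):
    """Linear warm-up from zero over 10,000 steps; behaviour after warm-up
    (held constant here) is our assumption."""
    return peak * min(1.0, step / warmup)

def scst_loss(log_probs_sampled, reward_sampled, reward_greedy):
    """Self-Critical Sequence Training: weight the sampled caption's log-likelihood
    by its reward advantage over the greedy (baseline) caption, negated so that
    minimising the loss maximises the expected reward."""
    advantage = reward_sampled - reward_greedy
    return -advantage * float(np.sum(log_probs_sampled))
```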
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">Pre-training datasets</head><p>A number of external medical image datasets were used to pre-train the MIT, so that knowledge about medical images could be leveraged when fine-tuning on the task's dataset. Specifically, four medical image multi-label classification datasets were identified; these are shown in Table <ref type="table" target="#tab_1">1</ref>. While CheXpert and MURA included validation sets, test sets were not available for any of the four datasets. We refrain from using these validation sets, as the CheXpert validation set has been used as a test set previously <ref type="bibr" target="#b17">[18]</ref>. Instead, 5% of each training set was selected and removed to form a validation set. Together, the datasets have 325 classes, 482,197 training examples, and 25,377 validation examples. • PadChest includes 160,828 chest X-rays obtained from 67,000 patients of the San Juan Hospital (Spain) from 2009 to 2017 <ref type="bibr" target="#b18">[19]</ref>. Each X-ray has an associated report produced by a radiologist. From these reports, labels were extracted manually by trained physicians (for 27% of the X-rays) and automatically (for 73% of the X-rays) using a supervised method. The labels covered six different position views, 174 different radiographic findings, 19 differential diagnoses, and 104 anatomic locations. The labels were then organised into a hierarchical taxonomy and mapped to Unified Medical Language System (UMLS) Concept Unique Identifier (CUI) codes. For our work, the considered 254 classes were derived from the labelCUIS, LocalizationsCUIS, Modality_DICOM, and ViewPosition_DICOM labels described in <ref type="bibr" target="#b18">[19,</ref><ref type="bibr">Table 11]</ref>, where the Digital Imaging and Communications in Medicine (DICOM) fields were extracted from the X-ray. We found that 33 of the images were corrupt and thus excluded them. • CheXpert is a large dataset of chest X-ray studies <ref type="bibr" target="#b17">[18]</ref>. 
The studies were performed between October 2002 and July 2017.</p><p>An automatic system was used to extract 14 observations from the associated radiology reports (no finding, enlarged cardiomediastinum, cardiomegaly, lung lesion, lung opacity, edema, consolidation, pneumonia, atelectasis, pneumothorax, pleural effusion, pleural other, fracture, and support devices). Each observation was rated as either positive, uncertain, or negative, thus resulting in 42 classes. <ref type="foot" target="#foot_2">3</ref>• ChestX-ray14 contains 86,524 chest X-rays (collected from 1992 to 2015) concerned with common thorax diseases <ref type="bibr" target="#b19">[20]</ref>. It consists of 15 labels (no finding, atelectasis, cardiomegaly, effusion, infiltration, mass, nodule, pneumonia, pneumothorax, consolidation, edema, emphysema, fibrosis, pleural thickening, and hernia). The labels were mined from the associated radiological reports of the X-rays. <ref type="foot" target="#foot_3">4</ref>• The MURA training set comprises 36,808 musculoskeletal radiographs from 13,457 studies of 11,184 patients, where each study is manually labelled by radiologists as either normal or abnormal <ref type="bibr" target="#b20">[21]</ref>. Each radiograph concerns one of seven sections of the body (elbow, finger, hand, humerus, forearm, shoulder, or wrist), each classed as either normal or abnormal, resulting in 14 classes in total. Unlike the previous datasets, MURA is a multi-class classification task. <ref type="foot" target="#foot_4">5</ref>The ROCO dataset was used to pre-train the seq2seq models before fine-tuning, as depicted in Figure <ref type="figure">2</ref>. ROCO comprises image-caption pairs from PubMed Central articles, where compound, multi-pane, and non-radiology images were removed using an automatic system <ref type="bibr" target="#b21">[22]</ref>. The ROCO training and validation sets contain 65,450 and 8,180 examples, respectively. 
The ROCO dataset contains several medical imaging modalities, including computed tomography, ultrasound, X-ray, fluoroscopy, positron emission tomography, mammography, magnetic resonance imaging, and angiography. <ref type="foot" target="#foot_5">6</ref></p></div>
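As an illustration of how the 42 CheXpert-derived classes described above arise (14 observations, each rated positive, uncertain, or negative), a multi-hot classification target could be constructed as follows; the index convention is our assumption, not the paper's:

```python
OBSERVATIONS = [
    "no finding", "enlarged cardiomediastinum", "cardiomegaly", "lung lesion",
    "lung opacity", "edema", "consolidation", "pneumonia", "atelectasis",
    "pneumothorax", "pleural effusion", "pleural other", "fracture",
    "support devices",
]
RATINGS = ["positive", "uncertain", "negative"]

def encode_labels(findings):
    """Map {observation: rating} pairs to a 42-dim multi-hot target
    (14 observations x 3 ratings; the index convention is our assumption)."""
    target = [0] * (len(OBSERVATIONS) * len(RATINGS))  # 42 classes
    for observation, rating in findings.items():
        index = OBSERVATIONS.index(observation) * len(RATINGS) + RATINGS.index(rating)
        target[index] = 1
    return target

y = encode_labels({"cardiomegaly": "positive", "edema": "uncertain"})
```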
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results and discussion</head><p>The BLEU scores for each submission on the validation and test sets of the task's dataset are shown in Table <ref type="table" target="#tab_3">2</ref>. <ref type="foot" target="#foot_6">7</ref> Submission vit2mrt-0.1.1_5_e131 attained the highest test score. This submission had the simplest configuration: no regularisation, no SCST, no ROCO pre-training, and no MIT pre-training. This indicates that the additional steps considered for the other configurations were not suited to the task's dataset. One possible explanation is the small size of the task's dataset.</p><p>Another observation is the large discrepancy between the validation and test scores. This indicates either a significant difference between the examples of the two sets or that the submissions were overfitted to the training and/or validation sets. Additionally, the validation score was an inconsistent predictor of which submission would achieve the highest test score. While submission vit2mrt-0.1.1_5_e131 attained the highest test score, it was outperformed by multiple submissions in terms of validation score. In fact, submission vit2mrt-0.1.3_5_e3 attained the highest validation score, with a configuration that employed ROCO pre-training and SCST, but no regularisation or MIT pre-training.</p><p>Using L2 regularisation during pre-training had a negative impact on performance, with both mit2mrt-0.1.7_1_e0 and mit2mrt-0.1.8_1_e1 producing worse validation and test scores than mit2mrt-0.1.5_1_e1. A regularisation term of 0.9 (mit2mrt-0.1.8_1_e1) attained higher validation and test scores than a term of 0.5 (mit2mrt-0.1.7_1_e0); however, this could be due to submission mit2mrt-0.1.8_1_e1 completing more epochs of fine-tuning, as shown in Figure <ref type="figure">2</ref>.</p><p>The impact of the medical image multi-label classification task for MIT pre-training was inconclusive. Comparing submission vit2mrt-0.1.1_5_e131 to mit2mrt-0.1.9_1_e138, using an MIT produced a higher validation score but a lower test score than using a ViT. Conversely, employing an MIT over a ViT resulted in a lower validation score but a higher test score when comparing submission vit2mrt-0.1.3_5_e3 to mit2mrt-0.1.5_1_e1. It should be emphasised that the medical images that largely make up the medical image multi-label classification task are chest X-rays, whereas those in the task's dataset, as well as ROCO, are more varied in modality and location. This suggests that the datasets used for the medical image multi-label classification task were not suited to the subsequent stages of pre-training and fine-tuning depicted in Figure <ref type="figure">2</ref>.</p><p>The impact of using ROCO to pre-train the seq2seq models before fine-tuning was also inconclusive. Comparing submission vit2mrt-0.1.1_5_e131 to vit2mrt-0.1.2_3_e91, ROCO pre-training increased the validation score and decreased the test score. However, pre-training on ROCO substantially reduced the number of epochs until convergence during fine-tuning. Note that medical image captioning datasets derived from PubMed Central articles have previously been criticised for their significant amount of noise <ref type="bibr" target="#b3">[4]</ref>. This could mean that pre-training on ROCO may be harmful to performance.</p><p>SCST has been effective for medical image captioning previously <ref type="bibr" target="#b2">[3]</ref>. Here, we note that SCST is sensitive and that stable training can be difficult to attain. Comparing submission vit2mrt-0.1.1_5_e131 to vit2mrt-0.1.4_2_e0, SCST substantially decreased the validation and test scores. This was likely due to the learning rate being too high for this configuration. Conversely, SCST improved both the validation and test scores when comparing submission vit2mrt-0.1.2_3_e91 to vit2mrt-0.1.3_5_e3, indicating that the learning rate was suitable for this configuration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>For ImageCLEFmed Caption 2021, the performance of submission vit2mrt-0.1.1_5_e131 placed the AEHRC CSIRO team third, with a score 0.078 lower than that of the first-placed participant. This indicates that utilising pre-trained Transformer checkpoints to initialise the encoder and decoder of a seq2seq model is a promising approach for this task. However, the impact of the selected pre-training data was unclear; pre-training the seq2seq model with ROCO produced inconclusive results. Instead of ROCO, an image-caption dataset derived from real medical images and their associated radiologists' reports, such as MIMIC-CXR <ref type="bibr" target="#b22">[23]</ref>, is recommended.</p><p>The impact of the medical image multi-label classification task was also inconclusive, as its medical images were likely too dissimilar to those from ROCO and the task's dataset. The impacts of SCST and L2 regularisation were clearer, with SCST providing a small improvement when configured correctly and the L2 regularisation terms used resulting in a decrease in performance. In future work, we aim to conduct a more thorough investigation of the proposed approach, to better adapt it to medical image captioning. At the same time, our overall approach has been intentionally kept general so that it might be applied to tasks other than medical image captioning.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), September 21-24, Bucharest, Romania. aaron.nicolson@csiro.au (A. Nicolson); jason.dowling@csiro.au (J. Dowling); bevan.koopman@csiro.au (B. Koopman). ORCID: 0000-0002-7163-1809 (A. Nicolson); 0000-0001-9349-2275 (J. Dowling); 0000-0001-5577-3391 (B. Koopman). (a) Medical image "This image is a transverse evaluation of the bladder and right ureteral jet. 
Renal ultrasound studies also include evaluation of the ureterovesical junction through Color Flow Doppler study of fluid movement of the ureteral jet. " (b) Ground truth caption</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Training example synpic100306 from the ImageCLEFmed Caption 2021 dataset: (a) the medical image and (b) its ground truth caption. The task was to develop an automated system that, given the medical image, could predict the provided reference caption.</figDesc><graphic coords="2,89.29,84.19,197.92,128.18" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Transformer-based seq2seq model for medical image captioning. The red elements depict the general path from medical image to the predicted caption. The encoder for each submission was either ViT or MIT.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>325 classes, 482,197 training examples, and 25,377 validation examples. Given the number of classes, the weights of the classification head for the pre-trained ViT checkpoint were replaced with the randomly-initialised learnable weight matrix 𝑊_{head}^{encoder} ∈ ℝ^{768×325}, before pre-training on the medical image multi-label classification task to form MIT. The number of classes and the number of examples for each dataset are detailed in the table</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>The stages of pre-training and fine-tuning for each submission. The seq2seq model is formed by combining ViT with PubMedBERT or MIT with PubMedBERT. The order of training is from left-to-right; the MIT encoder is formed by additionally pre-training ViT on the medical image multi-label classification task described in Section 3.2.1; pre-training on the ROCO dataset (described in Section 3.2.1) with TF occurs before fine-tuning with TF; fine-tuning with SCST occurs after fine-tuning with TF. The epoch for each stage (signified by "e" followed by the epoch number) is selected using early stopping, as described in Section 3.2. Note that the epoch specified in each submission identifier for the final stage of training is offset by one. The selected epochs for ViT and PubMedBERT are not given, as these are publicly available pre-trained checkpoints. L2 regularisation was used during the MIT pre-training stage and for the ROCO pre-training stage with two of the submissions, where the regularisation terms used are indicated in the legend in the top-left corner.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell>Pre-training</cell><cell></cell><cell cols="2">Fine-tuning</cell></row><row><cell>L2 regularisation</cell><cell>Encoder</cell><cell></cell><cell>Decoder</cell><cell></cell><cell>Seq2seq</cell><cell></cell></row><row><cell>0.5</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>0.9</cell><cell>ViT</cell><cell>MIT</cell><cell>PubMedBERT</cell><cell>ROCO (TF)</cell><cell>Task (TF)</cell><cell>Task (SCST)</cell></row><row><cell>vit2mrt-0.1.1_5_e131</cell><cell>✓</cell><cell></cell><cell>✓</cell><cell></cell><cell>✓ (e132)</cell><cell></cell></row><row><cell>vit2mrt-0.1.2_2_e46</cell><cell>✓</cell><cell></cell><cell>✓</cell><cell>✓ (e14)</cell><cell>✓ (e47)</cell><cell></cell></row><row><cell>vit2mrt-0.1.2_3_e91</cell><cell>✓</cell><cell></cell><cell>✓</cell><cell>✓ (e26)</cell><cell>✓ (e92)</cell><cell></cell></row><row><cell>vit2mrt-0.1.3_5_e3</cell><cell>✓</cell><cell></cell><cell>✓</cell><cell>✓ (e26)</cell><cell>✓ (e92)</cell><cell>✓ (e4)</cell></row><row><cell>vit2mrt-0.1.4_2_e0</cell><cell>✓</cell><cell></cell><cell>✓</cell><cell></cell><cell>✓ (e147)</cell><cell>✓ (e1)</cell></row><row><cell>mit2mrt-0.1.5_1_e1</cell><cell>✓</cell><cell>✓ (e29)</cell><cell>✓</cell><cell>✓ (e31)</cell><cell>✓ (e98)</cell><cell>✓ (e2)</cell></row><row><cell>mit2mrt-0.1.7_1_e0</cell><cell>✓</cell><cell>✓ (e2)</cell><cell>✓</cell><cell>✓ (e37)</cell><cell>✓ (e77)</cell><cell>✓ (e1)</cell></row><row><cell>mit2mrt-0.1.8_1_e1</cell><cell>✓</cell><cell>✓ (e2)</cell><cell>✓</cell><cell>✓ (e27)</cell><cell>✓ (e116)</cell><cell>✓ (e2)</cell></row><row><cell>mit2mrt-0.1.9_1_e138</cell><cell>✓</cell><cell>✓ (e29)</cell><cell>✓</cell><cell></cell><cell>✓ (e139)</cell><cell></cell></row><row><cell>Figure 2:</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Summary of the number of examples in the training and validation splits for the four medical image multi-label classification datasets, as well as the number of classes.</figDesc><table><row><cell>Dataset</cell><cell>Training split</cell><cell>Validation split</cell><cell>Classes</cell></row><row><cell>PadChest</cell><cell>152,787</cell><cell>8,041</cell><cell>254</cell></row><row><cell>CheXpert</cell><cell>212,244</cell><cell>11,170</cell><cell>42</cell></row><row><cell>ChestX-ray14</cell><cell>82,198</cell><cell>4,326</cell><cell>15</cell></row><row><cell>MURA</cell><cell>34,968</cell><cell>1,840</cell><cell>14</cell></row><row><cell>Total</cell><cell>482,197</cell><cell>25,377</cell><cell>325</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>The CheXpert training set contains 223,414 chest radiographs of 65,240 patients from the Stanford Hospital.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Results for each submission of the AEHRC CSIRO team for the ImageCLEFmed Caption 2021 task. The submissions are sorted by their test scores. The best validation and test scores are in boldface.</figDesc><table><row><cell>Submission identifier</cell><cell>Val. BLEU</cell><cell>Test BLEU</cell></row><row><cell>vit2mrt-0.1.1_5_e131</cell><cell>0.836</cell><cell>0.432</cell></row><row><cell>mit2mrt-0.1.5_1_e1</cell><cell>0.856</cell><cell>0.430</cell></row><row><cell>vit2mrt-0.1.3_5_e3</cell><cell>0.860</cell><cell>0.426</cell></row><row><cell>vit2mrt-0.1.2_3_e91</cell><cell>0.842</cell><cell>0.423</cell></row><row><cell>mit2mrt-0.1.9_1_e138</cell><cell>0.844</cell><cell>0.419</cell></row><row><cell>mit2mrt-0.1.8_1_e1</cell><cell>0.830</cell><cell>0.416</cell></row><row><cell>vit2mrt-0.1.4_2_e0</cell><cell>0.736</cell><cell>0.415</cell></row><row><cell>mit2mrt-0.1.7_1_e0</cell><cell>0.817</cell><cell>0.405</cell></row><row><cell>vit2mrt-0.1.2_2_e46</cell><cell>0.821</cell><cell>0.388</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The corrupt filenames are available at: https://github.com/anicolson/supplementary/blob/main/padchest_corrupt.txt</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">PadChest is available at: http://bimcv.cipf.es/bimcv-projects/padchest/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">CheXpert is available at: https://stanfordmlgroup.github.io/competitions/chexpert/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">ChestX-ray14 is available at: https://nihcc.app.box.com/v/ChestXray-NIHCC</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">MURA is available at: https://stanfordmlgroup.github.io/competitions/mura/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">ROCO is available at: https://github.com/razorx89/roco-dataset</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">Note that submission vit2mrt-0.1.2_2_e46 was a preliminary submission where the incorrect epoch was selected with early stopping due to a rounding error of the monitored metric score.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of the ImageCLEFmed 2021 concept &amp; caption prediction task</title>
		<author>
			<persName><forename type="first">O</forename><surname>Pelka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ben Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>García Seco De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jacutprakart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS</title>
				<meeting><address><addrLine>Bucharest, Romania</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of the ImageCLEF 2021: Multimedia retrieval in medical, nature, internet and social media applications</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>P'eteri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ben Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sarrouti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Demner-Fushman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kozlovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Liauchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dicente</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kovalev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Pelka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jacutprakart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Berari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tauteanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fichou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Brie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dogariu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">D</forename><surname>Ştefan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Constantin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chamberlain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Campello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">A</forename><surname>Oliver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Moustahfid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Popescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Deshayes-Chossart</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021)</title>
		<title level="s">Lecture Notes in Computer Science (LNCS)</title>
		<meeting><address><addrLine>Bucharest, Romania</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Clinically Accurate Chest X-Ray Report Generation</title>
		<author>
			<persName><forename type="first">G</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-M</forename><forename type="middle">H</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mcdermott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Boag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-H</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Szolovits</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ghassemi</surname></persName>
		</author>
		<ptr target="http://proceedings.mlr.press/v106/liu19a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th Machine Learning for Healthcare Conference</title>
				<editor>
			<persName><forename type="first">F</forename><surname>Doshi-Velez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Fackler</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Jung</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Kale</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Ranganath</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Wallace</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Wiens</surname></persName>
		</editor>
		<meeting>the 4th Machine Learning for Healthcare Conference<address><addrLine>PMLR, Ann Arbor, Michigan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">106</biblScope>
			<biblScope unit="page" from="249" to="269" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A Survey on Biomedical Image Captioning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kougia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Androutsopoulos</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/W19-1803</idno>
		<ptr target="https://www.aclweb.org/anthology/W19-1803" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on Shortcomings in Vision and Language, Association for Computational Linguistics</title>
				<meeting>the Second Workshop on Shortcomings in Vision and Language, Association for Computational Linguistics<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="26" to="36" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1423</idno>
		<ptr target="https://www.aclweb.org/anthology/N19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tinn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lucas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Usuyama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Naumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Poon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">arXiv preprint</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Attention is All You Need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS&apos;17</title>
				<meeting>the 31st International Conference on Neural Information Processing Systems, NIPS&apos;17<address><addrLine>Red Hook, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Curran Associates Inc</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="6000" to="6010" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Deep convolutional neural network based medical image classification for disease diagnosis</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Yadav</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Jadhav</surname></persName>
		</author>
		<idno type="DOI">10.1186/s40537-019-0276-2</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Big Data</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">ImageNet Large Scale Visual Recognition Challenge</title>
		<author>
			<persName><forename type="first">O</forename><surname>Russakovsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Krause</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Satheesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Karpathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Khosla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bernstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Berg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11263-015-0816-y</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Vision</title>
		<imprint>
			<biblScope unit="volume">115</biblScope>
			<biblScope unit="page" from="211" to="252" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Leveraging Pre-trained Checkpoints for Sequence Generation Tasks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rothe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Severyn</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00313</idno>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="264" to="280" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">arXiv preprint</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">CPTR: Full Transformer Network for Image Captioning</title>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">arXiv preprint</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Self-Critical Sequence Training for Image Captioning</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Rennie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Marcheret</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mroueh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Goel</surname></persName>
		</author>
		<idno type="DOI">10.1109/cvpr.2017.131</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Bleu: a Method for Automatic Evaluation of Machine Translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
		<ptr target="https://www.aclweb.org/anthology/P02-1040" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Philadelphia, Pennsylvania, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">ImageNet-21K Pretraining for the Masses</title>
		<author>
			<persName><forename type="first">T</forename><surname>Ridnik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ben-Baruch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Noy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zelnik-Manor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">arXiv preprint</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A Learning Algorithm for Continually Running Fully Recurrent Neural Networks</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zipser</surname></persName>
		</author>
		<idno type="DOI">10.1162/neco.1989.1.2.270</idno>
	</analytic>
	<monogr>
		<title level="j">Neural Computation</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="270" to="280" />
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Decoupled Weight Decay Regularization</title>
		<author>
			<persName><forename type="first">I</forename><surname>Loshchilov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=Bkg6RiCqY7" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison</title>
		<author>
			<persName><forename type="first">J</forename><surname>Irvin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rajpurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ciurea-Ilcus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chute</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Marklund</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Haghgoo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ball</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shpanskaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Seekins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Mong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Halabi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">K</forename><surname>Sandberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">B</forename><surname>Larson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">P</forename><surname>Langlotz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">N</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Lungren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
		<idno type="DOI">10.1609/aaai.v33i01.3301590</idno>
	</analytic>
	<monogr>
		<title level="m">Association for the Advancement of Artificial Intelligence (AAAI)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="590" to="597" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">PadChest: A large chest x-ray image dataset with multi-label annotated reports</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bustos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pertusa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-M</forename><surname>Salinas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>De La Iglesia-Vayá</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.media.2020.101797</idno>
		<ptr target="https://www.sciencedirect.com/science/article/pii/S1361841520301614" />
	</analytic>
	<monogr>
		<title level="j">Medical Image Analysis</title>
		<imprint>
			<biblScope unit="volume">66</biblScope>
			<biblScope unit="page">101797</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-Rays</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Summers</surname></persName>
		</author>
		<idno type="DOI">10.1109/cvpr.2018.00943</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">MURA: Large Dataset for Abnormality Detection in Musculoskeletal Radiographs</title>
		<author>
			<persName><forename type="first">P</forename><surname>Rajpurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Irvin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bagul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Laird</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L</forename><surname>Ball</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Langlotz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shpanskaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Lungren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">arXiv preprint</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Radiology Objects in COntext (ROCO): A Multimodal Image Dataset</title>
		<author>
			<persName><forename type="first">O</forename><surname>Pelka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Koitka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Nensa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m" type="main">Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis</title>
		<editor>D. Stoyanov, Z. Taylor, S. Balocco, R. Sznitman, A. Martel, L. Maier-Hein, L. Duong, G. Zahnd, S. Demirci, S. Albarqouni, S.-L. Lee, S. Moriconi, V. Cheplygina, D. Mateus, E. Trucco, E. Granger, P. Jannin</editor>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>Springer International Publishing</publisher>
			<biblScope unit="page" from="180" to="189" />
			<pubPlace>Cham</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">E W</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">J</forename><surname>Pollard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Berkowitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">R</forename><surname>Greenbaum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Lungren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">G</forename><surname>Mark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Horng</surname></persName>
		</author>
		<idno type="DOI">10.1038/s41597-019-0322-0</idno>
	</analytic>
	<monogr>
		<title level="j">Scientific Data</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
