<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">UMass at ImageCLEF Caption Prediction 2018 Task</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yupeng</forename><surname>Su</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Worcester Polytechnic Institute</orgName>
								<address>
									<postCode>01609</postCode>
									<settlement>Worcester</settlement>
									<region>MA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Feifan</forename><surname>Liu</surname></persName>
							<email>feifan.liu@umassmed.edu</email>
							<affiliation key="aff1">
								<orgName type="institution">University of Massachusetts Medical School</orgName>
								<address>
									<postCode>01655</postCode>
									<settlement>Worcester</settlement>
									<region>MA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Max</forename><forename type="middle">P</forename><surname>Rosen</surname></persName>
							<email>max.rosen@umassmemorial.org</email>
							<affiliation key="aff1">
								<orgName type="institution">University of Massachusetts Medical School</orgName>
								<address>
									<postCode>01655</postCode>
									<settlement>Worcester</settlement>
									<region>MA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">UMass at ImageCLEF Caption Prediction 2018 Task</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">7C859E214257DE7E259BEEDF99C93CC3</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:30+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Image Caption</term>
					<term>Encoder-Decoder</term>
					<term>LSTM</term>
					<term>Convolutional Neural Network</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes the details of our participation in the image caption prediction task at CLEF 2018. We explored and implemented an encoder-decoder framework to generate caption given a medical image. For the encoder, we compared two variations of convolutional neural networks (CNN) architectures: ResNet-152 and VGG-19, and for the decoder we used the long short term memory (LSTM) recurrent neural network. The attention mechanism was also experimented on a smaller sample to evaluate its impact on the model fitting and prediction performance. We submitted 4 valid runs and the best result achieved 0.18 BLEU score, which ranked second among all participant teams.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Automatically making computer understand the content of an image and offering reasonable description in natural language has gained more and more attentions from the computer vision and natural language processing community. In clinical practice, medical specialist and medical researchers usually write diagnosis reports to record microscopic findings from images, so automatic captioning on medical images will benefit healthcare providers with valuable insights and reduce their burden across the overall clinical workflow.</p><p>Due to its importance, ImageCLEF 2018 <ref type="bibr" target="#b4">[5]</ref> continued the medical image captioning challenge <ref type="bibr" target="#b5">[6]</ref> with the aim of advancing methodological development in mapping visual information from medical images to condensed textual descriptions. The main stream of Recent methods <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref><ref type="bibr" target="#b9">10]</ref> was combining the recurrent neural networks (RNN) for modeling natural language, with deep convoluational neural networks(CNN) for extracting image features. Within that framework, RNN based natural language generation was conditioned on CNN based image contextual features through encoder-decoder framework which was originally proposed for neural machine translation <ref type="bibr" target="#b0">[1]</ref>. In the meantime, different deep architectures were also explored, including Deep Residual Networks <ref type="bibr" target="#b8">[9]</ref> which created shortcuts between different layers to enable neural network to model the identity mapping, and Wide Residual Networks <ref type="bibr" target="#b7">[8]</ref> which shows that the widening of residual networks blocks can lead to improved performance compared to increasing their depth. More recently, Zhang et al. applied MDNet <ref type="bibr" target="#b6">[7]</ref> on medical image captioning, where the image model utilizes an enhanced multi-scale feature ensemble method and the language model adopts an improved attention mechanism, outperforming other comparative baselines. It is worth noting that the caption in their clinical dataset was generated through a special guideline which requires additional manual efforts.</p><p>In participating the CLEF caption prediction task, we adopted the proven successful Encoder-Decoder architecture of neural networks, where we investigated two state-of-the-art image encoder variants (VGG-19 and ResNet-152) integrated with LSTM recurrent neural networks. Most existing studies use image encoders which were pre-trained on ImageNet, a very large dataset containing daily images from general domain. However, this pre-trained model is not available in medical domain. Therefore, we performed finetuning during end-to-end training of our model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Dataset</head><p>The whole training dataset of Image CLEF caption prediction contains 222,314 image-caption pairs, and the length of captions varies from 1 word to 816 words. The average caption length is 20 words. Images are mainly from medical domain, but some of them are generic pictures such as animals or geography pictures. Caption itself is also noisy, e.g. some copy right or author information is also included. The official test data contains 9,938 images which are used to evaluate the system's performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methods</head><p>In the encoder-decoder framework, the encoder extracts effective image features, based on which the decoder generates a sequence of textual terms for summarizing the image's content. As image understanding is crucial for caption prediction, we built two systems: VGG Attention Model using VGG-19 architecture as encoders integrated with the attention mechanism, and ResNet Model using residual networks as encoders. Those two encoders have shown promising results on image classification and are expected to work well for caption prediction. Both systems depend on LSTM model for decoding.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Preprocessing</head><p>To improve generality of model, we used random crop method to augment more data into original dataset. As we found that only 0.28%(628 instances) of the training data contain captions with length larger than 145, we removed these data to reduce the sequence length required in LSTM model which can speed up the training process without losing much data representativeness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">VGG Attention Model</head><p>We adopted the architecture from the "show,attend, and tell" <ref type="bibr" target="#b9">[10]</ref> as our VGG attention model, which is one of the state-of-art models for general domain image captioning task. It consists of a VGG-19 encoder and an attention based LSTM decoder where the soft attention mechanism was implemented. The architecture is illustrated in Fig. <ref type="figure" target="#fig_0">1</ref>. Encoder: VGG-19 Attention mechanism can guide the LSTM decoder to dynamically focus on specific parts of the image when generating the next caption term at each time step. Therefore we choose the output from the relatively lower convolution layer of the VGG net as image feature representation, which enables the model to attend to salient regions of the image. Specifically, we extracted features from the conv5 3 layer, which provides the 14 x 14 x 512 feature map (512 is the number of feature maps). The flattened 196 x 512 encoding will be fed in the subsequent LSTM decoder.</p><p>Decoder: Attention Based LSTM As shown in Fig. <ref type="figure" target="#fig_0">1</ref>, the LSTM encoder receives the current word embedding and a dynamic image representation at each time step through the attention module. The attention module calculates different weights for feature map from different image locations, so that the LSTM encoder will know which location should be focused more for producing the next word. The weights are based on the hidden state of the LSTM at previous time step and the feature map extracted from the aforementioned VGG encoder(Fig. <ref type="figure" target="#fig_0">1</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">ResNet Model</head><p>Another model we explored is to use ResNet(residual neural networks) architecture in place of VGG network as image encoders. The motivation to swap VGG with ResNet is that the latter achieved state-of-the-art results in ImageNet classification tasks in general domain. It indicates that ResNet has the potential to produce a better encoding on image features. By adding shortcut of each block, residual network architecture also accounts for the vanishing gradients problem when training very deep convolution neural networks.</p><p>As shown in Fig <ref type="figure" target="#fig_1">2</ref>, we adopted ResNet-152 encoder which directly connects to a standard LSTM decoder. Due to the depth of the architecture, we didn't add attention in this model as it is tricky to pick up an appropriate layer to attend to, which we will investigate in the future work. For implementation, we used pre-trained models' parameters to initialize ResNet-152 and performed end-to-end finetuning. Note that for both systems, the LSTM decoder doesn't use any pre-trained model. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Over-fit a Small Dataset</head><p>One of the challenges for this caption prediction task is that the caption length varies largely, and the number of words per ranges from 1 to 145 even with our preprocessing step. In last year's challenge, ISIA <ref type="bibr" target="#b10">[11]</ref> implemented separate deep learning models to handle different lengths of captions, while the other team PRNA <ref type="bibr" target="#b11">[12]</ref> developed one deep learning model to handle all of them. It shows that a single model may work well as long as it is properly parameterized. As the Hochreiter pointed out( <ref type="bibr" target="#b12">[13]</ref>), LSTM has capacity to handle the length over 1000. In order to determine some hyper-parameters, we sampled 10 instances from the training data whose captions contains words from 140 to 145. Using this small data, we tried to over-fit our deep learning model to empirically choose hyper-parameters by measuring its capacity of modeling this small data with long captions. Two hyper-parameters we are interested are the hidden size of the LSTM and size of word embedding, which plays an important role in mapping visual context features with textual terms.</p><p>We use the BLEU score as a metric to measure whether the model can overfit the small data properly. All the experiments are using Adam gradient descent method, and the learning rate is set as 0.0001. As we can see from Table <ref type="table" target="#tab_0">1</ref>, the capacity of the model to handle long length captions is more sensitive to hidden size than word embedding size, i.e. when the hidden size is not large enough(256), increasing the word embedding size doesn't help much(run1 vs. run2); when the hidden size is increased to 512, the model can fit well with embedding size of 256 or 512(run3 or run4). As expected, when embedding size is further reduced to 128, the capacity is adversely affected(run4 vs. run5). Based on this result, we choose the hidden size of 512 and the word embedding size of 256 in our systems. We also did this experiment on the attention LSTM model with VGG encoder and similar results were observed. We can see that run1 delivered an under-fitted model obtaining the worst performance. With the same training epochs, no differences were found in terms of removing UNK or not, which may be due to the pre-processing step embedded in the evaluation script. Run4 achieved the best result of 0.18 demonstrating fully finetuning is beneficial.  <ref type="figure" target="#fig_2">3</ref> and Fig. <ref type="figure" target="#fig_3">4</ref> show two examples, one is from CLEF image caption prediction dataset, which has five sentences, and the other one is from COCO dataset, which only has one sentence.</p><p>The caption in Fig. <ref type="figure" target="#fig_2">3</ref> is very complicated, covering different semantics: a summary at the beginning followed by other descriptions as well as reasoning/inferring which can't be seen in the picture at all. In contrast, the caption from COCO dataset(Fig. <ref type="figure" target="#fig_3">4</ref>), it only describes what can be seen from the picture. Thus, compared with the general domain VQA task, CLEF caption prediction is  The man at bat readies to swing at the pitch while the umpire looks on more challenging, not to mention the additional added complexity due to difficult medical terms. To address this, the model needs to be more intelligent with adequate reasoning ability, which may require more complex and hierarchical text modeling structure with support of background knowledge.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Illustration of VGG Attention Model. We use VGG-19-conv5-3 to extract features from images and the initialization of hidden state (h0 in picture) is the average feature map. LSTM receives the current word embedding, and weighed feature map to predict the next word.</figDesc><graphic coords="3,134.77,267.33,283.46,208.52" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Illustration of ResNet-based Model.</figDesc><graphic coords="4,134.77,455.64,283.46,97.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Example: image of CLEF training dataset, caption: (1)Summary of key interactions in 'atypical' WD-repeats 6 and 7 of RACK1, illustrated with the structure of yeast Asc1p (PDB: 3FRX). (2)RACK1 proteins are characterised by a significant sequence extension between blades 6 and 7, leading to a knob-like protrusion from the upper face of the propeller. (3)The P0 Asp in blade 7 forms a salt bridge to an Arg at the P+4 position and this is packed against a Tyr (or in some orthologues Phe) in the P-2 position. (4) The P0 Asp is absent in blade 6. (5)In principle, these features together with the absence of the GH signature dipeptide on the loops between blades 5 and 6 and between blades 6 and 7 may facilitate structural transitions in this region of the protein that are important for function.</figDesc><graphic coords="7,134.77,138.97,170.11,141.74" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Example: image of COCO dataset, caption:The man at bat readies to swing at the pitch while the umpire looks on</figDesc><graphic coords="7,134.77,456.24,198.43,149.92" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Over-fit results of ResNet using a small data. Maximum number of epoch is 500.</figDesc><table><row><cell cols="2">Runs Hidden and Word Dimension</cell><cell>Model</cell><cell>BLEU</cell></row><row><cell>1</cell><cell>256 x 256</cell><cell cols="2">ResNet LSTM 0.132</cell></row><row><cell>2</cell><cell>256 x 512</cell><cell cols="2">ResNet LSTM 0.124</cell></row><row><cell>3</cell><cell>512 x 512</cell><cell cols="2">ResNet LSTM 0.908</cell></row><row><cell>4</cell><cell>512 x 256</cell><cell cols="2">ResNet LSTM 0.923</cell></row><row><cell>5</cell><cell>512 x 128</cell><cell cols="2">ResNet LSTM 0.324</cell></row><row><cell>3.5 Training</cell><cell></cell><cell></cell></row><row><cell cols="4">We used Adam stochastic gradient descent method for model training, with</cell></row><row><cell cols="4">learning rate set as 0.0001, the maximum epoch set as 20, and the batch size set</cell></row><row><cell cols="4">as 16. All our models were trained on two Tesla M60 GPUs in an Amazon EC2</cell></row><row><cell cols="4">server. However, attention based model training process is extremely slow, so we</cell></row><row><cell cols="4">have no time to get a usable model before the challenge deadline. Therefore, our</cell></row><row><cell cols="2">submitted runs were based on ResNet model.</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Official Test ResultsCLEF image caption prediction dataset is very different with COCO or Flickr 10k dataset. Some captions of this dataset are extraordinary long. Fig.</figDesc><table><row><cell>Submitted Runs</cell><cell cols="2">BLEU Rank</cell></row><row><cell>best results among teams</cell><cell>0.25</cell><cell>1</cell></row><row><cell>run1 epoch 6</cell><cell cols="2">0.132 4</cell></row><row><cell>run2 epoch 13</cell><cell cols="2">0.176 2</cell></row><row><cell>run3 epoch 13 remove UNK</cell><cell cols="2">0.176 2</cell></row><row><cell cols="3">run4 epoch 13 remove UNK finetune 0.180 2</cell></row><row><cell>5 Discussion and Conclusions</cell><cell></cell><cell></cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Neural Machine Translation by Jointly Learning to Align and Translate</title>
		<author>
			<persName><forename type="first">D</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1409.0473</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note>cs, stat</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Long-term Recurrent Convolutional Networks for Visual Recognition and Description</title>
		<author>
			<persName><forename type="first">J</forename><surname>Donahue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Hendricks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rohrbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Venugopalan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Guadarrama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Saenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1411.4389</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note>cs</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">DenseCap: Fully Convolutional Localization Networks for Dense Captioning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Karpathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1511.07571</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note>cs</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Show and Tell: A Neural Image Caption Generator</title>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Toshev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Erhan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1411.4555</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note>cs</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Overview of ImageCLEF 2018: Challenges, Datasets and Evaluation</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Villegas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>Herrera</surname></persName>
		</author>
		<author>
			<persName><surname>De</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Eickhoff</surname></persName>
		</author>
		<author>
			<persName><surname>Vincent Andrearczyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">D</forename><surname>Cid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Liauchuk</surname></persName>
		</author>
		<author>
			<persName><surname>Vassili Kovalev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Farri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Joey</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lungren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D.-T</forename><surname>Dang-Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Piras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Riegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gurrin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Ninth International Conference of the CLEF Association</title>
				<meeting>the Ninth International Conference of the CLEF Association<address><addrLine>CLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Overview of the ImageCLEF 2018 Caption Prediction tasks</title>
		<author>
			<persName><forename type="first">García</forename><surname>Seco De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Eickhoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Andrearczyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename></persName>
		</author>
		<ptr target="WS.org&lt;http://ceur-ws.org&gt;" />
	</analytic>
	<monogr>
		<title level="m">CLEF2018 Working Notes</title>
				<meeting><address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">MDNet: A Semantically and Visually Interpretable Medical Image Diagnosis Network</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mcgough</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1707.02485</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note>cs</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Zagoruyko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Komodakis</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1605.07146</idno>
		<title level="m">Wide Residual Networks</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note>cs</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Deep Residual Learning for Image Recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1512.03385</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note>cs</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zemel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1502.03044</idno>
		<title level="m">Show, Attend and Tell: Neural Image Caption Generation with Visual Attention</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note>cs</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">ISIA at the ImageCLEF</title>
		<author>
			<persName><forename type="first">S</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jiang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Image Caption Task</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">PRNA at ImageCLEF 2017 Caption Prediction and Concept Detection Tasks</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sreenivasan</surname></persName>
		</author>
		<imprint>
			<biblScope unit="volume">5</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Long Short-Term Memory</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Comput</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page">17351780</biblScope>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
