<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">JUST at VQA-Med: A VGG-Seq2Seq Model</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Bashar</forename><surname>Talafha</surname></persName>
							<email>talafha@live.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Jordan University of Science and Technology</orgName>
								<address>
									<settlement>Irbid</settlement>
									<country key="JO">Jordan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mahmoud</forename><surname>Al-Ayyoub</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Jordan University of Science and Technology</orgName>
								<address>
									<settlement>Irbid</settlement>
									<country key="JO">Jordan</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">JUST at VQA-Med: A VGG-Seq2Seq Model</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">5B1C1880C2922A921B1DFA213476EA28</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:30+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Sequence to sequence</term>
					<term>VGG Network</term>
					<term>Global Vectors for Word Representation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes the VGG-Seq2Seq system for the Medical Domain Visual Question Answering (VQA-Med) Task of ImageCLEF 2018. The proposed system follows the encoder-decoder architecture, where the encoder fuses a pretrained VGG network with an LSTM network that has a pretrained word embedding layer to encode the input. To generate the output, another LSTM network is used for decoding. When used with a pretrained VGG network, the VGG-Seq2Seq model achieves reasonable results, with BLEU, WBSS and CBSS scores of 0.06, 0.12 and 0.03, respectively. Moreover, the VGG-Seq2Seq model is not expensive to train.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Visual Question Answering (VQA) is a recent and exciting problem at the intersection of Computer Vision (CV) and Natural Language Processing (NLP): the input is an image together with a question about it, written in natural language, and the output is the correct answer to that question. The answer can be a simple yes/no, a choice among several options, a single word, or a complete phrase or sentence <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b3">4]</ref>.</p><p>At first glance, the VQA problem seems very challenging. The traditional CV techniques used for extracting useful information from images and the NLP techniques typically used for Question Answering (QA) are very different from each other, and the interplay between them appears complex. Moreover, the ability to construct a useful answer from such multi-modal input adds to the complexity of the problem. Fortunately, recent advances in Deep Learning (DL) have paved the way to building more robust VQA techniques <ref type="bibr" target="#b3">[4]</ref>.</p><p>In this paper, we are interested in a variation of VQA where both the image and the question come from the medical domain. It is known as the Medical Domain Visual Question Answering (VQA-Med) Task <ref type="bibr" target="#b4">[5]</ref> of ImageCLEF 2018 <ref type="bibr" target="#b7">[8]</ref>. The task requires building a model that provides an answer to a question about the content of a medical image. To address it, we propose a DL model we call the VGG-Seq2Seq model. The model takes an image and a question as input and outputs the answer to the question by fusing features extracted from the image content with those extracted from the question itself.</p><p>The rest of this paper is organized as follows. The following section presents a very brief coverage of the related work.
Sections 3 and 4 discuss the problem at hand and the model we propose to handle it. The experimental evaluation of our model and its discussion are presented in Section 5. Finally, the paper is concluded in Section 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Works</head><p>According to a recent survey on the VQA problem <ref type="bibr" target="#b3">[4]</ref>, most of the existing approaches are based on DL techniques. The only interesting exceptions are the Answer Type Prediction (ATP) technique of <ref type="bibr" target="#b8">[9]</ref> and the Multi-World QA of <ref type="bibr" target="#b11">[12]</ref>. Of course, there are other non-DL approaches that are used as baselines for various datasets and approaches; discussing them is outside the scope of this paper.</p><p>Regarding the DL-based approaches for VQA, most of them employ a word embedding technique (typically, Word2Vec <ref type="bibr" target="#b13">[14]</ref>), sometimes coupled with a Recurrent Neural Network (RNN), to embed the question. Moreover, most of them use Convolutional Neural Networks (CNN) to extract features from the images. Examples of such approaches include iBOWIMG <ref type="bibr" target="#b24">[25]</ref>, Full-CNN <ref type="bibr" target="#b10">[11]</ref>, Ask Your Neurons (AYN) <ref type="bibr" target="#b12">[13]</ref>, Vis+LSTM <ref type="bibr" target="#b17">[18]</ref>, Dynamic Parameter Prediction (DPPnet) <ref type="bibr" target="#b14">[15]</ref>, etc. Another class of DL-based techniques employs some form of attention mechanism, such as Where to Look (WTL) <ref type="bibr" target="#b18">[19]</ref>, Recurrent Spatial Attention (R-SA) <ref type="bibr" target="#b25">[26]</ref>, Stacked Attention Networks (SAN) <ref type="bibr" target="#b22">[23]</ref>, Hierarchical Co-attention (CoAtt) <ref type="bibr" target="#b9">[10]</ref>, Neural Module Networks (NMNs) <ref type="bibr" target="#b0">[1]</ref>, etc.</p><p>Most of the work discussed in this section is not directly applicable to VQA-Med, for two reasons. The first is obvious: the focus on the medical domain, which gives this problem its unique set of challenges.
The second relates to how the answer sentences are constructed in VQA-Med, which differs from existing VQA datasets such as the DAtaset for QUestion Answering on Real-world images (DAQUAR) <ref type="bibr" target="#b11">[12]</ref>, Visual7W <ref type="bibr" target="#b25">[26]</ref>, Visual Madlibs <ref type="bibr" target="#b23">[24]</ref>, COCO-QA <ref type="bibr" target="#b17">[18]</ref>, the Freestyle Multilingual Image Question Answering dataset (FM-IQA) <ref type="bibr" target="#b2">[3]</ref>, Visual Question Answering (VQA) <ref type="bibr" target="#b1">[2]</ref>, etc.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Task Description and Dataset</head><p>Nowadays, patients can access and review the medical reports related to their healthcare, thanks to the availability and accessibility of electronic medical records, which helps them better understand their conditions. This increases the need for an automated system capable of taking a question about some medical problem, along with an accompanying image supporting the question, and providing a correct answer. This is exactly the task we address in this work: given an image in the medical domain associated with a set of clinically relevant questions, the goal is to answer the questions based on the visual image content <ref type="bibr" target="#b4">[5]</ref>.</p><p>The dataset consists of images from the medical domain, extracted from PubMed Central articles (essentially a subset of the ImageCLEF 2017 caption prediction task). It is divided into a training set of about 5k medical images associated with question-answer pairs, a validation set of about 0.5k such images, and a test set of about 0.5k medical images associated with questions only. Figure <ref type="figure" target="#fig_0">1</ref> shows some examples from the training set <ref type="bibr" target="#b4">[5]</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">The VGG-Seq2Seq Model</head><p>In this section, we discuss our VGG-Seq2Seq model, which follows the encoder-decoder architecture. The model is shown in Figure <ref type="figure" target="#fig_1">2</ref>. In the following paragraphs, we discuss its different parts in detail.</p><p>The encoder consists of two main components. The first component is a Long Short-Term Memory (LSTM) network with a pretrained word embedding layer, which encodes the question into a vector representation, while the second component is a pretrained VGG network that takes the image as input and extracts a vector representation of that image. At the final encoding step, the outputs of the two components are concatenated into one vector called the thought vector.</p><p>The decoder consists of an LSTM network that takes the thought vector as its initial state and the 〈start〉 token as input at the first time step, and tries to predict the answer using a softmax layer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Encoder</head><p>The encoder consists of two main components: an LSTM network with a pretrained word embedding layer, and the VGG network. The first component extracts the semantic meaning of the question. A 300-dimensional pretrained word embedding layer is used to encode each word into a dense semantic space using GloVe <ref type="bibr" target="#b16">[17]</ref>. This word representation is then fed to an LSTM network with 1024 hidden nodes. LSTM <ref type="bibr" target="#b6">[7]</ref> is a special type of Recurrent Neural Network (RNN) designed to solve the vanishing gradient problem. The LSTM layer uses its memory cells to store context information, and it has three gates (input, forget and output) that decide how the input is handled.</p><p>At any time step, the inputs to the LSTM cell are the current word (x), the previous hidden state (h-1) and the previous memory state (c-1), and the outputs are the current hidden state (h) and the current memory state (c). These states have 1024 hidden nodes. At the last time step in the sequence, we call the output hidden state of the last LSTM cell the final hidden state, and its output memory state the final memory state.</p><p>In the second component, we use the concept of transfer learning, where a pretrained model is used, with some modification, to serve an entirely new task. We use a pretrained VGG network <ref type="bibr" target="#b19">[20]</ref> with the last softmax layer removed. This network outputs a vector of size 4096 representing the features of the input image. This vector is then passed to two fully-connected layers with 2500 and 1024 hidden nodes, respectively.
The main purpose of these two layers is to reduce the dimension of the features vector so that it is close to that of the LSTM output vectors.</p><p>The 1024-dimensional image features vector is then concatenated with both the LSTM final hidden state and the final memory state, as shown in Figure <ref type="figure" target="#fig_1">2</ref>. We call these two vectors the thought vectors; we believe they represent the semantic meaning of the input question together with the features of the input image.</p></div>
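To make the encoder's final step concrete, here is a minimal pure-Python sketch of the feature projection and thought-vector concatenation. The toy dimensions (64, 32, 16) and random weights stand in for the real trained layers (4096, 2500, 1024); the `dense` helper is purely illustrative.

```python
import random

random.seed(0)

def dense(vec, out_dim):
    """Toy fully-connected layer with random weights and ReLU.
    In the paper, two such layers project the 4096-d VGG features
    down to 2500 and then 1024 dimensions."""
    weights = [[random.uniform(-0.1, 0.1) for _ in vec] for _ in range(out_dim)]
    return [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in weights]

# Toy stand-ins for the encoder outputs (real sizes in parentheses):
vgg_features = [random.random() for _ in range(64)]  # (4096) VGG, softmax removed
h_final = [random.random() for _ in range(16)]       # (1024) LSTM final hidden state
c_final = [random.random() for _ in range(16)]       # (1024) LSTM final memory state

# Project the image features toward the LSTM dimension: 64 -> 32 -> 16.
img = dense(dense(vgg_features, 32), 16)

# Thought vectors: image features concatenated with each LSTM final state.
thought_h = h_final + img
thought_c = c_final + img

print(len(thought_h), len(thought_c))
```

With the real sizes, each thought vector would be 1024 + 1024 = 2048-dimensional, matching the 2048-dimensional decoder LSTM described in Section 5.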
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Decoder</head><p>This part generates the answer to the input image and question. The decoder consists of an LSTM layer that takes three inputs: the 〈start〉 token, which signals the start of decoding, and the decoder's two initial states, the previous hidden state and the previous memory state. The decoder takes the encoder's final states (i.e., the encoder final hidden state and final memory state) as its initial states. Thus, the decoder's initial states are the thought vectors.</p><p>At the first time step, the LSTM cell takes the 〈start〉 token as input, given the initial state, and computes the probability distribution over target words using a softmax layer. The word with the highest probability becomes the first word of the answer; this word is then passed to the second LSTM cell as input to predict the second word of the answer. The full answer is generated by repeating this process until the model predicts the 〈end〉 token.</p></div>
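The greedy decoding loop just described can be sketched in a few lines. The tiny vocabulary and the `decoder_step` stub, which deterministically walks through a canned answer, are purely illustrative stand-ins for the trained LSTM-plus-softmax step; only the loop structure reflects the model.

```python
# Hypothetical toy vocabulary; a real decoder uses the answer vocabulary.
VOCAB = ["〈start〉", "〈end〉", "axial", "ct", "scan"]

def decoder_step(token, state):
    """Stub for one LSTM step + softmax: returns (probabilities, new state).
    Here it deterministically emits a canned answer for illustration."""
    canned = {"〈start〉": "axial", "axial": "ct", "ct": "scan", "scan": "〈end〉"}
    probs = {w: 0.0 for w in VOCAB}
    probs[canned[token]] = 1.0
    return probs, state

def greedy_decode(thought_state, max_len=10):
    """Feed 〈start〉, then feed each predicted word back in
    until 〈end〉 is predicted (or max_len is reached)."""
    token, state, answer = "〈start〉", thought_state, []
    for _ in range(max_len):
        probs, state = decoder_step(token, state)
        token = max(probs, key=probs.get)  # argmax over the softmax output
        if token == "〈end〉":
            break
        answer.append(token)
    return " ".join(answer)

print(greedy_decode(None))
```

In the actual model, `thought_state` would be the pair of 2048-dimensional thought vectors produced by the encoder, and `decoder_step` would update it at every step.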
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Evaluation and Results</head><p>This section discusses the experiments used to evaluate our model and the obtained results. However, we first need to discuss the evaluation process.</p><p>As described in the VQA-Med task description <ref type="bibr" target="#b4">[5]</ref>, three pre-processing steps are applied to each answer before running the evaluation metrics: (a) converting the answer to lower-case, (b) removing all punctuation and tokenizing the answer into a list of words, and (c) removing stopwords using NLTK's English stopwords list.</p><p>To evaluate our models, three metrics are used, as discussed in the VQA-Med task description <ref type="bibr" target="#b4">[5]</ref>: BLEU, WBSS and CBSS. The BLEU <ref type="bibr" target="#b15">[16]</ref> metric calculates the similarity between the predicted answer and the actual answer. The second metric, WBSS (Word-Based Semantic Similarity) <ref type="bibr" target="#b20">[21]</ref>, calculates semantic similarity in the biomedical domain. Finally, CBSS (Concept-Based Semantic Similarity) <ref type="bibr" target="#b21">[22]</ref> is similar to WBSS, except that it extracts biomedical concepts from the answers using MetaMap (via the pymetamap wrapper) and builds a dictionary from these extracted concepts.</p><p>Three experiments are conducted to evaluate our model. They are described as follows.</p><p>-In the first experiment, instead of using the pretrained VGG network, we built a Convolutional Neural Network (CNN) consisting of three convolutional and max-pooling layers acting as the feature extractor, followed by a fully-connected layer. This network outputs a vector of size 4096 representing the input image features; this vector is then fed to the 2500-node fully-connected layer, and the rest of the architecture stays as is.
-In the second experiment, we implemented the VGG-Seq2Seq model but, instead of using the pretrained network, we built the VGG network and trained its convolutional layers on the dataset; the rest of the architecture stays as is.</p><p>-In the last experiment, we ran our proposed model (VGG-Seq2Seq) with the pretrained VGG network on the dataset.</p><p>All three experiments use a single LSTM layer with dimension 1024 in the encoder and a single LSTM layer with dimension 2048 in the decoder. All models were trained using the RMSprop optimizer <ref type="bibr" target="#b5">[6]</ref> with a learning rate of 0.001 for 500 epochs, a batch size of 512, and a word embedding size of 300. As shown in Table <ref type="table" target="#tab_0">1</ref>, the results show that VGG-Seq2Seq (pre-trained VGG) achieves reasonable results, with BLEU, WBSS and CBSS scores of 0.06, 0.12 and 0.03, respectively. </p></div>
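The three answer pre-processing steps from the task description can be sketched as follows. To keep the sketch self-contained, the hardcoded stopword set is a small illustrative subset; the actual evaluation uses NLTK's full English stopwords list.

```python
import re

# Illustrative subset of NLTK's English stopwords (the real evaluation
# loads the full list via nltk.corpus.stopwords.words("english")).
STOPWORDS = {"the", "a", "an", "is", "of", "in", "on", "and", "this", "to"}

def preprocess_answer(answer):
    """The three steps from the VQA-Med task description:
    (a) lower-case, (b) strip punctuation and tokenize, (c) drop stopwords."""
    answer = answer.lower()                            # (a)
    tokens = re.findall(r"[a-z0-9]+", answer)          # (b)
    return [t for t in tokens if t not in STOPWORDS]   # (c)

print(preprocess_answer("This is an axial CT scan of the abdomen."))
```

Both the predicted and the reference answers go through this normalization before BLEU, WBSS and CBSS are computed, so the metrics compare content words only.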
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this paper, we addressed the very interesting yet challenging VQA-Med Task of ImageCLEF 2018. We introduced our VGG-Seq2Seq model, which employs an encoder-decoder architecture where the encoder fuses a pretrained VGG network with an LSTM network that has a pretrained word embedding layer to encode the input. For answer generation, another LSTM network is used as a decoder. When used with a pretrained VGG network, the VGG-Seq2Seq model achieves reasonable results, with BLEU, WBSS and CBSS scores of 0.06, 0.12 and 0.03, respectively. Moreover, the VGG-Seq2Seq model is not expensive to train. Obviously, the VGG-Seq2Seq model is far from perfect. We intend to keep working on it to increase its accuracy and reduce its run-time and space requirements.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Samples of images associated with question-answer pairs from the training set [5].</figDesc><graphic coords="3,134.77,209.17,345.83,140.61" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. The VGG-Seq2Seq Model.</figDesc><graphic coords="4,134.77,115.84,345.81,161.22" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Results of different models. It is worth mentioning that the best performing VGG-Seq2Seq is not very expensive to train. It took an average of 252.8 seconds per epoch on a Virtual Machine (VM) equipped with a Tesla K80 GPU card and 24GB of RAM. The VM ran Ubuntu with CUDA 9.0. For the implementation, we used Keras with a TensorFlow 1.8 backend.</figDesc><table><row><cell>Model</cell><cell>BLEU</cell><cell>WBSS</cell><cell>CBSS</cell></row><row><cell>VGG-Seq2Seq (Pre-trained VGG)</cell><cell>0.060986477</cell><cell>0.12167</cell><cell>0.029064</cell></row><row><cell>VGG-Seq2Seq</cell><cell>0.047820372</cell><cell>0.104488</cell><cell>0.014981</cell></row><row><cell>CNN-Seq2Seq</cell><cell>0.035619839</cell><cell>0.093911</cell><cell>0.011004</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Deep compositional question answering with neural module networks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Andreas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rohrbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Klein</surname></persName>
		</author>
		<idno>CoRR abs/1511.02799</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Vqa: Visual question answering</title>
		<author>
			<persName><forename type="first">S</forename><surname>Antol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mitchell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lawrence Zitnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE International Conference on Computer Vision</title>
				<meeting>the IEEE International Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="2425" to="2433" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Are you talking to a machine? dataset and methods for multilingual image question</title>
		<author>
			<persName><forename type="first">H</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="2296" to="2304" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Survey of visual question answering: Datasets and techniques</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Gupta</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1705.03865</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Overview of the ImageCLEF 2018 medical domain visual question answering task</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Farri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lungren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org" />
	</analytic>
	<monogr>
		<title level="m">CLEF2018 Working Notes. CEUR Workshop Proceedings</title>
				<meeting><address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">September 10-14 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Rmsprop: Divide the gradient by a running average of its recent magnitude</title>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Swersky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Coursera lecture</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
	<note>Neural networks for machine learning</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Long short-term memory</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Overview of ImageCLEF 2018: Challenges, datasets and evaluation</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Villegas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Eickhoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Andrearczyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">D</forename><surname>Cid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Liauchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kovalev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Farri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lungren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">T</forename><surname>Dang-Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Piras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Riegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gurrin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018)</title>
		<title level="s">LNCS Lecture Notes in Computer Science</title>
		<meeting><address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">September 10-14 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Answer-type prediction for visual question answering</title>
		<author>
			<persName><forename type="first">K</forename><surname>Kafle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Kanan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="4976" to="4984" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Hierarchical question-image co-attention for visual question answering</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances In Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="289" to="297" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Learning to answer questions from image using convolutional neural network</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AAAI</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page">16</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">A multi-world approach to question answering about real-world scenes based on uncertain input</title>
		<author>
			<persName><forename type="first">M</forename><surname>Malinowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fritz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1682" to="1690" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Ask your neurons: A deep learning approach to visual question answering</title>
		<author>
			<persName><forename type="first">M</forename><surname>Malinowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rohrbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fritz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Vision</title>
		<imprint>
			<biblScope unit="volume">125</biblScope>
			<biblScope unit="issue">1-3</biblScope>
			<biblScope unit="page" from="110" to="135" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Distributed representations of words and phrases and their compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="3111" to="3119" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Image question answering using convolutional neural network with dynamic parameter prediction</title>
		<author>
			<persName><forename type="first">H</forename><surname>Noh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">H</forename><surname>Seo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Han</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
		<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="30" to="38" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">BLEU: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">J</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th annual meeting on association for computational linguistics</title>
		<meeting>the 40th annual meeting on association for computational linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">GloVe: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</title>
		<meeting>the 2014 conference on empirical methods in natural language processing (EMNLP)</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Exploring models and data for image question answering</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zemel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="2953" to="2961" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Where to look: Focus regions for visual question answering</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">J</forename><surname>Shih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hoiem</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
		<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="4613" to="4621" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Very deep convolutional networks for large-scale image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1409.1556</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">BIOSSES: a semantic sentence similarity estimation system for the biomedical domain</title>
		<author>
			<persName><forename type="first">G</forename><surname>Soğancıoğlu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Öztürk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Özgür</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="issue">14</biblScope>
			<biblScope unit="page" from="i49" to="i58" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Verb semantics and lexical selection</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Palmer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 32nd annual meeting on Association for Computational Linguistics</title>
				<meeting>the 32nd annual meeting on Association for Computational Linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="1994">1994</date>
			<biblScope unit="page" from="133" to="138" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Stacked attention networks for image question answering</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Smola</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
		<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="21" to="29" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Visual madlibs: Fill in the blank description generation and question answering</title>
		<author>
			<persName><forename type="first">L</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Berg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">L</forename><surname>Berg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2015 IEEE International Conference on Computer Vision (ICCV)</title>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="2461" to="2469" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Simple baseline for visual question answering</title>
		<author>
			<persName><forename type="first">B</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sukhbaatar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Szlam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1512.02167</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Visual7w: Grounded question answering in images</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Groth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bernstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
		<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="4995" to="5004" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
