<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Explaining Emotional Attitude Through the Task of Image-captioning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Oleg</forename><surname>Bisikalo</surname></persName>
							<email>obisikalo@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Vinnytsia National Technical University</orgName>
								<address>
									<addrLine>Khmelnytsky highway 95</addrLine>
									<postCode>21021</postCode>
									<settlement>Vinnytsya</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Volodymyr</forename><surname>Kovenko</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Vinnytsia National Technical University</orgName>
								<address>
									<addrLine>Khmelnytsky highway 95</addrLine>
									<postCode>21021</postCode>
									<settlement>Vinnytsya</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ilona</forename><surname>Bogach</surname></persName>
							<email>ilona.bogach@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Vinnytsia National Technical University</orgName>
								<address>
									<addrLine>Khmelnytsky highway 95</addrLine>
									<postCode>21021</postCode>
									<settlement>Vinnytsya</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Olha</forename><surname>Chorna</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Kremenchuk Mykhailo Ostrohradskyi National University</orgName>
								<address>
									<addrLine>Pershotravneva Street, 20</addrLine>
									<postCode>39600</postCode>
									<settlement>Kremenchuk</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">International Conference on Computational Linguistics and Intelligent Systems</orgName>
								<address>
									<addrLine>May 12-13</addrLine>
									<postCode>2022</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Explaining Emotional Attitude Through the Task of Image-captioning</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">062CFDB4BF2B62876853E922E4850104</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T12:54+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Deep learning algorithms</term>
					<term>Emotional attitude</term>
					<term>SOTA models</term>
					<term>Image-captioning</term>
					<term>NLP</term>
					<term>Transfer-learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Deep learning algorithms trained on huge datasets of visual and textual information have been shown to learn features that are useful for other downstream tasks. This implies that such models understand the data at different levels of the hierarchy. In this paper we study the ability of SOTA (state-of-the-art) models for both texts and images to understand the emotional attitude caused by a situation. For this purpose we gathered a small-size dataset based on the IMDB-WIKI one and annotated it specifically for the task. In order to investigate the ability of the pretrained models to understand the data, a KNN clustering procedure is applied to the representations of texts and images in parallel. We show that although the models used are not capable of understanding the task at hand, a transfer-learning procedure based on them helps to improve results on tasks such as image-captioning and sentiment analysis. We then frame our problem as the task of image-captioning and experiment with different architectures and training approaches. Finally, we show that adding biometric features, such as emotion and gender probabilities, improves the results and leads to a better understanding of emotional attitude.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Recent developments in hardware and access to big datasets have allowed researchers to train sophisticated deep learning algorithms that surpassed many other approaches. The deep learning revolution affected many fields, with the most interesting results obtained in NLP (natural language processing) <ref type="bibr" target="#b0">[1]</ref> and CV (computer vision) <ref type="bibr" target="#b1">[2]</ref>. It was shown that SOTA models trained on big datasets (ImageNet <ref type="bibr" target="#b2">[3]</ref>, Google News) tend to learn useful features that can be reused for other downstream tasks <ref type="bibr" target="#b3">[4]</ref>. Building on that idea, we study how well such models understand the emotional attitude and its cause implicitly or explicitly introduced by visual and textual data. Understanding and explaining emotional attitude is a hard task even for a human, as the solution requires an exact understanding of cause and consequence as affected by the environment and biometric features. For the purpose of the experiments, a new small-size dataset of image-text pairs called "EmoAtCap" was collected. The overall contribution of our work is summarized below:</p><p>1. A small-size dataset, "EmoAtCap", based on the IMDB-WIKI one, that can be used for image-captioning and sentiment analysis. It is publicly available <ref type="bibr" target="#b4">[5]</ref> to facilitate future research in this domain.</p><p>2. A set of experiments on the tasks of image-captioning and sentiment analysis, based on features extracted from the highlighted models. It is also shown that adding biometric features such as gender and emotion distributions improves the performance of image-captioning models.</p><p>The training procedure was conducted using tensorflow <ref type="bibr" target="#b6">[6]</ref> and pytorch <ref type="bibr" target="#b7">[7]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Data collection</head><p>The data needed to include both images and their captions. As the main intent was to capture the emotional attitude, the images had to contain people and explicit or implicit information about the cause of their emotional state, while the captions had to give an exhaustive, unbiased description of the situation. Based on these requirements, the first idea was to build the dataset from a subset of existing image-captioning datasets.</p><p>Image-captioning is the process of generating a textual description of an image, which implies that the relevant dataset consists of image-text pairs. One of the most popular datasets for this task is COCO <ref type="bibr" target="#b8">[8]</ref>, which consists of 330K images. We used only the subset related to image-captioning, namely the 2014 train split, which consisted of 29766 images with 5 captions per image. As it would be hard and cumbersome to filter the images manually, a YoloV3 <ref type="bibr" target="#b9">[9]</ref> object-detection algorithm trained on this dataset was used, and only images containing objects of class "person" were kept. As a result, the COCO dataset was shrunk to 3731 images. However, the filtered images and captions only contained the actual plot of the image without any emotional attitude. The next dataset analyzed was VizWiz <ref type="bibr" target="#b10">[10]</ref>, the first goal-oriented VQA (visual question answering) dataset arising from a natural VQA setting, which consists of over 31,000 visual questions originating from blind people. The needed subset was obtained by filtering the captions using people-related words, but as the resulting data was of poor quality, this variant was declined. The last image-text dataset we experimented with was SentiCap <ref type="bibr" target="#b11">[11]</ref>, which consists of 2360 images containing sentiments. After filtering it in the same way as VizWiz, we were left with only 830 samples, which was not enough for our task.</p><p>The remaining option was to gather a dataset from scratch and annotate it. The images were taken from the IMDB-WIKI <ref type="bibr" target="#b12">[12]</ref> dataset for age and gender detection, and each image was annotated with a description of the emotional attitude of the person or people in it. As a result we obtained a dataset of 3840 image-text pairs, where each image was resized to 224x224 pixels (Fig. <ref type="figure" target="#fig_0">1</ref>). In order to categorize the dataset, sentiments were assigned to the captions using Vader <ref type="bibr" target="#b13">[13]</ref>, a rule-based model for sentiment analysis. The sentiments were then checked by humans one more time to produce more meaningful labels. The analysis showed that the data is imbalanced with respect to this new category (Fig. <ref type="figure">2</ref>).</p></div>
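The sentiment categorization step described above can be sketched as follows. This is a minimal illustration, not the exact pipeline: the ±0.05 thresholds on the compound score are Vader's conventional defaults, and `compound_score` is a hypothetical stand-in for the rule-based analyzer.

```python
# Sketch of the caption-categorization step, assuming Vader's conventional
# compound-score thresholds (+/-0.05). `compound_score` stands in for the
# real vaderSentiment analyzer used in the paper.

def label_sentiment(compound: float) -> str:
    """Map a Vader compound score in [-1, 1] to a sentiment category."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

def categorize_captions(captions, compound_score):
    """Attach a sentiment label to every caption in the dataset."""
    return [(c, label_sentiment(compound_score(c))) for c in captions]
```

In the real pipeline these automatically assigned labels were then reviewed by humans, as described above.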
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2: Distribution of caption sentiments in the dataset</head><p>The new sentiment category was used for the clustering analysis and for solving the task of sentiment analysis given the captions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Pretrained models overview</head><p>In order to analyze the ability of pretrained models to understand information as difficult as emotional attitude, recent SOTA models trained on big datasets of textual and visual information were chosen.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">ResNet</head><p>ResNet, introduced by Kaiming He et al., is a deep convolutional architecture that surpassed previous results on the ImageNet benchmark and proved successful for object detection, obtaining a 28% relative improvement on the COCO object detection dataset. The main advantage of this architecture is the addition of residual connections, which help to fight the problem of vanishing gradients typical for deep neural networks. This made it possible to train very deep networks, each layer of which learns different useful features. In our work, ResNet152V2 pretrained on the ImageNet dataset was used. We also experimented with ResNet50 trained on the FER <ref type="bibr" target="#b14">[14]</ref> dataset.</p></div>
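The residual connection at the heart of the architecture can be illustrated with a minimal sketch; the inner block `f` stands for the stack of convolutional layers in a real ResNet unit.

```python
import numpy as np

# Minimal sketch of a residual connection, the key ingredient of ResNet.
# `f` is an arbitrary inner block of layers (convolutions in the real
# architecture); the identity shortcut lets gradients bypass it.

def residual_block(x: np.ndarray, f) -> np.ndarray:
    """Return ReLU(f(x) + x), the basic residual unit."""
    return np.maximum(f(x) + x, 0.0)

# With a zero-output inner block the unit reduces to the identity (for
# non-negative inputs), which is what makes very deep stacks trainable:
x = np.array([1.0, 2.0, 3.0])
y = residual_block(x, lambda v: np.zeros_like(v))
```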
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">EfficientNet</head><p>EfficientNet, introduced by Tan et al., is a deep convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth, width and resolution using a compound coefficient. It achieves a state-of-the-art 84.3% top-1 accuracy on ImageNet and transfers well to other tasks, reaching state-of-the-art accuracy on CIFAR-100 <ref type="bibr" target="#b15">[15]</ref> (91.7%), Flowers <ref type="bibr" target="#b16">[16]</ref> (98.8%), and 3 other transfer-learning datasets. In our work, EfficientNet trained on the age-gender IMDB-WIKI dataset was used.</p></div>
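Compound scaling can be illustrated with a short sketch; the base coefficients below (alpha = 1.2, beta = 1.1, gamma = 1.15) are the ones reported in the EfficientNet paper for the B0 baseline, found by grid search under the constraint alpha * beta^2 * gamma^2 ≈ 2.

```python
# Sketch of EfficientNet's compound scaling rule: depth, width and input
# resolution are scaled together by a single coefficient phi, instead of
# tuning each dimension independently.

ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # base coefficients from the paper

def compound_scale(phi: int):
    """Return (depth, width, resolution) multipliers for a given phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# phi = 0 recovers the B0 baseline; larger phi yields the bigger variants.
depth, width, resolution = compound_scale(2)
```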
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Word2Vec</head><p>Word2Vec, introduced by Mikolov et al. <ref type="bibr" target="#b17">[17]</ref>, is a neural network based approach to learning word embeddings. It offers two training methods: CBOW and skip-gram. In CBOW, the model is asked to predict the current word given its context, whereas skip-gram predicts the words within a certain range before and after the current word. As a result of such training, the model learns meaningful word vectors that are often used for transfer learning. Word2Vec embeddings pretrained on Google News, with a vector dimensionality of 300, were used in this paper.</p><p>The exact setup of the experiments, the description of the layers from which the data representations were derived, and the experimental results are discussed further in the paper.</p></div>
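The two training schemes can be illustrated by how their training pairs are formed. A minimal sketch for skip-gram is shown below (CBOW simply inverts the roles, predicting the center word from the surrounding context):

```python
# Sketch of skip-gram training-pair generation: every word predicts the
# words within `window` positions before and after it.

def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```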
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Image-captioning</head><p>Image understanding is the process of interpreting regions/objects to figure out what is happening in an image. This may include figuring out what the objects are, their spatial relationships to each other, etc. <ref type="bibr" target="#b18">[18]</ref>. This statement implies that one definition of scene understanding is the capability of describing its context. Thus, we theorize that a model which can describe the emotional attitude based on an image is capable of understanding it. The task of describing an image is known as image-captioning, and it gained huge popularity with the development of deep neural networks <ref type="bibr" target="#b19">[19]</ref>. Though there are many different approaches to the task [20], we exploit only the encoder-decoder architecture, where the encoder's goal is to encode the image into a feature vector and the decoder's goal is to generate the captions based on this information. The theoretical foundations of constructing text messages/captions by modeling combinations of significant words are considered in <ref type="bibr" target="#b21">[21]</ref>. For the role of encoder a convolutional neural network is often exploited, whereas for the role of decoder a recurrent one. In our work, different encoder-decoder architectures are investigated for solving the task of image-captioning.</p><p>As it was stated by Kovenko et al. <ref type="bibr" target="#b22">[22]</ref>, by solving the problem of data reconstruction, autoencoders tend to learn low-level features, which are useful for transfer learning. Based on this idea we train a deep convolutional autoencoder on our dataset and use the latent code produced by the encoder part for encoding images in the image-captioning task. The experiments also include the output of the 4th block of ResNet, along with the logits of ResNet, as the encoders. 
In order to compare these transfer-learning approaches, we also experiment with a custom, not pretrained convolutional encoder.</p><p>The decoder part is represented by an embedding layer and an LSTM (long short-term memory) <ref type="bibr" target="#b23">[23]</ref> network. LSTM is capable of learning long-term dependencies, which is especially useful when working with sequential data. For all the experiments, Word2Vec was used as the embedding layer, and layer normalization <ref type="bibr" target="#b24">[24]</ref> was applied after the LSTM. As it was stated by Xu et al. <ref type="bibr" target="#b25">[25]</ref>, the attention mechanism applied to image-captioning tasks can greatly improve results. Nezami et al. <ref type="bibr" target="#b26">[26]</ref> showed that using additional emotion features helps to improve results on image-captioning datasets that include emotional aspects. Based on these ideas, we experimented with using attention and with conditioning the LSTM on additional features. Differently from Nezami's approach, gender features were also used, and the emotional ones were encoded as a probability distribution. Specifically, YoloV3 is used to extract face regions from the images, and EfficientNet trained on the age-gender dataset along with ResNet trained on the FER one are used to predict gender and emotions.</p><p>Gender features are produced using the predicted probabilities for each face present in the image (formula 1).</p><formula xml:id="formula_0">S_g = \frac{S_g}{\sum_{g}^{G} S_g}, \quad S_g = \sum_{i=1}^{N} \mathbf{1}[P_i = g], \quad P_i = \operatorname{argmax}(pred_i)<label>(1)</label></formula><p>where G is the number of unique genders, g is a gender, S \in R^G is the normalized vector of gender probabilities, N is the number of faces present in the image, \mathbf{1}[P_i = g] is the indicator of P_i being equal to the specific gender g, and P_i is the result of an argmax operation over the prediction probability vector for face i. 
Emotional features are produced as the normalized probability distribution of the sum of the probability vectors for each face present in the image (formula 2).</p><formula xml:id="formula_1">E = \sum_{i}^{N} pred_i, \quad E = \frac{E}{\sum_{j}^{M} E_j}<label>(2)</label></formula><p>where E \in R^M is the vector of averaged emotion probabilities, N is the number of faces present in the image, pred_i is the prediction probability vector for face i, and M is the number of unique emotions.</p><p>The data was split in the same way as for sentiment analysis. The approaches were validated based on test set performance using the beam search technique with a beam size of 5. The BLEU score along with perplexity were used as the main metrics. For all the experiments the RMSprop optimizer was used, with an initial learning rate of 0.0001. In order not to overfit, a learning rate reduction technique was used: if there was no improvement in validation perplexity for two epochs, the learning rate was reduced by a factor of 10. All the models were trained with a batch size of 64 for 30 epochs (Fig. <ref type="figure" target="#fig_1">3</ref>). Analyzing the results, it is obvious that the transfer-learning procedure gives better results than training from scratch (ordinary) w.r.t. BLEU on the test set. It is also clear that the ResNet representation tends to give better results than the autoencoder's one, possibly because of a deeper architecture and better learned features. Attention did not work well for any of the approaches, probably because of the low number of samples in the dataset and the small number of epochs. So far, the approach that utilized the logits output of ResNet for the encoder part of the network along with Word2Vec embeddings and additional emotion features (resnet_logits_w2v_emotions) gave the best results on test data w.r.t. the averaged BLEU score. The other approach which is also worth paying attention to is the one which incorporates both emotion and gender features. 
Although resnet_logits_w2v_emotions_gender did not achieve the best performance on test BLEU, it reached the most balanced performance across all data splits, and thus was chosen as the best one. The architecture of the overall prediction pipeline is shown in Fig. <ref type="figure">4</ref>.</p></div>
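Formulas (1) and (2) can be sketched as follows, assuming `gender_preds` and `emotion_preds` are N x G and N x M arrays holding the per-face probability vectors produced by the gender and emotion classifiers:

```python
import numpy as np

# Sketch of the additional biometric features (formulas 1 and 2); the
# input arrays stand in for the classifier outputs described in the text.

def gender_features(gender_preds: np.ndarray) -> np.ndarray:
    """Formula 1: count argmax genders over faces, then normalize."""
    labels = gender_preds.argmax(axis=1)                    # P_i = argmax(pred_i)
    counts = np.bincount(labels, minlength=gender_preds.shape[1])
    return counts / counts.sum()                            # S_g / sum_g S_g

def emotion_features(emotion_preds: np.ndarray) -> np.ndarray:
    """Formula 2: sum per-face emotion probabilities, then normalize."""
    e = emotion_preds.sum(axis=0)                           # E = sum_i pred_i
    return e / e.sum()                                      # E / sum_j E_j
```

Both feature vectors sum to one, so they can be concatenated with the image representation and fed to the conditioned LSTM.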
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 4. Architecture of the pipeline of the resnet_logits_w2v_emotions_gender approach</head><p>As can be seen from Fig. <ref type="figure">4</ref>, the overall pipeline depends on the face pre-processing step along with the detection of emotions and gender. Obviously, if the performance of the highlighted steps is poor, the final output will be at least biased. An example of such bias is presented in Fig. <ref type="figure" target="#fig_2">5</ref>. During error analysis it was found that the model suffers from slight overfitting on the most frequent words and phrases (like "man is flirting with a woman" in Fig. <ref type="figure" target="#fig_2">5</ref>), a problem caused by the small diversity of the dataset. Despite the fact that the collected data is noisy, as each image was annotated by a different expert, which is not ideal for the task of image-captioning, the model succeeds in giving adequate results on average (Fig. <ref type="figure">6</ref>). It is important to note that longer training would probably give better results.</p></div>
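The beam-search decoding used to generate the captions (beam size 5) can be sketched as follows; `step_logprobs(prefix)` is a hypothetical stand-in for the trained decoder, returning a mapping from next tokens to log-probabilities.

```python
# Minimal beam-search sketch matching the decoding setup described above.
# Greedy decoding is the special case beam_size = 1.

def beam_search(step_logprobs, beam_size=5, max_len=20, eos="<eos>"):
    beams = [((), 0.0)]  # each beam: (token tuple, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:    # finished beams carry over
                candidates.append((prefix, score))
                continue
            for tok, lp in step_logprobs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        # keep only the beam_size highest-scoring hypotheses
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(b[0] and b[0][-1] == eos for b in beams):
            break
    return list(beams[0][0])
```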
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion and further work</head><p>In this paper we analyzed the ability of deep learning models to understand the emotional attitude driven by a situation. For this purpose, a new dataset of image-text pairs was presented. As a result of the analysis of pretrained SOTA models, it was concluded that some of them can be used in the process of transfer-learning. Through the experiments it was shown that the dataset can be used to solve the problem of sentiment analysis. It was then theorized that the problem of understanding emotional attitude can be transferred to the task of image-captioning. Empirical results have shown that the addition of emotion and gender features, along with transfer-learning based on the ResNet network and Word2Vec embeddings, improves the overall captioning performance. Our approach gives adequate results on average, confirming that deep learning models are able to understand emotional attitude if they are trained to. It is important to note that such an approach has many downsides, as it depends on the performance of three additional models for face, emotion and gender detection. The other problem faced is the noisy nature of the dataset and the small variation of phrases in it. In future work it is planned to gather a bigger dataset, label each image with 5 captions and fix the current problems.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1</head><label>1</label><figDesc>Figure 1 (a, b): Dataset examples with corresponding captions</figDesc><graphic coords="2,85.23,596.03,431.89,102.55" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 .</head><label>3</label><figDesc>Figure 3. Comparison of image-captioning models and approaches. For train and validation perplexity, the values are shown for the last epoch of training</figDesc><graphic coords="6,72.00,72.00,464.20,178.69" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 5 .</head><label>5</label><figDesc>Figure 5. Example of bias of additional features w.r.t image-captioning process. S -vector of gender features, E -vector of emotions features. T -true caption, greedy -result of greedy decoding, beam -result of beam search decoding. Changing additional features, changes the generation of captions using greedy decoding strategy.</figDesc><graphic coords="7,72.00,72.00,452.66,239.95" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>b - e). Examples of generated captions. T - true caption, greedy - result of greedy decoding, beam - result of beam search decoding. Captions which are fully inappropriate are marked with blue.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="8,93.15,309.76,419.85,124.35" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Acknowledgements</head><p>We would like to thank Oleksii Abdullaiev, Dmytro Tarasovskyi and Dmytro Maliovanyi for their contribution in terms of the dataset creation.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Natural Language Processing</title>
		<author>
			<persName><forename type="first">Elizabeth</forename><forename type="middle">D</forename><surname>Liddy</surname></persName>
		</author>
		<ptr target="https://surface.syr.edu/cgi/viewcontent.cgi?article=1043&amp;context=istpub" />
	</analytic>
	<monogr>
		<title level="m">Encyclopaedia of Library and Information Science</title>
				<meeting><address><addrLine>NY</addrLine></address></meeting>
		<imprint>
			<publisher>Marcel Decker, Inc</publisher>
			<date type="published" when="2021-12-12">12 December 2021</date>
		</imprint>
	</monogr>
	<note>2nd Ed</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">What is Computer Vision</title>
		<ptr target="https://www.ibm.com/topics/computer-vision" />
		<imprint>
			<date type="published" when="2021-12-12">12 December 2021</date>
		</imprint>
		<respStmt>
			<orgName>IBM</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Imagenet: A large-scale hierarchical image database</title>
		<author>
			<persName><forename type="first">Jia</forename><surname>Deng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE conference on computer vision and pattern recognition</title>
				<imprint>
			<date type="published" when="2009">2009. 2009</date>
			<biblScope unit="page" from="248" to="255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">How transferable are features in deep neural networks?</title>
		<author>
			<persName><forename type="first">Jason</forename><surname>Yosinski</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1411.1792</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title/>
		<author>
			<persName><forename type="first">Volodymyr</forename><surname>Kovenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oleksii</forename><surname>Abdullaiev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dmytro</forename><surname>Maliovanyi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dmytro</forename><surname>Tarasovskyi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ilona</forename><surname>Bogach</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">EmoAtCap : Emotional attitude captioning dataset</title>
		<author>
			<persName><forename type="first">Oleh</forename><surname>Bisikalo</surname></persName>
		</author>
		<idno type="DOI">10.17632/dym6p2pvbt</idno>
	</analytic>
	<monogr>
		<title level="j">Mendeley Data</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Tensorflow: Large-scale machine learning on heterogeneous distributed systems</title>
		<author>
			<persName><forename type="first">M</forename><surname>Abadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Barham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Brevdo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Citro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zheng</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1603.04467</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Pytorch: An imperative style, high-performance deep learning library</title>
		<author>
			<persName><forename type="first">A</forename><surname>Paszke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Massa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lerer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bradbury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chintala</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="8026" to="8037" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Microsoft coco: Common objects in context</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Maire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Belongie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hays</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Perona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ramanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Zitnick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European conference on computer vision</title>
				<meeting><address><addrLine>, Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014-09">2014. September</date>
			<biblScope unit="page" from="740" to="755" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Yolov3: An incremental improvement</title>
		<author>
			<persName><forename type="first">J</forename><surname>Redmon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1804.02767</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Vizwiz grand challenge: Answering visual questions from blind people</title>
		<author>
			<persName><forename type="first">D</forename><surname>Gurari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Stangl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Grauman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Bigham</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="3608" to="3617" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Senticap: Generating image descriptions with sentiments</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mathews</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2016-03">2016. March</date>
			<biblScope unit="volume">30</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Dex: Deep expectation of apparent age from a single image</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rothe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Timofte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Gool</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICCVW.2015.41</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE international conference on computer vision workshops</title>
				<meeting>the IEEE international conference on computer vision workshops</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="10" to="15" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Vader: A parsimonious rule-based model for sentiment analysis of social media text</title>
		<author>
			<persName><forename type="first">C</forename><surname>Hutto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gilbert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International AAAI Conference on Web and Social Media</title>
				<meeting>the International AAAI Conference on Web and Social Media</meeting>
		<imprint>
			<date type="published" when="2014-05">2014. May</date>
			<biblScope unit="volume">8</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Challenges in representation learning: A report on three machine learning contests</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">J</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Erhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">L</forename><surname>Carrier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mirza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hamner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.neunet.2014.09.005</idno>
	</analytic>
	<monogr>
		<title level="j">Neural Networks</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="page" from="59" to="63" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Learning Multiple Layers of Features from Tiny Images</title>
		<author>
			<persName><forename type="first">Alex</forename><surname>Krizhevsky</surname></persName>
		</author>
		<ptr target="https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf" />
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
	<note type="report_type">Tech Report</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Automated flower classification over a large number of classes</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Nilsback</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Sixth Indian Conference on Computer Vision, Graphics &amp; Image Processing</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2008-12">2008. December</date>
			<biblScope unit="page" from="722" to="729" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Hierarchy-based image embeddings for semantic image retrieval</title>
		<author>
			<persName><forename type="first">B</forename><surname>Barz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Denzler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Winter Conference on Applications of Computer Vision (WACV)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019-01">2019. January</date>
			<biblScope unit="page" from="638" to="647" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">Bryan</forename><forename type="middle">S</forename><surname>Morse</surname></persName>
		</author>
		<ptr target="http://www.sci.utah.edu/~gerig/CS6640-F2012/Materials/BMorse-BYU-iu-active-contours.pdf" />
		<title level="m">Image Understanding</title>
				<imprint>
			<date type="accessed" when="2021-12-12">12 December 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Show and tell: A neural image caption generator</title>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Toshev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Erhan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="3156" to="3164" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">A comprehensive survey of deep learning for image captioning</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Z</forename><surname>Hossain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Sohel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Shiratuddin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Laga</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="1" to="36" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">The Method of Modelling the Mechanism of Random Access Memory of System for Natural Language Processing</title>
		<author>
			<persName><forename type="first">O</forename><surname>Bisikalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bogach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sholota</surname></persName>
		</author>
		<idno type="DOI">10.1109/TCSET49122.2020.235477</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 15th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="472" to="477" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">A Comprehensive Study of Autoencoders&apos; Applications Related to Images</title>
		<author>
			<persName><forename type="first">V</forename><surname>Kovenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bogach</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IT&amp;I Workshops</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="43" to="54" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Long short-term memory</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Layer normalization</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Ba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1607.06450</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Show, attend and tell: Neural image caption generation with visual attention</title>
		<author>
			<persName><forename type="first">K</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2015-06">2015. June</date>
			<biblScope unit="page" from="2048" to="2057" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Face-cap: Image captioning using facial expression analysis</title>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">M</forename><surname>Nezami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hamey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint European Conference on Machine Learning and Knowledge Discovery in Databases</title>
				<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018-09">2018. September</date>
			<biblScope unit="page" from="226" to="240" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
