<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Ghent University-iMinds at MediaEval 2014 Diverse Images: Adaptive Clustering with Deep Features</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Baptist</forename><surname>Vandersmissen</surname></persName>
							<email>baptist.vandersmissen@ugent.be</email>
							<affiliation key="aff0">
								<orgName type="department">ELIS</orgName>
								<orgName type="laboratory">Multimedia Lab</orgName>
								<orgName type="institution">Ghent University-iMinds</orgName>
								<address>
									<settlement>Ghent</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Abhineshwar</forename><surname>Tomar</surname></persName>
							<email>abhineshwar.tomar@ugent.be</email>
						</author>
						<author>
							<persName><forename type="first">Fréderic</forename><surname>Godin</surname></persName>
							<email>frederic.godin@ugent.be</email>
						</author>
						<author>
							<persName><forename type="first">Wesley</forename><surname>De Neve</surname></persName>
							<email>wesley.deneve@ugent.be</email>
							<affiliation key="aff0">
								<orgName type="department">ELIS</orgName>
								<orgName type="laboratory">Multimedia Lab</orgName>
								<orgName type="institution">Ghent University-iMinds</orgName>
								<address>
									<settlement>Ghent</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Image and Video Systems Lab</orgName>
								<orgName type="institution">KAIST</orgName>
								<address>
									<settlement>Daejeon</settlement>
									<country key="KR">South Korea</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rik</forename><surname>Van De Walle</surname></persName>
							<email>rik.vandewalle@ugent.be</email>
							<affiliation key="aff0">
								<orgName type="department">ELIS</orgName>
								<orgName type="laboratory">Multimedia Lab</orgName>
								<orgName type="institution">Ghent University-iMinds</orgName>
								<address>
									<settlement>Ghent</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Ghent University-iMinds at MediaEval 2014 Diverse Images: Adaptive Clustering with Deep Features</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">8E15BE0AB85AA00F414C01E622B47D0F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T16:10+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we tackle the MediaEval 2014 Retrieving Diverse Social Images challenge, a filter-and-refinement problem defined over a Flickr-based ranked set of social images. We build upon the solutions proposed in [5] and mainly focus on exploiting the joint use of all modalities. The use of image features extracted from a deep convolutional neural network, combined with the use of distributed word representations, forms the basis of our approach.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>In this paper, we describe our approach for tackling the MediaEval 2014 Retrieving Diverse Social Images Task <ref type="bibr" target="#b1">[1]</ref>. This task focuses on result diversification in the context of image retrieval. We refer to <ref type="bibr" target="#b1">[1]</ref> for a complete task overview.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">METHODOLOGY</head><p>This section describes the four different approaches created to solve the aforementioned challenge. The approach used in the last run makes use of external data sources; all other approaches exclusively use data provided by the task organizers. We focused on two components: estimating the relevance of an image with respect to a specific location and estimating the similarity between a pair of images. In particular, runs 2, 3, and 5 build upon these components.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Run 1: Visual-only</head><p>We propose a hierarchical clustering-based approach for the ranking of images in accordance with their relevance and diversity for a specific location. We used the approach proposed in <ref type="bibr">[5]</ref> (cf. "Visual run").</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Run 2: Textual-only</head><p>The textual run makes use of information derived from the provided tags and other textual metadata. This approach aims at diversifying the results by optimizing an adapted performance metric. We modified both the relevance and diversity estimation of the algorithm proposed in <ref type="bibr">[5]</ref> (cf. "Textual run") as presented in the following sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1">Relevance Estimation</head><p>The relevance of an image is estimated from its textual metadata. Let T_x denote the set of tags assigned to image x. The following formula predicts the relevance of image x:</p><formula xml:id="formula_0">Rel(x) = α × tags(x) + β × (1 / flickr(x)),<label>(1)</label></formula><p>with α and β representing scalar weights,</p><formula xml:id="formula_1">tags(x) = (|{t | t ∈ T_x, tfidf_t &gt; λ}| / |T_x|) × Σ_{t ∈ T_x} tfidf_t,<label>(2)</label></formula><p>and flickr(x) denoting the original Flickr rank of image x. The TF-IDF score of tag t is denoted by tfidf_t. The tag score (cf. Equation <ref type="formula" target="#formula_1">2</ref>) is the sum of each tag's normalized TF-IDF score, multiplied by the relative number of high-scoring tags. In our approach, λ is set to the average TF-IDF score. This benefits images that have a larger number of more relevant tags.</p></div>
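As a minimal sketch of Equations 1 and 2 (not the authors' implementation), the tag-based relevance score can be written as follows; all tag names, TF-IDF scores, and the default weights alpha and beta are illustrative assumptions:

```python
# Sketch of Equations 1-2 (tag-based relevance estimation).
# Tag names, TF-IDF scores, and the weights alpha/beta are
# illustrative assumptions, not values from the paper.

def avg_tfidf(tfidf):
    """The paper sets the threshold lambda to the average TF-IDF score."""
    return sum(tfidf.values()) / len(tfidf) if tfidf else 0.0

def tag_score(tags, tfidf, lam):
    """Equation 2: sum of the tags' TF-IDF scores, weighted by the
    fraction of tags scoring above the threshold lam."""
    if not tags:
        return 0.0
    high = [t for t in tags if tfidf.get(t, 0.0) > lam]
    total = sum(tfidf.get(t, 0.0) for t in tags)
    return (len(high) / len(tags)) * total

def relevance(tags, tfidf, flickr_rank, alpha=1.0, beta=1.0):
    """Equation 1: weighted sum of the tag score and the inverse
    of the image's original Flickr rank."""
    return alpha * tag_score(tags, tfidf, avg_tfidf(tfidf)) + beta / flickr_rank
```

Images whose tags mostly score above the average TF-IDF therefore keep their full tag-score sum, while images with many low-scoring tags are penalized.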
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2">Diversity Estimation</head><p>The semantic difference between two images is estimated based on the number of shared tags. Let x and y denote two images, with T_x and T_y denoting their sets of tags, respectively. The diversity is then calculated as follows:</p><formula xml:id="formula_2">Div(x, y) = 1 − |T_x ∩ T_y| / max(|T_x|, |T_y|).<label>(3)</label></formula></div>
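A hedged one-function sketch of Equation 3 (the tag sets below are made up for the example):

```python
# Sketch of Equation 3: tag-overlap diversity between two images.
def diversity(tags_x, tags_y):
    """1 minus the shared-tag ratio; 1.0 means fully diverse.
    The empty-vs-empty case is an assumption (treated as diverse)."""
    sx, sy = set(tags_x), set(tags_y)
    if not sx and not sy:
        return 1.0
    shared = len(sx & sy)
    return 1.0 - shared / max(len(sx), len(sy))
```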
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Run 3: Visual and Textual</head><p>The fusion of visual and textual information results in a relevance-based clustering approach (cf. "Combined run" in <ref type="bibr">[5]</ref>). We modified the clustering technique into adaptive hierarchical clustering: the optimal distance at which to form clusters is determined by finding the "knee" point in the plot of the number of clusters versus the inter-cluster distance (similar to <ref type="bibr" target="#b3">[3]</ref>). To estimate the relevance of an image, we use our textual-only method (cf. Section 2.2.1). The diversity between two images is estimated based on the Euclidean distance between their visual descriptors, each represented by a CN3x3 and an LBP3x3 vector <ref type="bibr" target="#b1">[1]</ref>.</p></div>
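The knee-point idea can be sketched as follows; this uses the maximum second difference as a simple stand-in for the L-method of [3], and the input distance curve is invented for the example:

```python
# Sketch of the knee-point heuristic of Section 2.3: cut the
# dendrogram at the merge distance where the curve of number of
# clusters vs. inter-cluster distance bends most sharply.
# Maximum second difference is used as a simplified stand-in
# for the L-method of [3]; the inputs are illustrative.

def knee_point(distances):
    """Return the index of maximum curvature (largest second
    difference) in a sorted sequence of merge distances."""
    if len(distances) < 3:
        return 0
    best_i, best_curv = 0, float("-inf")
    for i in range(1, len(distances) - 1):
        curv = distances[i + 1] - 2 * distances[i] + distances[i - 1]
        if curv > best_curv:
            best_i, best_curv = i, curv
    return best_i
```

Cutting the hierarchy at the distance indexed by the knee point yields the adaptive number of clusters, instead of fixing it in advance.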
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Run 5: External Sources</head><p>The algorithm used to produce the fifth run is based on the one used in Section 2.3. Both the relevance and diversity estimation components are adapted, as described below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.1">Relevance Estimation</head><p>In order to accurately estimate the relevance of an image, a well-defined target location is necessary. Thus, each location is first described in both a textual and a visual manner.</p><p>To create the textual identity, related information for each location is extracted from DBpedia<ref type="foot" target="#foot_0">1</ref>. From this information, textual keywords are extracted and combined with the top k most frequently occurring tags in the set of images of the location. The visual identity is formed on the basis of a set of representative photos, retrieved via Wikipedia. The relevance of an image is then calculated as a linear combination of three factors: textual relevance, visual relevance, and Flickr relevance.</p><p>The textual relevance of an image is entirely based on its tags. Again, let T_x denote the set of tags of image x and let T_a denote the set of tags depicting location a (i.e., its textual identity):</p><formula xml:id="formula_3">Rel(x) = (1 / |T_x|) × Σ_{t ∈ T_x} e^(max_{k ∈ T_a} sim(t, k)).<label>(4)</label></formula><p>We propose a new method to compute the similarity between tags that omits the ubiquitous TF-IDF. Instead, we make use of distributed word representations, namely word2vec<ref type="foot" target="#foot_1">2</ref>. A pretrained model (the Google News dataset-based dictionary, whose vocabulary is denoted by T_w) is used to convert words to vectors. Such vectors preserve the semantic and linguistic regularities among words <ref type="bibr" target="#b2">[2]</ref>. The following formula describes this approach:</p><formula xml:id="formula_4">sim(t_a, t_b) = cos(Θ) if t_a ∈ T_w ∧ t_b ∈ T_w; 1 if (t_a ∉ T_w ∨ t_b ∉ T_w) ∧ t_a = t_b; 0 otherwise,<label>(5)</label></formula><p>with t_a and t_b depicting tags, and cos(Θ) the cosine similarity between their representative vectors.
With this technique, tags that are semantically similar but spelled differently can still influence the eventual relevance score.</p><p>The visual relevance is calculated based on the maximum similarity between the image and the representative Wikipedia images. Finally, the Flickr relevance is the inverse of the original Flickr rank of the image.</p></div>
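Equations 4 and 5 can be sketched in a few lines of Python; this is a minimal illustration, not the authors' implementation, and the tiny vector dictionary below stands in for the pretrained word2vec model (all tags and vectors are invented):

```python
# Sketch of Equations 4-5: tag similarity via word vectors,
# with an exact-match fallback for out-of-vocabulary tags.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sim(ta, tb, vectors):
    """Equation 5: cosine similarity if both tags are in the
    vocabulary; otherwise 1 for identical tags, 0 else."""
    if ta in vectors and tb in vectors:
        return cosine(vectors[ta], vectors[tb])
    return 1.0 if ta == tb else 0.0

def textual_relevance(image_tags, location_tags, vectors):
    """Equation 4: mean over the image's tags of
    e^(best similarity to any location tag)."""
    if not image_tags:
        return 0.0
    total = sum(math.exp(max(sim(t, k, vectors) for k in location_tags))
                for t in image_tags)
    return total / len(image_tags)
```

The exponential rewards tags with at least one close match in the location's textual identity, while unmatched tags still contribute a baseline of e^0 = 1.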
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.2">Diversity Estimation</head><p>To improve the similarity estimation, and thus the dissimilarity estimation, between two images, we attempt to find more effective visual descriptors. To this end, we make use of a deep convolutional neural network named OverFeat<ref type="foot" target="#foot_2">3</ref>, trained on 1.2 million images from ImageNet, to extract high-level features <ref type="bibr" target="#b4">[4]</ref>. Each image is resized and cropped to 231 × 231 pixels, after which a representative vector is extracted by feed-forward propagation through the network, omitting the fully connected layers; this results in a vector of size 4096 for each image. We thus assume that the numerous filters in the convolutional layers extract high-level and representative features. The diversity between two images is then again estimated based on the Euclidean distance between their descriptors.</p></div>
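The distance step above can be sketched as follows; the network forward pass itself is not reproduced here, and the short made-up vectors stand in for the 4096-dimensional OverFeat descriptors:

```python
# Sketch of the Section 2.4.2 diversity step: Euclidean distance
# between per-image deep feature vectors. The low-dimensional
# vectors here are illustrative stand-ins for 4096-dim descriptors.
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise_diversity(features):
    """Full distance matrix over a list of feature vectors,
    as consumed by the hierarchical clustering step."""
    n = len(features)
    return [[euclidean(features[i], features[j]) for j in range(n)]
            for i in range(n)]
```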
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">EXPERIMENTS</head><p>Table <ref type="table" target="#tab_0">1</ref> lists the results of the original Flickr ranking, together with the results of all our algorithms, on the development set. Table <ref type="table" target="#tab_1">2</ref> shows the results on the test set. Run 5 clearly outperforms the other approaches in terms of F1-measure, reaching an F1-score of 57.16% on the development set and 54.55% on the test set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">CONCLUSIONS</head><p>We observe that run 5, which uses distributed word representations for the relevance estimation and OverFeat features for the diversity assessment, outperforms all other runs. In particular, the use of advanced image features positively influences the F1-score. For future work, the influence of more focused distributed word representations will be investigated.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Results on development set.</figDesc><table><row><cell></cell><cell>Flickr</cell><cell>Run 1</cell><cell>Run 2</cell><cell>Run 3</cell><cell>Run 5</cell></row><row><cell>P@20</cell><cell>0.8333</cell><cell>0.7083</cell><cell>0.7500</cell><cell>0.7700</cell><cell>0.8567</cell></row><row><cell>CR@20</cell><cell>0.3455</cell><cell>0.3967</cell><cell>0.4441</cell><cell>0.4043</cell><cell>0.4289</cell></row><row><cell>F1@20</cell><cell>0.4885</cell><cell>0.5086</cell><cell>0.5579</cell><cell>0.5302</cell><cell>0.5716</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Results on test set.</figDesc><table><row><cell></cell><cell>Run 1</cell><cell>Run 2</cell><cell>Run 3</cell><cell>Run 5</cell></row><row><cell>P@20</cell><cell>0.6232</cell><cell>0.7480</cell><cell>0.7557</cell><cell>0.8008</cell></row><row><cell>CR@20</cell><cell>0.3600</cell><cell>0.4279</cell><cell>0.4035</cell><cell>0.4252</cell></row><row><cell>F1@20</cell><cell>0.4503</cell><cell>0.5369</cell><cell>0.5181</cell><cell>0.5455</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://dbpedia.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://code.google.com/p/word2vec/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://cilvr.nyu.edu/doku.php?id=code:start</note>
		</body>
		<back>
			<div type="references">

				<listBibl>


<biblStruct xml:id="b1">
	<analytic>
	<title level="a" type="main">Retrieving Diverse Social Images at MediaEval 2014: Challenge, Dataset and Evaluation</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Popescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lupu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Ginsca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the MediaEval 2014 Workshop</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Distributed representations of words and phrases and their compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<publisher>NIPS</publisher>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms</title>
		<author>
			<persName><forename type="first">S</forename><surname>Salvador</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Chan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Tools with Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2004-11">Nov 2004</date>
			<biblScope unit="page" from="576" to="584" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Overfeat: Integrated recognition, localization and detection using convolutional networks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Sermanet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Eigen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mathieu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013">2013</date>
			<publisher>CoRR</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Ghent University-iMinds at MediaEval 2013 Diverse Images: Relevance-Based Hierarchical Clustering</title>
		<author>
			<persName><forename type="first">B</forename><surname>Vandersmissen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tomar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Godin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>De Neve</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Van De Walle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the MediaEval 2013 Workshop</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>CEUR-WS</publisher>
			<biblScope unit="volume">1043</biblScope>
			<date type="published" when="2013">October 18-19, 2013</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
