<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Bilkent-RETINA at Retrieving Diverse Social Images Task of MediaEval 2014</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mustafa</forename><forename type="middle">Ilker</forename><surname>Sarac</surname></persName>
							<email>mustafa.sarac@cs.bilkent.edu.tr</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Engineering</orgName>
								<orgName type="institution">Bilkent University</orgName>
								<address>
									<postCode>06800</postCode>
									<settlement>Ankara</settlement>
									<country key="TR">Turkey</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pinar</forename><surname>Duygulu</surname></persName>
							<email>duygulu@cs.bilkent.edu.tr</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Computer Engineering</orgName>
								<orgName type="institution">Bilkent University</orgName>
								<address>
									<postCode>06800</postCode>
									<settlement>Ankara</settlement>
									<country key="TR">Turkey</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Bilkent-RETINA at Retrieving Diverse Social Images Task of MediaEval 2014</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">CD39AF4460E3622B8484BE3DE3FFEF4D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T16:10+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes the approach proposed by the Bilkent-RETINA team for the Retrieving Diverse Social Images task of MediaEval 2014 <ref type="bibr" target="#b1">[1]</ref>. We develop a framework that first removes outliers using one-class support vector machines (SVM) to improve relevance. It then clusters the remaining set and retrieves the images closest to the cluster centroids to diversify the results. We exploit only visual features in our experiments. For the first run we used the provided visual features, and for the second run we used well-known visual features such as SIFT <ref type="bibr" target="#b2">[2]</ref> and GIST <ref type="bibr" target="#b4">[4]</ref>.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>In today's world, image sharing applications are used extensively. Facebook users upload 350 million photos<ref type="foot" target="#foot_0">1</ref> each day, which is said to equal the total number of photos taken during the 19th century<ref type="foot" target="#foot_1">2</ref>. Given such a large number of images, search engines are more important than ever for producing good-quality search results. In this task, quality is measured in terms of relevance and diversity.</p><p>Participants were provided with a development dataset (devset) of 30 locations and a testing dataset (testset) of 123 locations <ref type="bibr" target="#b1">[1]</ref>. Each location has up to 300 photos retrieved from Flickr using text information. In the following, we present a framework that first removes outlier images and then applies k-means clustering to obtain diversified results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">PROPOSED APPROACH</head><p>Our method can be summarized in 4 steps, as shown in Figure <ref type="figure" target="#fig_0">1</ref>:</p><p>Step 1: Feature extraction. In this step we compute visual features for each image of each location. Some of the features are provided by the task organizers and 2 of them are extracted by our team.</p><p>Step 2: Outlier removal. To increase the proportion of relevant images for each location in the dataset, we apply an outlier removal procedure.</p><p>Step 3: Clustering. After the outlier removal step, we apply k-means clustering to the remaining images at each location to increase the diversity score.</p><p>Step 4: Retrieval. In the retrieval step we select the cluster centroids obtained in the previous step. Each centroid should represent a different aspect of a given location, so this selection is intended to give well-diversified results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">VISUAL FEATURES</head><p>The task organizers provide 6 visual descriptors (CM, CN, CSD, GLRLM, HOG, LBP), 4 of which also have a spatial pyramid representation (CM, CN, GLRLM and LBP). We searched for the best combination of these features using the provided devset images and found that the best results are obtained when all of them are combined. We therefore concatenate all 10 visual descriptors into a 945-dimensional feature vector for each image (descvis). We then normalize each feature vector to zero mean and unit variance.</p><p>We also extracted additional visual features: GIST and bag-of-visual-words (BOVW) representations built from dense SIFT features <ref type="bibr" target="#b2">[2,</ref><ref type="bibr" target="#b4">4]</ref>. We use these extra features for the fifth run of the challenge. GIST features are 512-dimensional global features that are useful for capturing the scene information in images. Capturing and differentiating scene information is important for boosting the diversity of the results.</p><p>To compute dense SIFT descriptors we use VLFeat's standard feature extractor tool <ref type="bibr" target="#b5">[5]</ref>. We first resize each image to a fixed size of 200 by 200 pixels and then obtain 128 by 5776 dimensional SIFT features per image (5776 descriptors of 128 dimensions each). To create a pool of descriptors we randomly sample 100 descriptors from each image and then apply the k-means algorithm with the 'plusplus' option. We try 3 different k values (600, 800 and 1000). Based on performance on the devset, we choose k = 1000, which becomes the size of our visual word dictionary. Using this dictionary, we quantize each image into a 1000-dimensional feature vector.</p></div>
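<div xmlns="http://www.tei-c.org/ns/1.0"><p>The quantization step described above can be sketched as follows. This is a minimal pure-Python illustration, not the code used for the runs: the function names are hypothetical, the codebook is assumed to have been learned beforehand by k-means, and the L1 normalization of the histogram is an assumption that the text does not specify.</p>
<p><code lang="python"><![CDATA[
```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the visual word (codebook entry) closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(descriptor, codebook[i]))


def bovw_histogram(descriptors: Sequence[Vector],
                   codebook: Sequence[Vector]) -> List[float]:
    """Quantize local descriptors into a histogram over visual words.

    The histogram has one bin per codebook entry (1000 bins for a
    1000-word dictionary). L1 normalization is an assumption.
    """
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```
]]></code></p>
<p>With a toy two-word codebook [[0, 0], [10, 10]] and the descriptors [[1, 0], [9, 9], [0, 1], [11, 10]], the function assigns two descriptors to each word and returns [0.5, 0.5].</p></div>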
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">OUTLIER REMOVAL</head><p>We use an SVM to find outliers and construct, for each location, a subset of images that is more relevant than the initial set. Our method is similar to <ref type="bibr" target="#b3">[3]</ref>, but we use a fixed set of negative examples for each of the devset and testset, selected as follows. For devset images we pick 2 random images from each of the 30 locations; for testset images we select 60 random images from the 123 locations, taking at most 1 image from each location. Then for each location, in a manner similar to cross-validation, we select 60 random positive images, train a one-class SVM and classify with it, and repeat this procedure 10 times. Finally we select the model with the highest accuracy, assuming that it provides the best separation. We repeat this process for each location, using the same negative examples each time but different positive examples. We use a quadratic kernel for the SVM because our features are dense vectors and are not easily separable by linear kernel functions. On the devset we observed that the outlier removal process discards some of the irrelevant images and yields a higher relevance score for each location.</p></div>
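<div xmlns="http://www.tei-c.org/ns/1.0"><p>The repeated sample-train-score protocol of this section can be sketched as follows. To keep the sketch dependency-free, the one-class SVM with a quadratic kernel is replaced by a much simpler stand-in, a centroid-and-radius model; this stand-in is not the classifier used for the runs. All names, the 90th-percentile radius, and the choice to measure accuracy on the sampled positives plus the fixed negatives are assumptions made for illustration.</p>
<p><code lang="python"><![CDATA[
```python
import random
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


class CentroidOutlierModel:
    """Stand-in for the one-class SVM (assumption): a point is an inlier
    if it lies within a radius of the centroid of the training positives."""

    def __init__(self, positives: Sequence[Vector]) -> None:
        dim = len(positives[0])
        n = len(positives)
        self.centroid = [sum(p[i] for p in positives) / n for i in range(dim)]
        dists = sorted(self._dist(p) for p in positives)
        # Radius: roughly the 90th percentile of training distances (assumption).
        self.radius = dists[min(n - 1, int(0.9 * n))]

    def _dist(self, v: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(v, self.centroid)) ** 0.5

    def is_inlier(self, v: Vector) -> bool:
        return self._dist(v) <= self.radius


def accuracy(model: CentroidOutlierModel, positives: Sequence[Vector],
             negatives: Sequence[Vector]) -> float:
    """Fraction of positives accepted plus negatives rejected."""
    correct = sum(model.is_inlier(p) for p in positives)
    correct += sum(not model.is_inlier(n) for n in negatives)
    return correct / (len(positives) + len(negatives))


def select_best_model(location_images: Sequence[Vector],
                      negatives: Sequence[Vector],
                      n_positive: int = 60, n_rounds: int = 10,
                      seed: int = 0) -> Tuple[CentroidOutlierModel, float]:
    """Repeat: sample positives, train, score; keep the most accurate model."""
    rng = random.Random(seed)
    best_model: Optional[CentroidOutlierModel] = None
    best_acc = -1.0
    for _ in range(n_rounds):
        k = min(n_positive, len(location_images))
        sample = rng.sample(list(location_images), k)
        model = CentroidOutlierModel(sample)
        acc = accuracy(model, sample, negatives)
        if acc > best_acc:
            best_model, best_acc = model, acc
    assert best_model is not None
    return best_model, best_acc


def remove_outliers(location_images: Sequence[Vector],
                    negatives: Sequence[Vector],
                    n_positive: int = 60, n_rounds: int = 10,
                    seed: int = 0) -> List[Vector]:
    """Keep only the images that the selected model labels as inliers."""
    model, _ = select_best_model(location_images, negatives,
                                 n_positive, n_rounds, seed)
    return [img for img in location_images if model.is_inlier(img)]
```
]]></code></p>
<p>The defaults of 60 positives and 10 rounds mirror the numbers given in the text; the same negatives list would be passed for every location of a dataset.</p></div>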
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">CLUSTERING AND RETRIEVAL</head><p>After outliers are removed we cluster the images of each location using the k-means algorithm. On the devset we try 2 different values of K. First we set K to 25, because we observed that each location has at most 25 subclasses in its diversity subgroups. Second we set K to 50, because that is the maximum number of images required to be retrieved. The latter setting, over-clustering, worked better on the devset, so we report our testset results using K = 50.</p><p>After computing the cluster centroids, we retrieve the images closest to them. We apply the k-nearest-neighbor method with Euclidean distance and search for the nearest neighbor of each centroid. While computing nearest neighbors we make sure that a unique neighbor is retrieved for each cluster centroid.</p><p>Results on the devset are shown in Table <ref type="table" target="#tab_0">1</ref>. SIFT-BOVW <ref type="bibr" target="#b2">[2]</ref> features work better than the default features. The reason is that local descriptors generally capture similarities among images better, so that each cluster becomes more coherent. GIST <ref type="bibr" target="#b4">[4]</ref> features also perform better than the default features and perform similarly to SIFT-BOVW features. Results of our 2 submissions, Run#1 and Run#5, are given in Table <ref type="table" target="#tab_1">2</ref>. As on the devset, using SIFT-BOVW in Run#5 gives better results than Run#1. </p></div>
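<div xmlns="http://www.tei-c.org/ns/1.0"><p>The retrieval of a unique nearest image per centroid can be sketched as follows. This is a minimal illustration under stated assumptions: the centroids are taken as already computed by k-means, the function names are hypothetical, and uniqueness is enforced greedily in centroid order, which is only one possible way to do it and is not specified in the text.</p>
<p><code lang="python"><![CDATA[
```python
from typing import List, Sequence, Set

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance; it ranks neighbors like the true distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve_unique_nearest(images: Sequence[Vector],
                            centroids: Sequence[Vector]) -> List[int]:
    """Return one image index per centroid, never repeating an image.

    Centroids are processed in the given order (assumption); each takes
    the closest image that has not been selected by an earlier centroid.
    """
    chosen: List[int] = []
    used: Set[int] = set()
    for c in centroids:
        candidates = [i for i in range(len(images)) if i not in used]
        if not candidates:
            break  # fewer images than centroids: nothing left to retrieve
        best = min(candidates, key=lambda i: squared_distance(images[i], c))
        used.add(best)
        chosen.append(best)
    return chosen
```
]]></code></p>
<p>For the images [[0, 0], [1, 1], [10, 10], [0.2, 0.1]] and the centroids [[0, 0], [0.1, 0.1], [9, 9]], the function returns the indices [0, 3, 2]: the second centroid is also close to image 0, but that image is already taken, so it receives image 3 instead.</p></div>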
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">CONCLUSIONS</head><p>We showed that it is possible to obtain competitive results using only visual features. Our framework first eliminates outliers and then uses clustering to bring diversity to the retrieval results. However, the scores could likely be improved by incorporating more information into our framework, such as textual features and credibility scores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">ACKNOWLEDGMENTS</head><p>This research was supported by the MUCKE project, funded within the FP7 CHIST-ERA scheme, and by the Scientific and Technical Research Council of Turkey (TUBITAK) under grant number 112E174.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Overall framework structure. When the images related to a specific location are given as input, our framework produces diversified results for that location.</figDesc><graphic coords="1,330.57,240.11,211.60,158.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Results on devset using provided features, GIST and SIFT-BOVW.</figDesc><table><row><cell>Feat. name</cell><cell>P@20</cell><cell>CR@20</cell><cell>F1@20</cell></row><row><cell>descvis</cell><cell>0.7139</cell><cell>0.3813</cell><cell>0.4863</cell></row><row><cell>GIST</cell><cell>0.7209</cell><cell>0.3798</cell><cell>0.5037</cell></row><row><cell>SIFT-BOVW</cell><cell>0.7167</cell><cell>0.3933</cell><cell>0.5013</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Official results on testset.</figDesc><table><row><cell>Run#</cell><cell>P@20</cell><cell>CR@20</cell><cell>F1@20</cell></row><row><cell>1</cell><cell>0.6809</cell><cell>0.375</cell><cell>0.4758</cell></row><row><cell>5</cell><cell>0.7228</cell><cell>0.387</cell><cell>0.4966</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.businessinsider.com/facebook-350-millionphotos-each-day-2013-9</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://blog.1000memories.com/94-number-of-photos-evertaken-digital-and-analog-in-shoebox</note>
		</body>
		<back>
			<div type="references">

				<listBibl>


<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Retrieving diverse social images at mediaeval 2014: Challenge, dataset and evaluation</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Popescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lupu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Gînscȃ</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MediaEval 2014 Workshop</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">October 16-17, 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Object recognition from local scale-invariant features</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">G</forename><surname>Lowe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="1999">1999</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="1150" to="1157" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Using one-class svm outliers detection for verification of collaboratively tagged image training sets</title>
		<author>
			<persName><forename type="first">H</forename><surname>Lukashevich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nowak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dunker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Multimedia and Expo, 2009. ICME 2009. IEEE International Conference on</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="682" to="685" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Building the gist of a scene: The role of global image features in recognition</title>
		<author>
			<persName><forename type="first">A</forename><surname>Oliva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Torralba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Progress in brain research</title>
		<imprint>
			<biblScope unit="volume">155</biblScope>
			<biblScope unit="page" from="23" to="36" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">VLFeat: An open and portable library of computer vision algorithms</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vedaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Fulkerson</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
