<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Photo Set Refinement and Tag Segmentation in Georeferencing Flickr Photos</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Jiewei</forename><surname>Cao</surname></persName>
							<email>jonbakerfish@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">South China University of Technology</orgName>
								<address>
									<settlement>Guangzhou</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Photo Set Refinement and Tag Segmentation in Georeferencing Flickr Photos</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">2C00E2D53C59464E03B3650DFA3DC517</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-19T17:58+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we describe our approach to the MediaEval 2013 Placing Task. We use a language model and similarity search as the baseline approach, and improve its accuracy with two techniques: photo set refinement and tag segmentation. The first technique takes advantage of the geo-location correlation among test photos, and the second exploits the textual similarity between tags.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>The MediaEval 2013 Placing Task requires participants to assign geographical coordinates (latitude and longitude) to each provided test image; we refer to <ref type="bibr" target="#b1">[1]</ref> for a detailed description. We use the framework proposed in <ref type="bibr" target="#b2">[2]</ref> as our baseline approach. The main contributions of this paper are two techniques that improve the accuracy of georeferencing. First, Flickr users can organize their photos by assigning them to sets and collections <ref type="foot" target="#foot_0">1</ref>. Intuitively, photos in the same set are highly correlated, and we can exploit these relations when estimating the geo-location of given images. The outcome of our submitted runs supports this assumption. Second, when only the training data provided by the task organizers can be used, unseen tags (tags that exist only in the test data) are useless for georeferencing. We try to exploit these tags by applying tag segmentation, which is similar to the word segmentation pre-processing used for languages written without spaces between words, such as Chinese. Both proposed techniques can be applied to other existing systems with few changes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">METHODOLOGY</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Data Pre-processing</head><p>A total of 8,539,050 geo-referenced photos from Flickr were provided as training data. Following <ref type="bibr" target="#b2">[2]</ref>, we carried out two preliminary filtering steps on this training set. First, photos without tags were removed. Second, we removed duplicate photos in a slightly different way: photos that are uploaded by the same user, have an identical tag set, and lie within a Haversine distance threshold of each other are treated as duplicates, and only one instance is retained. We use a distance threshold instead of requiring identical latitude and longitude in order to relax the filtering condition, so that more or fewer duplicates are removed depending on the threshold selected. A smaller threshold means that more photos with an identical tag set but different locations are retained, and requiring identical geo-location is the special case of a zero threshold. With our chosen threshold, this resulted in a pre-processed training set of 4,538,784 photos. There are five different test sets, and we chose test3, which contains 53,000 photos. We did not use any external resources for georeferencing except in run 5, in which we geocoded the home location of users in the test set using the Google Geocoding API<ref type="foot" target="#foot_1">2</ref>.</p></div>
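<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two filtering steps of the data pre-processing can be sketched as follows. This is a minimal illustration, not the code used for the submitted runs: the record fields (user, tags, lat, lon), the function names and the parameter max_km are hypothetical, and a greedy policy (keep the first photo seen, drop later ones that fall within the threshold of a kept photo) is assumed. The comparison uses "less than or equal" so that a threshold of zero reduces to the identical-coordinates case.</p>
<p><code lang="python"><![CDATA[
```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def filter_training(photos, max_km):
    """Drop untagged photos, then drop near-duplicates.

    Two photos are duplicates when they share the uploader and the tag set
    and lie within max_km of each other (max_km is a hypothetical parameter).
    """
    kept_by_key = defaultdict(list)  # (user, tag set) -> photos already kept
    kept = []
    for p in photos:
        if not p["tags"]:
            continue  # step 1: remove photos without tags
        key = (p["user"], frozenset(p["tags"]))
        is_duplicate = any(
            haversine_km(p["lat"], p["lon"], q["lat"], q["lon"]) <= max_km
            for q in kept_by_key[key]
        )
        if is_duplicate:
            continue  # step 2: keep only one instance per duplicate group
        kept_by_key[key].append(p)
        kept.append(p)
    return kept
```
]]></code></p>
<p>Grouping the retained photos by user and tag set keeps the distance comparisons local to each candidate duplicate group, so the filter does not have to compare every pair of training photos.</p></div>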
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Baseline Approach</head><p>The framework proposed in <ref type="bibr" target="#b2">[2]</ref> applies a two-step approach to estimate the location of test photos. First, the locations of the training data are clustered into 500, 2,500 and 10,000 clusters, giving three clusterings of increasing granularity. Given a clustering, a Naïve Bayes classifier is used to find the cluster most likely to contain the location of a given test photo. Second, within the cluster found, a similarity search is used to find the training items whose tags are closest to those of the test photo. In <ref type="bibr" target="#b3">[3]</ref>, an improved, spatially aware feature ranking method based on Ripley's K statistic was proposed. We therefore use this framework with Ripley's K feature selection as our baseline approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Photo Set Refinement</head><p>Photos within the same set or collection tend to be highly correlated in geo-location. For example, a user may upload the photos taken during a trip into a newly created set. However, not every photo in a set is well tagged, because users tend to tag only the photos they like or are interested in, leaving the others untagged or poorly tagged. As a result, photos with completely different tag sets or visual content may still have been taken at the same or a nearby location if they belong to the same photo set.</p><p>A test photo with poor tags will result in a bad estimate. However, if this photo belongs to a photo set that contains one or more photos with well estimated locations (usually well tagged photos), we can use the centroid of those locations as the estimate for the badly estimated photo. This is the intuition behind our proposed photo set refinement, and it raises two problems: (1) given a photo, how do we find its neighbors within the same photo set, and (2) how do we distinguish well estimated photos from badly estimated ones? Although we did not address the Placeability sub-task of the Placing Task at MediaEval 2013, our solution to the second problem may be considered a naive approach to error estimation.</p><p>To handle the first problem, it may seem that we can simply break the test data down into sets according to the original photo sets created by users. However, such a photo set can change over time as the user adds new photos or deletes old ones, and the geo-location correlation between its photos then becomes weaker. We therefore use a different approach: given a photo, we find its neighbors in the test data by comparing user ids and the timestamps at which the photos were taken and uploaded. If a photo has the same user id as the given photo, the interval between their taken dates is less than a threshold, and the interval between their upload dates is less than a second threshold, then we consider the two photos to belong to the same photo set. Both thresholds are set to 7 days, because we consider a week-long vacation to be common for most people, and photos taken and uploaded within such a period can be considered a photo set.</p><p>For the second problem, recall that there are three clusterings of the training data, so a given test photo is classified to three medoids, one per clustering. Intuitively, these three medoids are not far from each other if the photo is well estimated, and vice versa. So, for each photo in a photo set, we consider the photo well estimated if all the Haversine distances among its three medoids are less than 1000 km; otherwise it is marked as badly estimated. Finally, we use the centroid location of the well estimated photos as the final estimate for the badly estimated ones. If no well estimated photo is found, we use the home location of the user (in run 5 only) or simply leave the estimate unchanged.</p></div>
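<div xmlns="http://www.tei-c.org/ns/1.0"><p>The refinement procedure of this section can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the submitted implementation: the record fields (user, taken, uploaded, medoids, estimate), the function names and the optional home mapping are hypothetical, and the centroid is assumed to be the arithmetic mean of latitudes and longitudes, a simplification that is not specified in the text and that breaks down near the antimeridian.</p>
<p><code lang="python"><![CDATA[
```python
import math
from datetime import datetime, timedelta
from itertools import combinations


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def is_well_estimated(photo, max_km=1000.0):
    """True when the medoids from the three clusterings are pairwise within max_km."""
    return all(haversine_km(a, b) < max_km for a, b in combinations(photo["medoids"], 2))


def same_set(p, q, window=timedelta(days=7)):
    """Same user, and both taken dates and upload dates less than `window` apart."""
    return (p["user"] == q["user"]
            and abs(p["taken"] - q["taken"]) < window
            and abs(p["uploaded"] - q["uploaded"]) < window)


def refine(photos, home=None):
    """Return one refined (lat, lon) estimate per photo, in input order."""
    good = [is_well_estimated(p) for p in photos]
    refined = []
    for i, p in enumerate(photos):
        if good[i]:
            refined.append(p["estimate"])  # well estimated: keep as is
            continue
        anchors = [q["estimate"] for j, q in enumerate(photos)
                   if j != i and good[j] and same_set(p, q)]
        if anchors:
            # Assumed centroid: plain mean of the neighbours' coordinates.
            lat = sum(a[0] for a in anchors) / len(anchors)
            lon = sum(a[1] for a in anchors) / len(anchors)
            refined.append((lat, lon))
        elif home and p["user"] in home:
            refined.append(home[p["user"]])  # fallback used in run 5 only
        else:
            refined.append(p["estimate"])  # no evidence: leave unchanged
    return refined
```
]]></code></p>
<p>The pairwise scan over all test photos is quadratic and is kept only for readability; grouping the photos by user id first would restrict the comparisons to each user's own photos.</p></div>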
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Tag Segmentation</head><p>Consider a first tag 'southchinauniversityoftechnology' and a second tag 'southchinauniversityoftechnologylibrary'. If the second tag is unseen, it is ignored, even though we can assume that the two tags are correlated because of their textual similarity. However, we can split the second tag into two terms, 'southchinauniversityoftechnology' and 'library'; the first term is then identical to the first tag and can be used for georeferencing. Our approach to tag segmentation is to model the distribution of the segmentation output. First, we assume that all tags are independently distributed, and we calculate the relative frequency of every tag in the training data. This gives a tag dictionary of 2,080,618 tags sorted by descending frequency. We also assume that the tags in the training data follow Zipf's law <ref type="bibr" target="#b4">[4]</ref>, which means that the tag with rank r has probability approximately 1/(r log N), where N is the number of tags in the dictionary. We then use dynamic programming to infer the positions of the cut points. The most likely segmentation is the one that maximizes the product of the probabilities of the individual split terms. Instead of using the tag probability directly, we use a cost defined as the logarithm of the inverse of the probability, which avoids numerical underflow.</p><p>Given a test photo, all of its tags are pre-processed by tag segmentation before georeferencing. For each tag, we select its longest split term and assign it to the photo as a new tag. The remaining terms (such as 'library') are discarded because they are usually not spatially relevant.</p></div>
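<div xmlns="http://www.tei-c.org/ns/1.0"><p>The dynamic programming step of tag segmentation can be sketched as follows. The sketch assumes the common Zipf approximation in which the tag of rank r in a dictionary of N tags has probability of about 1/(r log N), so that its cost is log(r log N); the function names and the toy dictionary in the usage note are hypothetical, and the dictionary should hold at least three tags so that every cost is positive.</p>
<p><code lang="python"><![CDATA[
```python
import math


def build_costs(ranked_tags):
    """Map each tag to cost = -log p, with p ~ 1 / (rank * log N) (assumed Zipf form).

    ranked_tags must be sorted by descending training frequency.
    """
    n = len(ranked_tags)
    return {tag: math.log((rank + 1) * math.log(n))
            for rank, tag in enumerate(ranked_tags)}


def segment(tag, costs):
    """Return the minimum-cost split of `tag` into dictionary terms, or None."""
    max_len = max(len(t) for t in costs)
    # best[i] = (total cost, start index of the last term) for the prefix tag[:i]
    best = [(0.0, 0)] + [(math.inf, 0)] * len(tag)
    for i in range(1, len(tag) + 1):
        for j in range(max(0, i - max_len), i):
            term = tag[j:i]
            if term in costs and best[j][0] + costs[term] < best[i][0]:
                best[i] = (best[j][0] + costs[term], j)
    if math.isinf(best[-1][0]):
        return None  # the tag cannot be covered by dictionary terms
    terms, i = [], len(tag)
    while i > 0:  # walk the stored cut points backwards
        j = best[i][1]
        terms.append(tag[j:i])
        i = j
    return terms[::-1]


def longest_term(tag, costs):
    """Longest split term of `tag`; the tag itself when it is not segmentable."""
    terms = segment(tag, costs)
    return max(terms, key=len) if terms else tag
```
]]></code></p>
<p>With a toy dictionary ranked as 'photo', 'library', 'southchinauniversityoftechnology', 'travel', the unseen tag 'southchinauniversityoftechnologylibrary' is split into 'southchinauniversityoftechnology' and 'library', and the longer term is the one kept as the new tag. Minimizing the summed costs is equivalent to maximizing the product of the term probabilities.</p></div>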
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">RESULTS AND DISCUSSION</head><p>We submitted five runs, and the results of our experiments are shown in Table <ref type="table" target="#tab_0">1</ref>.</p><p>run1: the baseline approach.</p><p>run2: uses visual features only, with a K-nearest neighbor search.</p><p>run3: corrects the poorly estimated photos of run1 using the photo set refinement proposed in Section 2.3.</p><p>run4: similar to run3, but tag segmentation is used to pre-process the test data before georeferencing.</p><p>run5: uses the user home location in the photo set refinement step.</p><p>Note that the home location is also used when estimating the prior probability in the language model framework; we refer to <ref type="bibr" target="#b2">[2]</ref> for more details.</p><p>The result of run3 supports our assumption that estimating test photos jointly improves accuracy. In our experiment, the estimates of 4,963 photos differ between run1 and run3; this is the number of photos changed during the photo set refinement step. Comparing the georeferencing results of run1 and run3 with the ground truth, we found that for 4,390 of these 4,963 photos the estimated location in run3 was closer to the real location than in run1, while the remaining 573 photos had a larger error distance in run3 than in run1. The latter is mainly caused by errors in distinguishing well estimated photos from badly estimated ones: for some well estimated photos, the three medoids can still be far from each other. We therefore need a much more robust method of error estimation.</p><p>Run4 does not show a substantial improvement over run3, because unseen tags are not always segmentable. Nevertheless, the proposed technique did improve performance slightly, and its extra time and computational costs are low. Beyond tag segmentation, which only exploits the textual similarity between unseen tags and training tags, we could also try to find the semantic similarity between them by using external resources or machine learning techniques.</p><p>Run5 indicates that the home location of the user is very important for georeferencing most photos, which is consistent with previous research findings. In run2, we simply used the extracted visual features provided by the task organizers and ran a K-nearest neighbor search to find the most similar photo in the training set. However, we did not obtain a reasonable geo-location prediction, and more intensive study is needed in future work.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1: Percentage of correctly detected locations within each error distance, and median error (ME) of each run in kilometers.</head><label>1</label><figDesc></figDesc><table><row><cell></cell><cell>1 km</cell><cell>10 km</cell><cell>100 km</cell><cell>500 km</cell><cell>1000 km</cell><cell>ME (km)</cell></row><row><cell>run1</cell><cell>20.7</cell><cell>43.0</cell><cell>55.3</cell><cell>62.8</cell><cell>66.3</cell><cell>37.65831</cell></row><row><cell>run2</cell><cell>0.0</cell><cell>0.0</cell><cell>0.0</cell><cell>0.1</cell><cell>0.6</cell><cell>10026.17</cell></row><row><cell>run3</cell><cell>21.1</cell><cell>44.2</cell><cell>57.1</cell><cell>65.2</cell><cell>69.2</cell><cell>28.01581</cell></row><row><cell>run4</cell><cell>21.2</cell><cell>44.2</cell><cell>57.5</cell><cell>65.5</cell><cell>69.6</cell><cell>27.0791</cell></row><row><cell>run5</cell><cell>20.9</cell><cell>46.1</cell><cell>61.7</cell><cell>71.8</cell><cell>76.5</cell><cell>16.73021</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.flickr.com/help/collections/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://developers.google.com/maps/documentation/geocoding/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Working Notes for the Placing Task at MediaEval 2013</title>
		<author>
			<persName><forename type="first">C</forename><surname>Hauff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thomee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Trevisiol</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MediaEval 2013 Workshop</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013-10-19">18-19 October 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Georeferencing Flickr resources based on textual meta-data</title>
		<author>
			<persName><forename type="first">O</forename><surname>Van Laere</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Schockaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dhoedt</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.ins.2013.02.045</idno>
		<ptr target="http://dx.doi.org/10.1016/j.ins.2013.02.045" />
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Spatially-Aware Term Selection for Geotagging</title>
		<author>
			<persName><forename type="first">O</forename><surname>Van Laere</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Quinn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Schockaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dhoedt</surname></persName>
		</author>
		<idno type="DOI">10.1109/TKDE.2013.42</idno>
		<ptr target="http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.42" />
	</analytic>
	<monogr>
		<title level="j">IEEE TKDE</title>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Human Behaviour and the Principle of Least-Effort</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Zipf</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1949">1949</date>
			<publisher>Addison-Wesley</publisher>
			<pubPlace>Cambridge MA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
