<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Ranking Based Clustering for Social Event Detection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Taufik</forename><surname>Sutanto</surname></persName>
							<email>taufik.sutanto@qut.edu.au</email>
							<affiliation key="aff0">
								<orgName type="institution">Queensland University of Technology</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Richi</forename><surname>Nayak</surname></persName>
							<email>r.nayak@qut.edu.au</email>
							<affiliation key="aff1">
								<orgName type="institution">Queensland University of Technology</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Ranking Based Clustering for Social Event Detection</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">E6C48CFA7223AD302E1BC9FE1254B6CC</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T16:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Clustering a large document collection is challenged not only by the number of documents and the number of dimensions, but also by the number and sizes of the clusters. Traditional clustering methods fail to scale when they need to generate a large number of clusters. Furthermore, when the cluster sizes in the solution are heterogeneous, i.e., some of the clusters are large, the similarity measures tend to degrade. A ranking based clustering method is proposed to deal with these issues in the context of the Social Event Detection task. Ranking scores are used to select a small number of the most relevant clusters against which a document is compared and placed. Additionally, instead of conventional cluster centroids, cluster patches, which are hub-like sets of documents, are proposed to represent clusters. Text, temporal, spatial and visual content information collected from the social event images is used in calculating similarity. Results show that these strategies allow the clustering method to balance performance against the accuracy of the clustering solution.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>The Social Event Detection (SED) task at the 2014 MediaEval Benchmark for Multimedia Evaluation consists of two subtasks: (1) image clustering based on a given set of events; and (2) retrieval of social events based on predefined queries <ref type="bibr" target="#b4">[4]</ref>. The SED task poses challenges to clustering analysis due to the real-world nature of the data, such as the large number of dimensions, large data size, multi-domain feature types, and the need to group data into a large and unfixed number of clusters. This paper focuses on proposing a solution to the first subtask, i.e., semi-supervised clustering of social event images based on their metadata and visual content.</p><p>Search engine technologies, e.g., Sphinx, Lucene, or Solr, have been successfully used to process large document collections for information retrieval. Using the concept of ranking scores from search engines, coupled with prior knowledge from the learning data, in semi-supervised clustering has been shown to be an effective and efficient approach to clustering text data <ref type="bibr" target="#b5">[5,</ref><ref type="bibr" target="#b6">6]</ref>. This type of approach works well when the collection size or the number of clusters required is small. However, calculating ranking scores for a large number of documents is known to be computationally expensive, and a large cluster makes the similarity measure between documents ambiguous.</p><p>Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.</p><p>Semi-supervised clustering methods have been shown to produce better results than their traditional unsupervised counterparts <ref type="bibr" target="#b2">[2]</ref>. 
In the 2013 SED task, we proposed and used a scalable ranking based semi-supervised clustering approach that produces accurate clusters <ref type="bibr" target="#b5">[5]</ref>. However, this method suffers from a high communication cost for long documents. To deal with this issue, we use the document frequency distribution and, if needed, exclude the most frequently occurring terms from the query document (i.e., the document to be clustered).</p><p>The use of hubs has been explored and shown to be effective in dealing with high-dimensional data and large clusters <ref type="bibr" target="#b3">[3]</ref>. However, the k-NN calculation of hubs demands a considerable amount of extra computation, which makes it unsuitable for large-scale data clustering. In this paper, documents are assigned to clusters based on their distances to cluster patches. These patches are calculated based on the ranking scores from the queries. Document frequencies are used to select a subset of terms from documents to create the queries. The patches, rather than the cluster centroids, serve as the data representatives against which the distance of a document is measured. The use of patches is expected to enable the clustering method to capture more specific sub-topics within a cluster.</p><p>In this paper, we present a method based on cluster patches to calculate the distance between a document and the groups of documents inside a cluster (Figure <ref type="figure" target="#fig_0">1</ref>). Instead of a single centroid, patches are proposed to represent a large high-dimensional cluster in order to control the significance of the similarity measurement.</p></div>
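<div xmlns="http://www.tei-c.org/ns/1.0"><p>The query construction described above, in which the most frequently occurring terms are excluded from a long query document, can be illustrated with a minimal sketch. The function names and the term budget max_terms are hypothetical and do not come from the submitted system; the sketch only assumes that document frequencies are available for the indexed collection.</p><p><![CDATA[
```python
from collections import Counter


def document_frequency(corpus):
    """Count, for every term, the number of documents that contain it."""
    df = Counter()
    for terms in corpus:
        df.update(set(terms))  # each document counts a term at most once
    return df


def build_query(doc_terms, doc_freq, max_terms):
    """Keep at most max_terms distinct terms of the query document.

    max_terms is a hypothetical budget (an assumption of this sketch).
    When the document is too long, the terms with the highest document
    frequency are dropped first.
    """
    distinct = list(dict.fromkeys(doc_terms))  # first-seen order, no repeats
    if len(distinct) <= max_terms:
        return distinct
    # Rarest terms first; the stable sort keeps first-seen order among ties.
    ranked = sorted(distinct, key=lambda t: doc_freq.get(t, 0))
    keep = set(ranked[:max_terms])
    return [t for t in distinct if t in keep]


corpus = [
    ["concert", "barcelona", "music"],
    ["concert", "festival"],
    ["concert", "barcelona", "parade"],
]
df = document_frequency(corpus)
query = build_query(["concert", "barcelona", "music", "concert"], df, max_terms=2)
```
]]></p><p>In this toy collection the term "concert" appears in every document, so it is the first term to be excluded from the query.</p></div>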
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">PREPROCESSING</head><p>All the features of the images were used in the clustering process except their URLs. English stopwords and some symbols (e.g., #,&amp;,@) were filtered out. The title, tag, username, and description attributes were combined into a short document. No external resources were used in the analysis. Document-length-normalized tf-idf was used as the term weighting scheme. The time information was transformed into the interval in days between the date taken and the date uploaded. Spatial information (i.e., latitude and longitude) was used through a modified Haversine formula. The modification changes the range of the measure to a unit value, as in cosine distance. Feature-based super-pixel segmentation was used to extract a compact color and texture representation for small image patches <ref type="bibr" target="#b1">[1]</ref>. This representation has a smaller dimension than the bag-of-visual-words (BOVW) approach.</p></div>
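<div xmlns="http://www.tei-c.org/ns/1.0"><p>The temporal and spatial transformations above can be sketched as follows. The exact modification of the Haversine formula is not specified in the text, so the scaling used here (dividing the great-circle distance by half the Earth's circumference, the largest possible distance) is only an assumption, as are the function names and the mean Earth radius constant.</p><p><![CDATA[
```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # min() guards against rounding that pushes a slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def unit_space_distance(lat1, lon1, lat2, lon2):
    """Haversine distance rescaled to [0, 1] (assumed form of the modification)."""
    return haversine_km(lat1, lon1, lat2, lon2) / (math.pi * EARTH_RADIUS_KM)


def day_interval(date_taken, date_uploaded):
    """Number of days between the date taken and the date uploaded."""
    return (date_uploaded - date_taken).days


same_place = unit_space_distance(41.39, 2.17, 41.39, 2.17)
antipodal = unit_space_distance(0.0, 0.0, 0.0, 180.0)
gap = day_interval(date(2014, 1, 1), date(2014, 1, 11))
```
]]></p><p>With this scaling, two identical locations have distance 0 and two antipodal locations have distance 1, which puts the spatial measure on the same unit range as the cosine distance.</p></div>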
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">THE MODEL</head><p>A set of patches P is calculated in each iteration based on the ranking scores returned for the document query. Instead of comparing a document with a cluster centroid, the document feature vector is compared with all the patches. Each patch is a neighborhood of documents of a certain size (δ), selected within a cluster according to the ranking scores. The optimal distance between the document and these patches is then used to decide the assignment of the document to a cluster. More details of the approach are given in Algorithm 1. In the similarity measure of Algorithm 1, each βi is a weight parameter that combines the effect of the various types of attributes. These parameters can be fine-tuned manually or calculated from the learning data by using variable importance measures from a decision tree model.</p></div>
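<div xmlns="http://www.tei-c.org/ns/1.0"><p>One possible reading of the patch construction is sketched below: the ranked results of a document query are grouped by cluster, and the top δ documents of each cluster form its patch. This is an illustrative assumption rather than the exact procedure of the submitted runs; the function name, the input layout, and the rule of one patch per cluster are all hypothetical.</p><p><![CDATA[
```python
from collections import defaultdict


def build_patches(ranked_hits, labels, delta):
    """Group ranked hits by cluster and keep the top delta per cluster.

    ranked_hits: list of (doc_id, score) pairs sorted by descending score,
                 as a search engine would return them for a document query.
    labels:      dict mapping doc_id to its current cluster id.
    delta:       patch size, i.e. the number of documents kept per cluster.
    """
    patches = defaultdict(list)
    for doc_id, _score in ranked_hits:
        cluster = labels.get(doc_id)
        if cluster is None:
            continue  # documents without a cluster label cannot seed a patch
        if len(patches[cluster]) < delta:
            patches[cluster].append(doc_id)
    return dict(patches)


hits = [("a", 9.1), ("b", 7.4), ("c", 6.0), ("d", 5.2), ("e", 1.0)]
labels = {"a": "c1", "b": "c2", "c": "c1", "d": "c1"}
patches = build_patches(hits, labels, delta=2)
```
]]></p><p>Because only the highest-ranked documents of each cluster are retained, a document is compared with a small, query-specific subset of a large cluster rather than with its global centroid.</p></div>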
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">RESULTS AND DISCUSSION</head><p>We submitted five runs for the supervised clustering task (Table <ref type="table" target="#tab_0">1</ref>). Runs one, three, four, and five used the proposed method on the text-only, text-time-space, all-attribute, and text-image attribute sets, respectively. Run two used the method described in <ref type="bibr" target="#b5">[5]</ref> with the text attribute only. The first two runs indicate that the proposed method has accuracy comparable to the general ranking method but improved cluster quality, as shown by NMI. The remaining runs show that the use of image, spatial, and time information is ineffective for clustering this data. The main reason is the dependence of the proposed method on text ranking.</p><p>Adaptive weighting, in which the weight of each attribute varies among documents, is a priority for future investigation to address this issue. Future work will also explore finding the optimal parameter γ and improving the scalability of the method in distributed data and distributed computing environments.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Ranking based document clustering with patches.</figDesc><graphic coords="1,355.40,543.21,179.87,156.56" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Algorithm 1 :</head><label>1</label><figDesc>input : Set of documents D, initial clusters C = {c_1, c_2, ..., c_K}, neighborhood size m, patches size δ, and cluster threshold γ.
output: K disjoint partitions of D.
Index all documents D;
for each d_i ∈ D_test do
    calculate a set of cluster patches P = {0 &lt; |d_rank| &lt; δ_i, i ∈ I, d ∈ c_j};
    for each p ∈ P do
        calculate p* = max_p {sim(d_i, p), p ∈ P};
        if sim(d_i, p*) &gt; γ then
            Assign document d_i to a cluster where p* belongs;
        else
            Form a new cluster c = d_i;
        end
        Update cluster labels via the search engine
    end
end
Incremental ranking based social event images clustering algorithm. The similarity measure between a document d and a patch p in a cluster c is given by utilizing textual, temporal, spatial and visual information within images:
sim(d, p) = β_1 sim_cosine(d, p) + β_2 sim_time(d, p) + β_3 sim_space(d, p) + β_4 sim_image(d, p).</figDesc></figure>
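<div xmlns="http://www.tei-c.org/ns/1.0"><p>The assignment step of Algorithm 1 can be sketched as follows, assuming that the four per-attribute similarities between a document and each patch have already been computed. The function names, the dictionary layout, and the example weights are hypothetical; only the weighted sum and the threshold test against γ follow the algorithm.</p><p><![CDATA[
```python
def combined_similarity(sims, betas):
    """Weighted sum of the per-attribute similarities.

    sims and betas are ordered as (cosine, time, space, image).
    """
    return sum(b * s for b, s in zip(betas, sims))


def assign(sims_by_patch, patch_cluster, betas, gamma, new_cluster_id):
    """Return the cluster of the best patch, or new_cluster_id.

    sims_by_patch: dict patch id -> (sim_cosine, sim_time, sim_space, sim_image)
    patch_cluster: dict patch id -> cluster id that owns the patch
    gamma:         cluster threshold on the combined similarity
    """
    best_patch, best_score = None, float("-inf")
    for patch_id, sims in sims_by_patch.items():
        score = combined_similarity(sims, betas)
        if score > best_score:
            best_patch, best_score = patch_id, score
    if best_patch is not None and best_score > gamma:
        return patch_cluster[best_patch]
    return new_cluster_id  # no patch is similar enough: form a new cluster


betas = (0.7, 0.1, 0.1, 0.1)  # illustrative weights, not values from the runs
sims_by_patch = {
    "p1": (0.9, 0.5, 0.5, 0.0),
    "p2": (0.2, 1.0, 1.0, 1.0),
}
patch_cluster = {"p1": "c1", "p2": "c2"}
joined = assign(sims_by_patch, patch_cluster, betas, gamma=0.5, new_cluster_id="new")
created = assign(sims_by_patch, patch_cluster, betas, gamma=0.8, new_cluster_id="new")
```
]]></p><p>With a text-dominated weighting such as the one assumed here, the patch with the highest cosine similarity wins even when another patch matches better on time, space, and image content.</p></div>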
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Semi-Supervised clustering results.</figDesc><table><row><cell></cell><cell>Run1</cell><cell>Run2</cell><cell>Run3</cell><cell>Run4</cell><cell>Run5</cell></row><row><cell>F1-Score</cell><cell>0.7463</cell><cell>0.7533</cell><cell>0.7445</cell><cell>0.7440</cell><cell>0.7456</cell></row><row><cell>NMI</cell><cell>0.9024</cell><cell>0.9020</cell><cell>0.9017</cell><cell>0.9015</cell><cell>0.9020</cell></row><row><cell>Div. F1</cell><cell>0.7447</cell><cell>0.7516</cell><cell>0.7428</cell><cell>0.7424</cell><cell>0.7439</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">ACKNOWLEDGMENTS</head><p>We would like to thank Dr Simon Denman from the QUT Computational Intelligence and Signal Processing lab for providing us with the image visual content encoding.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">SLIC superpixels compared to state-of-the-art superpixel methods</title>
		<author>
			<persName><forename type="first">R</forename><surname>Achanta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shaji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lucchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Susstrunk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Constrained Clustering: Advances in Algorithms, Theory, and Applications</title>
		<author>
			<persName><forename type="first">S</forename><surname>Basu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Davidson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wagstaff</surname></persName>
		</author>
		<imprint>
			<publisher>Chapman &amp; Hall/CRC</publisher>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
	<note>1st edition</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The heterogeneous cluster ensemble method using hubness for clustering text documents</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nayak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WISE 2013</title>
		<imprint>
			<publisher>Springer</publisher>
			<pubPlace>Berlin Heidelberg</pubPlace>
			<biblScope unit="page" from="102" to="110" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Social event detection at MediaEval 2014: Challenges, datasets, and evaluation</title>
		<author>
			<persName><forename type="first">G</forename><surname>Petkos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Papadopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Mezaris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kompatsiaris</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop</title>
				<meeting>the MediaEval 2014 Multimedia Benchmark Workshop<address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">October 16-17, 2014</date>
			<biblScope unit="volume">1044</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">ADMRG @ MediaEval 2013 social event detection</title>
		<author>
			<persName><forename type="first">T</forename><surname>Sutanto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nayak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop</title>
				<meeting>the MediaEval 2013 Multimedia Benchmark Workshop<address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">October 18-19, 2013</date>
			<biblScope unit="volume">1043</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">The ranking based constrained document clustering method and its application to social event detection</title>
		<author>
			<persName><forename type="first">T</forename><surname>Sutanto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nayak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Database Systems for Advanced Applications</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">8422</biblScope>
			<biblScope unit="page" from="47" to="60" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
