<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">USI Participation at SMERP 2017 Text Summarization Task</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Anastasia</forename><surname>Giachanou</surname></persName>
							<email>anastasia.giachanou@usi.ch</email>
							<affiliation key="aff0">
								<orgName type="institution">Università della Svizzera italiana (USI)</orgName>
								<address>
									<settlement>Lugano</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ida</forename><surname>Mele</surname></persName>
							<email>ida.mele@usi.ch</email>
							<affiliation key="aff0">
								<orgName type="institution">Università della Svizzera italiana (USI)</orgName>
								<address>
									<settlement>Lugano</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fabio</forename><surname>Crestani</surname></persName>
							<email>fabio.crestani@usi.ch</email>
							<affiliation key="aff0">
								<orgName type="institution">Università della Svizzera italiana (USI)</orgName>
								<address>
									<settlement>Lugano</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">USI Participation at SMERP 2017 Text Summarization Task</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">E70972C256A816677B4938E8D8328141</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Twitter</term>
					<term>emergency situations</term>
					<term>text summarization</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This short report describes the participation of the Università della Svizzera italiana (USI) in the SMERP Workshop Data Challenge Track for the Level 1 text summarization task. Our approach linearly interpolates the relevance and novelty scores of the retrieved tweets and is fully automatic. For the relevance score we used the results of our runs for the text retrieval task, whereas for novelty we used a method based on Word2Vec. In total, we submitted four runs using two values of the weight parameter. The results showed that giving relevance and novelty equal weight when selecting the tweets for the summary performs better than favoring novelty. Additionally, information from POS tags improves the performance of the summarization task.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Recent years have seen the rapid growth of social media platforms (e.g., Facebook, Twitter, Google+) that enable people to share information on the web in a simple way. People use social media platforms for many different reasons, ranging from writing their opinions on products to sharing information on emergency situations.</p><p>Twitter<ref type="foot" target="#foot_0">1</ref>, one of the most popular microblogs, is a good source of information, and mining it can be very useful for assisting relief operations in emergency situations. However, a large amount of data is posted online, which makes it very difficult to extract and summarize useful information from tweets. Tweet summarization aims to automatically generate a condensed version of the most important content from the tweets that are relevant to a specific information need. Past research on tweet summarization focused on topic-level summarization. Sharifi et al. <ref type="bibr" target="#b8">[9]</ref> proposed a technique that creates topic-related summaries by finding the most commonly used phrases for a topic. Inouye and Kalita <ref type="bibr" target="#b4">[5]</ref> proposed using clustering methods to select the posts to add to the summary, whereas Chakrabarti and Punera <ref type="bibr" target="#b1">[2]</ref> proposed a methodology based on Hidden Markov Models.</p><p>Other researchers have analyzed Twitter data to find newsworthy stories <ref type="bibr" target="#b0">[1]</ref> or to understand what caused a change in the opinion of users <ref type="bibr" target="#b2">[3]</ref>. 
These works are related to the task of information extraction and are orthogonal to the problem of text summarization, which is based on a specific information need (e.g., a query or a topic).</p><p>In this short report, we present our methodology for the text summarization task at the Exploitation of Social Media for Emergency Relief and Preparedness (SMERP) data challenge. Our approach is based on a linear interpolation that combines the relevance and novelty scores of the retrieved tweets.</p><p>To compute the relevance scores we used the same techniques as in the runs we submitted to the text retrieval task of the SMERP Data Challenge Track. Our first submitted run for that task was based on plain query expansion, whereas the second one used additional information from POS tags. A detailed description of the methodology we proposed for the text retrieval task is provided in <ref type="bibr" target="#b3">[4]</ref>.</p><p>Our summarization methods are fully automatic. We submitted four runs for the summarization task (i.e., two for each of the two runs used in the text retrieval task). The two runs built on the same retrieval run differ in the weight parameter, which represents the relative importance of the relevance and novelty of tweets and allows us to produce a list of relevant and, at the same time, diverse tweets that can be used in the summary.</p><p>To compute the novelty of each tweet, we use a metric based on text similarity, which we compute with a methodology based on word embeddings. More specifically, we used Word2Vec <ref type="bibr" target="#b6">[7]</ref> to produce word embeddings able to capture semantic similarity. 
Word embeddings have been used in several applications, including topic extraction <ref type="bibr" target="#b5">[6]</ref> and sentiment analysis <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b9">10]</ref>.</p><p>The results showed that setting the weight parameter to 0.5 (i.e., relevance and novelty contribute equally) performs better than favoring diversity. In addition, we observed that information from POS tags improves the performance in the summarization task.</p><p>This report is organized as follows. Section 2 describes the methodology we adopted for the task of text summarization. In Section 3 we present the results of our experiments, and Section 4 concludes the report.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Methodology</head><p>For this task, we used a fully automatic method that extracts summaries based on the linear interpolation of relevance and novelty scores. The novelty is quantified as the diversity of the current tweet with respect to the other tweets in the relevance ranking that can be selected for the text summary.</p><p>For the summarization we used the tweets retrieved in the two runs of the text retrieval task <ref type="bibr" target="#b3">[4]</ref>. More formally, let t_i be the tweet at position i in the relevance ranking for a query. We computed the following summary score:</p><formula xml:id="formula_0">S(t_i) = (1 - \lambda) \cdot \mathrm{rel}(t_i) + \lambda \cdot \mathrm{div}(t_i)</formula><p>where rel(t_i) is the normalized relevance score of the tweet t_i, and div(t_i) is the diversity score of t_i. The weight parameter λ balances relevance and diversity: the larger the value of λ, the more diversity is rewarded. We submitted 4 runs: USI 1 1 and USI 2 1 with λ = 0.5, which gives the same importance to relevance and diversity, and USI 1 2 and USI 2 2 with λ = 0.8, which favors diversity.</p><p>The diversity score measures the novelty of each tweet in the result list and is calculated as:</p><formula xml:id="formula_1">\mathrm{div}(t_i) = 1 - \mathrm{maxSim}(t_i)</formula><p>where maxSim(t_i) is the maximum similarity between the tweet t_i and any of the tweets that were retrieved before it:</p><formula xml:id="formula_2">\mathrm{maxSim}(t_i) = \max_{j \in \{1, \ldots, i-1\}} \mathrm{sim}(t_i, t_j)</formula><p>This similarity is computed with a methodology based on Word2Vec<ref type="foot" target="#foot_1">2</ref>. We use Word2Vec <ref type="bibr" target="#b6">[7]</ref> to produce word embeddings because we also want to capture semantic similarity. 
To train the model, we use an external collection C_e and set the window size to 5.</p><p>The collection C_e consists of tweets posted during the Nepal earthquake that occurred on the 25th of April 2015. More specifically, the original collection contains 90,000 tweets posted from the 1st to the 5th of May 2015. To use the collection for training, we first removed the URLs, some specific characters (e.g., @, #), and the retweets. Then, we filtered out terms that are specific to the Nepal earthquake by extracting the entities related to geographical names or people (e.g., Kathmandu, Mahadevstan, Rahul Gandhi) and removing all of them. At the end of this cleaning process we had 22,017 tweets, 198,280 tokens, and 12,379 unique tokens.</p><p>After computing the summary scores, we ranked the tweets in decreasing order of score and took the top tweets of this ranking, so as to obtain a summary of up to 300 words.</p><p>Table <ref type="table" target="#tab_0">1</ref> summarizes the runs submitted for the Level 1 text summarization task. Table <ref type="table" target="#tab_1">2</ref> shows the performance of the submitted runs, ranked according to ROUGE-1. The runs that used both query expansion and POS tags to retrieve the relevant tweets performed better than the runs based only on query expansion. We also observe that, for both retrieval configurations, setting the weight parameter to 0.5 performs better than setting it to 0.8, which favors diversity. We plan to analyze the results further to understand the strengths and limitations of our methods. Finally, we note that our runs were the only fully automatic methods submitted for text summarization at Level 1; therefore, we cannot directly compare the performance of our methods with that of the approaches submitted by the other groups.</p></div>
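<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scoring and selection procedure of Section 2 can be illustrated with a minimal sketch; it is not the code used for the submitted runs. Several details are assumptions that the report does not specify: the function names, the toy two-dimensional embeddings (standing in for trained Word2Vec vectors), the averaging of word vectors into a tweet vector, the use of cosine similarity for sim, a maximum similarity of 0 for the first tweet in the ranking, and counting words as tokens. The score follows the reading in which λ weights the diversity term, in line with the statement that a larger λ rewards diversity.</p>
<p><code lang="python">
```python
import math


def tweet_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding (assumed aggregation)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when a vector is missing or has zero norm."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def summary_scores(ranked, embeddings, lam):
    """ranked: list of (tokens, rel) pairs in decreasing relevance order,
    with rel already normalized. Returns one summary score per tweet."""
    vectors = [tweet_vector(tokens, embeddings) for tokens, _ in ranked]
    scores = []
    for i, (_, rel) in enumerate(ranked):
        # maxSim: highest similarity to any tweet ranked before position i
        # (assumed to be 0.0 for the first tweet, which has no predecessors).
        max_sim = max((cosine(vectors[i], vectors[j]) for j in range(i)), default=0.0)
        div = 1.0 - max_sim
        # Assumption: lam weights diversity, so a larger lam rewards diversity.
        scores.append((1.0 - lam) * rel + lam * div)
    return scores


def build_summary(ranked, scores, max_words=300):
    """Take tweets in decreasing score order until the word budget would be exceeded."""
    order = sorted(range(len(ranked)), key=lambda i: scores[i], reverse=True)
    summary, used = [], 0
    for i in order:
        n_words = len(ranked[i][0])
        if used + n_words > max_words:
            break
        summary.append(i)
        used += n_words
    return summary


# Toy data (hypothetical): the second tweet duplicates the first one.
embeddings = {"water": [1.0, 0.0], "food": [0.0, 1.0], "needed": [1.0, 1.0]}
ranked = [(["water", "needed"], 1.0), (["water", "needed"], 0.9), (["food"], 0.5)]
scores = summary_scores(ranked, embeddings, lam=0.5)
print(build_summary(ranked, scores))  # [0, 2, 1]
```
</code></p>
<p>In this toy example the duplicate tweet drops below the less relevant but novel one, which is the effect the diversity term is meant to have.</p></div>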
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusions</head><p>In this short report we presented the participation of the Università della Svizzera italiana (USI) in the SMERP Workshop Data Challenge Track for the Level 1 text summarization task. Our approach was based on a linear interpolation that combines the relevance and novelty scores of the retrieved tweets. We submitted four runs. The results showed that setting the weight parameter to 0.5 performs better than favoring diversity. In addition, the results showed that using information from POS tags yields better performance in the summarization task.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>Summary of runs (QE: query expansion; POS: POS tags)</figDesc><table><row><cell>Run id</cell><cell>Task</cell><cell>Description of the run</cell></row><row><cell>USI 1 1</cell><cell>Summarization</cell><cell>QE, λ = 0.5</cell></row><row><cell>USI 1 2</cell><cell>Summarization</cell><cell>QE, λ = 0.8</cell></row><row><cell>USI 2 1</cell><cell>Summarization</cell><cell>QE + POS, λ = 0.5</cell></row><row><cell>USI 2 2</cell><cell>Summarization</cell><cell>QE + POS, λ = 0.8</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2.</head><label>2</label><figDesc>Performance results on text summarization task</figDesc><table><row><cell>Run id</cell><cell>ROUGE-1</cell></row><row><cell>USI 2 1</cell><cell>0.3209</cell></row><row><cell>USI 1 1</cell><cell>0.3044</cell></row><row><cell>USI 2 2</cell><cell>0.3035</cell></row><row><cell>USI 1 2</cell><cell>0.3010</cell></row></table></figure>
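<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers unfamiliar with the metric in Table 2, ROUGE-1 measures the unigram overlap between a generated summary and a reference summary. The sketch below is only an illustration: the function name, the lowercasing, and the whitespace tokenization are assumptions, and the report does not state which ROUGE-1 variant (recall, precision, or F-measure) or which preprocessing the organizers used.</p>
<p><code lang="python">
```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (recall, precision, f1) based on clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Each candidate unigram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[token]) for token, count in cand.items())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1


# Hypothetical example: two of the four reference unigrams are covered.
print(rouge_1("water needed in kathmandu", "clean water urgently needed"))
```
</code></p></div>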
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://twitter.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">The library used for Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments. This research was partially funded by the Swiss National Science Foundation (SNSF) under the project OpiTrack.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Beyond Trending Topics: Real-World Event Identification on Twitter</title>
		<author>
			<persName><forename type="first">H</forename><surname>Becker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Naaman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gravano</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media</title>
				<meeting>the Fifth International AAAI Conference on Weblogs and Social Media</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="438" to="441" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Event summarization using tweets</title>
		<author>
			<persName><forename type="first">D</forename><surname>Chakrabarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Punera</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media</title>
				<meeting>the Fifth International AAAI Conference on Weblogs and Social Media</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="66" to="73" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Explaining sentiment spikes in twitter</title>
		<author>
			<persName><forename type="first">A</forename><surname>Giachanou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Crestani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th ACM International on Conference on Information and Knowledge Management</title>
				<meeting>the 25th ACM International on Conference on Information and Knowledge Management</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2263" to="2268" />
		</imprint>
	</monogr>
	<note>CIKM &apos;16</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">USI Participation at SMERP 2017 Text Retrieval Task</title>
		<author>
			<persName><forename type="first">A</forename><surname>Giachanou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Crestani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Exploitation of Social Media for Emergency Relief and Preparedness (SMERP) Workshop (Data Challenge Track)</title>
				<meeting>Exploitation of Social Media for Emergency Relief and Preparedness (SMERP) Workshop (Data Challenge Track)</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Comparing twitter summarization algorithms for multiple post summaries</title>
		<author>
			<persName><forename type="first">D</forename><surname>Inouye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">K</forename><surname>Kalita</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="298" to="306" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Topical word embeddings</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">S</forename><surname>Chua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence</title>
				<meeting>the Twenty-Ninth AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="2418" to="2424" />
		</imprint>
	</monogr>
	<note>AAAI &apos;15</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Learning Representations</title>
				<meeting>the International Conference on Learning Representations</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note>ICLR &apos;13</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Twitter sentiment analysis with deep convolutional neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Severyn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moschitti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<meeting>the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="959" to="962" />
		</imprint>
	</monogr>
	<note>SIGIR &apos;15</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Summarizing microblogs automatically</title>
		<author>
			<persName><forename type="first">B</forename><surname>Sharifi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hutton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kalita</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="685" to="688" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Building large-scale twitter-specific sentiment lexicon: A representation learning approach</title>
		<author>
			<persName><forename type="first">D</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers</title>
				<meeting>the 25th International Conference on Computational Linguistics: Technical Papers</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="172" to="182" />
		</imprint>
	</monogr>
	<note>COLING &apos;14</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
