<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Summarizing Disaster Related Event from Microblog</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Sandip</forename><surname>Modha</surname></persName>
							<email>sjmodha@gmail.com</email>
						</author>
						<author>
							<persName><forename type="first">Dhirubhai</forename><surname>Ambani</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Rishab</forename><surname>Singla</surname></persName>
							<email>singlarishab15@gmail.com</email>
						</author>
						<author>
							<persName><forename type="first">Prasenjit</forename><surname>Majumder</surname></persName>
							<email>prasenjit_majumder@gmail.com</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">Institute of Information and Communication Technology</orgName>
								<address>
									<settlement>Gandhinagar</settlement>
									<region>Gujarat</region>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Institute of Information and Communication Technology</orgName>
								<address>
									<settlement>Gandhinagar</settlement>
									<region>Gujarat</region>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">Institute of Information and Communication Technology</orgName>
								<address>
									<settlement>Gandhinagar</settlement>
									<region>Gujarat</region>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chintak</forename><surname>Soni</surname></persName>
							<affiliation key="aff3">
								<orgName type="institution" key="instit2">LDRP-ITR</orgName>
								<address>
									<settlement>Gandhinagar</settlement>
									<region>Gujarat</region>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Summarizing Disaster Related Event from Microblog</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">46B83E99BF626FE95EA5B0CC236B0BD2</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:26+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Microblog</term>
					<term>Information Retrieval</term>
					<term>Disaster</term>
					<term>Wordnet</term>
					<term>BM25</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Information Retrieval Lab at DA-IICT, India, participated in the text summarization task of the Data Challenge track of SMERP 2017. The track organizers provided the Italy earthquake tweet dataset along with a set of topics that describe the important information required during any disaster-related incident. The main goal of the task is to assess how well a participant's system summarizes, in 300 words, the important tweets that are relevant to a given topic. We treat text summarization as a clustering problem, and our approach is based on extractive summarization. We submitted runs at both levels, using a different methodology for each. We performed query expansion on the topics using WordNet. In the first level, we calculated the cosine similarity score between tweets and the expanded query. In the second level, we used a language model with Jelinek-Mercer smoothing to calculate the relevance score between tweets and the expanded query. Tweets scoring above a relevance threshold form the initial candidate tweets for the summary of each query. To ensure novelty, Jaccard similarity is used to create a cluster for each topic. We report results in terms of ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-SU4.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Microblogs such as Twitter provide a unique crowdsourcing platform where people across the world can post their opinions or observations about real-world events. Twitter is a real-time data source with massive user-generated content. Since tweets are posted by multiple users with diverse views, many tweets have redundant content. Due to the enormous volume of tweets, tweet visualization is a major challenge. We can address this challenge by creating a summary from the tweets that are relevant to a given topic.</p><p>The aim of the Text Summarization Data Challenge track is to evaluate and benchmark different summarization systems on a standard social media dataset. The text summarization track is offered in two levels. In the first level, tweets posted on the first day of the earthquake in Italy were provided. In the second level, tweets posted on the second and third days of the earthquake were provided.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Summarization methods can be divided into two types: (i) extractive summarization and (ii) abstractive summarization. We focus on extractive summarization. Extractive summarization methods are further divided into three types: (i) graph based, (ii) cluster based and (iii) centroid based. TREC<ref type="foot" target="#foot_0">1</ref> has run a Microblog track since 2011, beginning with an ad hoc retrieval task that converged into real-time summarization in 2016. CLIP <ref type="bibr" target="#b1">[2]</ref> used a word embedding technique to expand the query and the BM25 model to calculate the relevance score between tweets and the query. For summarization, they used Jaccard similarity across relevant tweets. Tan et al. <ref type="bibr" target="#b3">[4]</ref> used a simple keyword matching technique that assigns more weight to the original terms than to the expanded terms. For summarization, they used simple word overlap.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Problem Statement</head><p>Given the topics $Q = \langle \text{SMERP-T}_1, \text{SMERP-T}_2, \text{SMERP-T}_3, \text{SMERP-T}_4 \rangle$ and the tweet dataset $T = \langle T_1, T_2, \ldots, T_n \rangle$, we need to compute the relevance score between tweets and topics in order to create the topic-wise summaries $S = \langle S_{Q_1}, \ldots, S_{Q_4} \rangle$, where $S_{Q_i}$ is the set of relevant and novel tweets for topic $Q_i$. We can model a topic-specific summary as follows.</p><p>$S_{Q_1} = \langle T_1, T_2, \ldots, T_m \rangle$, where each $T_j \in T$ and $m \le n$. For a given topic, the relevance score between a tweet and the topic must be greater than a specified threshold $T_{rel}$. In addition, the tweets should be novel, i.e. the similarity between any two tweets of the summary should be less than the novelty threshold $T_{nov}$. If a tweet $T_j$ is included in the summary for a topic $Q_i$, it should satisfy the following constraints:</p><p>(i) $\mathrm{Length}(S_{Q_i}) \le 300$ words; (ii) $\mathrm{Relevance}(T_j, Q_i) &gt; T_{rel}$; (iii) $\mathrm{Sim}(T_j, T_k) &lt; T_{nov}$ for all $T_k \in S_{Q_i}$ with $k \ne j$.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Our Methodology</head><p>Four topics have been provided in TREC format by the track organizers. Each topic consists of a title, a description and a narrative. We also refer to the topics as queries in this paper. Our approach is elaborated in the following subsections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Topic Preprocessing</head><p>Each topic consists of a title that states the general information need, a one-sentence description, and a paragraph-long narrative that gives a more elaborate picture of the topic. Topic SMERP-T4, for example, reads as follows.</p><p>&lt;top&gt; &lt;num&gt; Number: SMERP-T4 &lt;title&gt; WHAT ARE THE RESCUE ACTIVITIES OF VARIOUS NGOs / GOVERNMENT ORGANIZATIONS &lt;desc&gt; Description: Identify the messages which describe on-ground rescue activities of different NGOs and Government organizations. &lt;narr&gt; Narrative: A relevant message must contain information about relief-related activities of different NGOs and Government organizations engaged in rescue and relief operation. Messages that contain information about the volunteers visiting different geographical locations would also be relevant. Messages indicating that organizations are accumulating money and other resources will also be relevant. However, messages that do not contain the name of any NGO / Government organization would not be relevant. &lt;/top&gt;</p><p>The topic-to-query conversion starts with the removal of stopwords. We then run the Stanford POS tagger<ref type="foot" target="#foot_1">2</ref> and add the keywords tagged as nouns or verbs to the query. Because we believe the topics are vague, the final query is built with human intervention.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Topic Expansion</head><p>For topic expansion we used WordNet<ref type="foot" target="#foot_2">3</ref>, a lexical database that groups English words into sets of synonyms called synsets. The top two WordNet synonyms are extracted and added to the query.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Tweet Filtering</head><p>After downloading the tweets, we kept only the English ones. Retweets and tweets consisting only of hashtags, emoticons or special characters were discarded, as were tweets with fewer than 5 words. We removed all stopwords and non-ASCII characters from the remaining tweets.</p></div>
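<div xmlns="http://www.tei-c.org/ns/1.0"><p>The filtering rules above can be sketched roughly as follows. This is a minimal illustration, not the code used for the submitted runs. The stopword list, the "RT " prefix test for retweets, the lang argument and the function name clean_tweet are all assumptions made for the example; the paper does not specify them.</p><p>
```python
import re

# Tiny illustrative stopword list (an assumption; the paper does not name its list).
STOPWORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "and", "for", "on", "at"}

MIN_WORDS = 5  # tweets with fewer than 5 words are ignored


def clean_tweet(text, lang="en"):
    """Return the kept tokens of a tweet, or None if the tweet is filtered out."""
    # Keep English tweets only; the language tag is assumed to be supplied.
    if lang != "en":
        return None
    # Assumption: retweets are recognised by the conventional "RT " prefix.
    if text.startswith("RT "):
        return None
    # Drop non-ASCII characters (this also removes emoji).
    ascii_text = text.encode("ascii", "ignore").decode("ascii")
    words = []
    plain_words = 0  # words that are not hashtags
    for token in ascii_text.split():
        word = re.sub(r"[^A-Za-z0-9]", "", token).lower()
        if not word:
            continue  # token was only special characters or an emoticon
        words.append(word)
        if not token.startswith("#"):
            plain_words += 1
    # Discard tweets made only of hashtags or symbols, and very short tweets.
    if plain_words == 0 or MIN_WORDS > len(words):
        return None
    # Stopwords are removed after the length check (an assumption about ordering).
    return [w for w in words if w not in STOPWORDS]


if __name__ == "__main__":
    print(clean_tweet("Red Cross teams are searching the rubble in Amatrice"))
    print(clean_tweet("#Italy #earthquake #Amatrice #help #rescue"))
```
</p><p>The word count is taken before stopword removal here; counting afterwards would be an equally plausible reading of the rule and would discard more tweets.</p></div>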
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Relevance Score</head><p>In the first level, we used cosine similarity to calculate the relevance score between a tweet and the expanded query. In the second level, we retrieved relevant tweets using a language model with Jelinek-Mercer smoothing, with the smoothing parameter λ = 0.1.</p></div>
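<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the two relevance scores, the sketch below computes cosine similarity over raw term-frequency vectors and a Jelinek-Mercer smoothed query log-likelihood. Several details are assumptions, since the paper does not state them: the use of raw term frequencies for the cosine score, the convention that λ weights the collection model, i.e. (1 - λ) P(w|d) + λ P(w|C), the skipping of query terms unseen in the collection, and all function and parameter names.</p><p>
```python
import math
from collections import Counter


def cosine_similarity(tweet_tokens, query_tokens):
    """Cosine similarity between raw term-frequency vectors."""
    tweet_tf, query_tf = Counter(tweet_tokens), Counter(query_tokens)
    dot = sum(tweet_tf[term] * query_tf[term] for term in query_tf)
    tweet_norm = math.sqrt(sum(v * v for v in tweet_tf.values()))
    query_norm = math.sqrt(sum(v * v for v in query_tf.values()))
    norm = tweet_norm * query_norm
    return dot / norm if norm else 0.0


def jm_log_likelihood(tweet_tokens, query_tokens, collection_tf, collection_len, lam=0.1):
    """Query log-likelihood under a Jelinek-Mercer smoothed tweet language model.

    Assumed convention: P(w) = (1 - lam) * P(w | tweet) + lam * P(w | collection).
    """
    tweet_tf = Counter(tweet_tokens)
    tweet_len = len(tweet_tokens)
    score = 0.0
    for term in query_tokens:
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0:
            continue  # term unseen in the whole collection: skipped (assumption)
        p_doc = tweet_tf[term] / tweet_len if tweet_len else 0.0
        score += math.log((1 - lam) * p_doc + lam * p_coll)
    return score


if __name__ == "__main__":
    tweets = [["rescue", "team", "amatrice"], ["weather", "rome"]]
    collection_tf = Counter(tok for tweet in tweets for tok in tweet)
    collection_len = sum(collection_tf.values())
    query = ["rescue"]
    for tweet in tweets:
        print(cosine_similarity(tweet, query),
              jm_log_likelihood(tweet, query, collection_tf, collection_len))
```
</p><p>The log-likelihood is always non-positive, so a relevance threshold applied to it would be a negative number, whereas the cosine score lies between 0 and 1.</p></div>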
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Novelty Detection</head><p>Tweets are posted by many users at different times from different parts of the world, so creating a text summary from them is a challenging task. Ideally, a summary should include all relevant tweets, subject to the constraint that it contains no redundant information. Tweet summarization is a multi-document summarization problem in which each tweet can be considered a single document.</p><p>To create the summary, we select for each topic the top tweets whose relevance score is greater than the relevance threshold $T_{rel}$, whose value we set empirically. For each subsequent eligible tweet, we calculate its Jaccard similarity with the tweets already added to the summary in order to ensure novelty. The Jaccard threshold $T_{nov}$ = 0.6 was likewise decided empirically, and tweets whose similarity falls below it are added to the summary. The lower the similarity score, the greater the dissimilarity, which ensures more novelty.</p></div>
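<div xmlns="http://www.tei-c.org/ns/1.0"><p>The selection procedure described above can be sketched as a greedy loop. This is an illustrative reading rather than the submitted system: the whitespace tokenisation, the treatment of the 300-word limit as a running budget, the value passed for the relevance threshold, and the names jaccard and build_summary are assumptions. Only the novelty threshold of 0.6 and the 300-word limit come from the text.</p><p>
```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between the word sets of two tweets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a.union(set_b)
    return len(set_a.intersection(set_b)) / len(union) if union else 0.0


def build_summary(scored_tweets, t_rel, t_nov=0.6, max_words=300):
    """Greedily pick relevant, mutually novel tweets within a word budget.

    scored_tweets is a list of (text, relevance) pairs for one topic.
    """
    summary, used_words = [], 0
    ranked = sorted(scored_tweets, key=lambda pair: pair[1], reverse=True)
    for text, relevance in ranked:
        if not relevance > t_rel:
            break  # every remaining tweet is at or below the relevance threshold
        tokens = text.lower().split()
        if used_words + len(tokens) > max_words:
            continue  # would exceed the word budget; a shorter tweet may still fit
        # Novelty check: similarity to every tweet already kept must be below t_nov.
        if all(t_nov > jaccard(tokens, kept.lower().split()) for kept in summary):
            summary.append(text)
            used_words += len(tokens)
    return summary


if __name__ == "__main__":
    candidates = [
        ("red cross sends rescue teams to amatrice", 0.9),
        ("red cross sends rescue teams to amatrice now", 0.8),
        ("civil protection opens shelters in rieti", 0.7),
        ("nice weather in rome", 0.1),
    ]
    # t_rel=0.3 is a hypothetical value; the paper sets its threshold empirically.
    for line in build_summary(candidates, t_rel=0.3):
        print(line)
```
</p><p>Processing tweets in descending order of relevance means that, among near-duplicates, the highest-scoring one is the tweet that enters the summary.</p></div>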
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results</head><p>Our summarization method remains the same at both levels; only the tweet retrieval technique differs. The SMERP 2017 track organizers considered ROUGE-L as the primary metric for evaluating the performance of all runs. The following tables show our results in comparison with the top run.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions And Future Work</head><p>In this paper, we have implemented a method based on extractive summarization. Table <ref type="table" target="#tab_0">1</ref> and Table <ref type="table" target="#tab_1">2</ref> show that our scores are lower than those of the top run, IIEST, on every reported metric. In the future, we will investigate the reasons for this underperformance through a post-hoc error analysis. We would also like to design a summarization system based on deep neural networks and logistic regression.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>&lt;top&gt;&lt;num&gt; Number: SMERP-T4 &lt;title&gt;WHAT ARE THE RESCUE ACTIVITIES OF VARIOUS NGOs / GOVERNMENT ORGANIZATIONS &lt;desc&gt; Description: Identify the messages which describe on-ground rescue activities of different NGOs and Government organizations.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 1 .</head><label>1</label><figDesc>Fig.1. Methodology Flowchart</figDesc><graphic coords="5,136.05,343.10,345.90,99.04" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Task-2 (summarization) result level-1</figDesc><table><row><cell>Sr no</cell><cell>Run-id</cell><cell>Run type</cell><cell>Recall (ROUGE-1)</cell><cell>Recall (ROUGE-2)</cell><cell>Recall (ROUGE-L)</cell><cell>Recall (ROUGE-SU4)</cell></row><row><cell>1</cell><cell>daiict_irlab_2</cell><cell>Semi-automatic</cell><cell>0.3309</cell><cell>0.1543</cell><cell>0.3085</cell><cell>0.1055</cell></row><row><cell>2</cell><cell>Top run IIEST</cell><cell>Semi-automatic</cell><cell>0.5109</cell><cell>0.2824</cell><cell>0.4885</cell><cell>0.2329</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Task-2 (summarization) result level-2</figDesc><table><row><cell>Sr no</cell><cell>Run-id</cell><cell>Run type</cell><cell>Recall (ROUGE-1)</cell><cell>Recall (ROUGE-2)</cell><cell>Recall (ROUGE-L)</cell><cell>Recall (ROUGE-SU4)</cell></row><row><cell>1</cell><cell>daiict_irlab_summ_l2</cell><cell>Semi-automatic</cell><cell>0.3515</cell><cell>0.1297</cell><cell>0.3254</cell><cell>0.1194</cell></row><row><cell>2</cell><cell>Top run IIEST</cell><cell>Semi-automatic</cell><cell>0.5540</cell><cell>0.2436</cell><cell>0.5142</cell><cell>0.2864</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://trec.nist.gov/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://nlp.stanford.edu:8080/parser/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://wordnet.princeton.edu/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<ptr target="http://www.computing.dcu.ie/~dganguly/smerp2017/" />
		<title level="m">SMERP ECIR 2017 guidelines</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">CLIP at TREC 2015: Microblog and LiveQA</title>
		<author>
			<persName><forename type="first">M</forename><surname>Bagdouri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Oard</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>TREC</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">University of Waterloo at TREC 2015 Microblog Track</title>
		<author>
			<persName><forename type="first">L</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roegiest</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Clarke</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>TREC</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Simple dynamic emission strategies for microblog filtering</title>
		<author>
			<persName><forename type="first">L</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roegiest</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 39th International ACM SIGIR conference on Research and Development in Information Retrieval</title>
				<meeting>39th International ACM SIGIR conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1009" to="1012" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Earthquake shakes Twitter users: real-time event detection by social sensors</title>
		<author>
			<persName><forename type="first">T</forename><surname>Sakaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Okazaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matsuo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 19th international conference on World wide web</title>
				<meeting>19th international conference on World wide web</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="851" to="860" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
