<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Information Extraction from Microblog for Disaster Related Event</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Rishab</forename><surname>Singla</surname></persName>
							<email>singlarishab15@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Dhirubhai Ambani Institute of Information and Communication Technology</orgName>
								<address>
									<settlement>Gandhinagar</settlement>
									<region>Gujarat</region>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sandip</forename><surname>Modha</surname></persName>
							<email>sjmodha@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="department">Dhirubhai Ambani Institute of Information and Communication Technology</orgName>
								<address>
									<settlement>Gandhinagar</settlement>
									<region>Gujarat</region>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Prasenjit</forename><surname>Majumder</surname></persName>
							<email>prasenjit_majumder@gmail.com</email>
							<affiliation key="aff2">
								<orgName type="department">Dhirubhai Ambani Institute of Information and Communication Technology</orgName>
								<address>
									<settlement>Gandhinagar</settlement>
									<region>Gujarat</region>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chintak</forename><surname>Mandalia</surname></persName>
							<affiliation key="aff3">
								<orgName type="institution">LDRP-ITR</orgName>
								<address>
									<settlement>Gandhinagar</settlement>
									<region>Gujarat</region>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Information Extraction from Microblog for Disaster Related Event</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">91FC45FA9170863E9427DAE97F85DC69</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Microblog</term>
					<term>Information Retrieval</term>
					<term>Disaster</term>
					<term>Wordnet</term>
					<term>BM25</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents the participation of the Information Retrieval Lab (IRLAB) at DAIICT Gandhinagar, India in the Data Challenge track of SMERP 2017. This year the SMERP Data Challenge track offered a Text Extraction task on the Italy earthquake tweet dataset, with the objective of retrieving relevant tweets with high recall and high precision. We submitted three runs for this task and describe the different approaches adopted. First, we performed query expansion on the topics using WordNet. In the first run, we ranked tweets by their cosine similarity to the topics. In the second run, the relevance score between tweets and a topic was calculated using the Okapi BM25 ranking function, and in the third run the relevance score was calculated using a language model with Jelinek-Mercer smoothing.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Microblogs like Twitter can play a very important role in any disaster-related event.</p><p>Twitter has a massive registered user base. As of 2016, Twitter<ref type="foot" target="#foot_0">1</ref> had more than 319 million monthly active users. On the day of the 2016 U.S. presidential election, Twitter proved to be the largest source of breaking news, with 40 million tweets sent by 10 p.m. (Eastern Time) that day. Twitter enables humans to act as social sensors of the real world. It allows its registered users to post short texts called tweets of up to 140 characters.</p><p>Many incidents in the past have shown that social media is the first medium through which news of a disaster such as an earthquake reaches people. Recently, many earthquake incidents were reported first on Twitter and only later in other media <ref type="bibr" target="#b4">[5]</ref>. Twitter can be used effectively by NGOs and government agencies to assess the ground reality in a disaster area and assist their rescue operations.</p><p>The motivation of the Data Challenge track is to promote the development of IR methodologies that can extract important information from social media during emergency events, and to arrange a comparative evaluation of these methodologies <ref type="bibr" target="#b0">[1]</ref>. The track offered a Text Retrieval task at two levels: in the first level, the organizers provided tweet-ids of tweets posted on the first day of the Italy earthquake; in the second level, tweet-ids of tweets posted during the second day were provided <ref type="bibr" target="#b0">[1]</ref>. The track organizers also provided topics in TREC style, for which we have to extract and summarize relevant tweets.</p><p>The aim of the Text Retrieval subtask is to retrieve the top relevant tweets for each of the specified topics with high precision and high recall.
The paper is organized as follows: we discuss related work in Section 2 and describe the tweet dataset in Section 3. In Section 4 we state the problem, and in Section 5 we present our methodology. In Section 6 we present the results and analysis, and in Section 7 we draw conclusions and discuss future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>We started our work by referring to the TREC Microblog 2015 papers. TREC<ref type="foot" target="#foot_1">2</ref> has run the Microblog track since 2011 with the objective of exploring new IR methodologies for short text. CLIP <ref type="bibr" target="#b1">[2]</ref> trained their Word2vec model on a 4-year tweet corpus and used the Okapi BM25 relevance model to calculate scores. To refine the scores of the relevant tweets, tweets were rescored with the SVM rank package using the relevance scores of the previous stage. The University of Waterloo <ref type="bibr" target="#b3">[4]</ref> implemented the filtering tasks by building a term vector for each user profile and assigning different weights to different types of terms. To discover the most significant tokens in each user profile, they calculated the pointwise KL divergence and ranked the scores of each token in the profile.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Tweet Dataset</head><p>The SMERP 2017 track organizers provided a dataset of tweet-ids posted on Twitter during the earthquake in Italy in August 2016, along with a Python script that downloads the tweets using the Twitter API <ref type="bibr" target="#b0">[1]</ref>. The text retrieval track is offered at two levels: tweets posted on the first day of the Italy earthquake form the level-1 dataset, and tweets posted on days two and three form the level-2 dataset. The organizers provided 52,469 tweet-ids in level-1 and 19,751 tweet-ids in level-2, along with 4 topics in the TREC format.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Problem Statement</head><p>Given topics Q = {SMERP-T1, SMERP-T2, SMERP-T3, SMERP-T4} and a tweet dataset T = {T1, T2, ..., Tn}, we have to design a ranking function R: (Q, T) → {R1, ..., Rn} which ranks tweets against a given topic based on their relevance scores, where Ri ⊆ T is the set of tweets relevant to the i-th topic.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Our Methodology</head><p>The track organizers provided 4 topics in the TREC format, each consisting of a title, a description, and a narrative. Essentially these topics are our queries, and the two terms will be used interchangeably throughout the paper. In this section, we describe our approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Topic Preprocessing</head><p>Each topic consists of a title, which describes the general information need, and a description and a narrative, which are sentence- and paragraph-long passages describing the overall picture. For example, the narrative of topic SMERP-T1 reads: &lt;narr&gt; Narrative: A relevant message must mention the availability of some resource like food, drinking water, shelter, clothes, blankets, blood, human resources like volunteers, resources to build or support infrastructure, like tents, water filter, power supply, etc. Messages informing the availability of transport vehicles for assisting the resource distribution process would also be relevant. Also, messages indicating any services like free wifi, sms, calling facility etc. will also be relevant. In addition, any message or announcement about donation of money will also be relevant. However, generalized statements without reference to any resource would not be relevant.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>To convert a topic into a query, we first removed stopwords and then ran the Stanford POS tagger<ref type="foot" target="#foot_2">3</ref> on the topics. All keywords with noun and verb labels were extracted and added to the query. We believe the topics are extremely vague, so human intervention is required to build the query.</p></div>
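The topic-to-query step above (stopword removal followed by keeping only noun and verb tokens) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `build_query` and the tiny stopword set are our own, and the tagger is passed in as a parameter because the paper used the Stanford POS tagger, whereas any Penn-Treebank-style tagger (e.g. NLTK's `pos_tag`) fits the same interface.

```python
def build_query(topic_text, tagger, stopwords):
    # Stopword removal first, as described in the paper.
    tokens = [t for t in topic_text.lower().split() if t not in stopwords]
    # Keep only tokens tagged as nouns (NN*) or verbs (VB*);
    # `tagger` maps a token list to (token, tag) pairs in the
    # Penn Treebank tagset used by the Stanford POS tagger.
    return [tok for tok, tag in tagger(tokens) if tag.startswith(("NN", "VB"))]
```

With NLTK one would pass `tagger=nltk.pos_tag` and a full stopword list; the manual intervention mentioned above would then prune or add terms by hand.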
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Query Expansion</head><p>We used the lexical database WordNet<ref type="foot" target="#foot_3">4</ref> for query/topic expansion. WordNet groups English words into sets of synonyms called synsets. For each term in a query, we extracted the top 2 synonyms from WordNet and added them to the query, giving the original and the expanded terms equal weight.</p></div>
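A minimal sketch of this expansion step, with the synonym source passed in as a lookup function (the paper only states that WordNet supplied the synonyms; with NLTK one could pass `lambda t: [l.replace("_", " ") for s in wordnet.synsets(t) for l in s.lemma_names()]`). The function name and interface are our assumptions for illustration.

```python
def expand_query(terms, synonym_lookup, n_synonyms=2):
    # Original terms keep their positions; up to n_synonyms expansion
    # terms per original term are appended. Original and expanded terms
    # are treated with equal weight, as in the paper.
    expanded = list(terms)
    for term in terms:
        added = 0
        for syn in synonym_lookup(term):
            if syn.lower() != term.lower() and syn not in expanded:
                expanded.append(syn)
                added += 1
            if added == n_synonyms:
                break
    return expanded
```

Because the expanded terms are simply appended, any bag-of-words ranking function downstream (cosine, BM25, language model) sees them with the same weight as the original query terms.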
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Tweet Preprocessing</head><p>After downloading the tweets, non-English tweets were filtered out. Tweets include emoticons, hashtags, and many special characters. We did not consider retweets or tweets consisting only of hashtags, emoticons, or special characters. We also ignored tweets with fewer than 5 words, and removed all stopwords and non-ASCII characters from each tweet.</p></div>
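The filtering and cleaning rules above can be sketched as two small helpers. This is a sketch under stated assumptions: the function names, the retweet test, and the abbreviated stopword set are ours, not the paper's code.

```python
import re

# Illustrative subset; the paper does not list its stopword inventory.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}

def keep_tweet(text, min_words=5):
    # Drop retweets and tweets that carry no real content
    # (only hashtags, mentions, emoticons, or special characters).
    if text.startswith("RT "):
        return False
    words = [w for w in text.split()
             if re.search(r"[A-Za-z]", w) and not w.startswith(("#", "@"))]
    return len(words) >= min_words

def clean_tweet(text):
    # Strip non-ASCII characters, then remove stopwords.
    ascii_text = text.encode("ascii", "ignore").decode()
    tokens = [t for t in ascii_text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)
```

Language filtering of non-English tweets (the first step mentioned above) would sit before `keep_tweet`, e.g. using the `lang` field returned by the Twitter API.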
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Relevance Score</head><p>We submitted two runs at the first level and three runs at the second level of the Text Retrieval track, each using a different retrieval technique. We discuss each technique in turn.</p><p>Relevance score using cosine similarity.</p><p>In the first run, we used the cosine similarity between each tweet and the expanded topic as the relevance score.</p></div>
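The first run's scoring can be sketched as cosine similarity over term-frequency vectors. The paper does not specify the term weighting, so raw term frequency is an assumption here; tf-idf would be the natural refinement.

```python
import math
from collections import Counter

def cosine_score(query_terms, tweet_terms):
    # Cosine similarity between raw term-frequency vectors of the
    # expanded query and a preprocessed tweet.
    q, t = Counter(query_terms), Counter(tweet_terms)
    dot = sum(q[w] * t[w] for w in q.keys() & t.keys())
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in t.values())))
    return dot / norm if norm else 0.0
```

Ranking a collection is then a matter of sorting tweets by `cosine_score` against each expanded topic and returning the top results.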
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Tweet relevance score using the Okapi BM25 model</head><p>In the second run, the relevance score between tweets and the expanded query was calculated with the Okapi BM25 ranking function, using the parameters b = 0.75 and k1 = 0.2.</p><p>Tweet relevance score using a language model.</p><p>In the third run, we indexed all the tweets in Lucene<ref type="foot" target="#foot_4">5</ref>. A language model with Jelinek-Mercer smoothing was used to retrieve the tweets relevant to each query. A tweet was considered relevant to a particular topic if its score exceeded a threshold, which was set to 24; the smoothing parameter λ was set to 0.1.</p></div>
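The two scoring functions of runs two and three can be sketched as follows. The parameter values (k1 = 0.2, b = 0.75, λ = 0.1) come from the paper; the code itself, including the common log-based idf variant used here, is our illustrative reconstruction, not the authors' implementation (the third run actually used Lucene's built-in `LMJelinekMercerSimilarity`).

```python
import math
from collections import Counter

def bm25_score(query_terms, tweet_terms, df, n_docs, avg_len, k1=0.2, b=0.75):
    # Okapi BM25 with the parameter values reported in the paper.
    # `df` maps a term to its document frequency over the tweet
    # collection; `avg_len` is the average tweet length in tokens.
    tf = Counter(tweet_terms)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue
        n = df.get(term, 0)
        idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(tweet_terms) / avg_len))
        score += idf * norm
    return score

def lm_jm_score(query_terms, tweet_terms, coll_prob, lam=0.1):
    # Query likelihood with Jelinek-Mercer smoothing: the mixture
    # (1 - lambda) * P(t|d) + lambda * P(t|C), with lambda = 0.1 as in
    # the third run. `coll_prob` maps a term to its collection probability.
    tf = Counter(tweet_terms)
    dlen = len(tweet_terms) or 1
    score = 0.0
    for term in query_terms:
        p = (1 - lam) * (tf[term] / dlen) + lam * coll_prob.get(term, 1e-9)
        score += math.log(p)
    return score
```

A tweet would then be returned for a topic when its score clears the run's relevance threshold.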
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Fig. 1. Methodology Flowchart</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Results</head><p>The SMERP track organizers used standard TREC metrics, namely Bpref, Precision@20, Recall@1000, and MAP, to evaluate the runs submitted by all teams; Bpref was the primary metric used to rank the teams. Table <ref type="table" target="#tab_0">1</ref> and Table <ref type="table">2</ref> show our results at both levels. In level 1, we achieved a higher Recall@1000 than the top team, dcu_ADAPT_run2; however, our Bpref was substantially lower. In the second run, we achieved better Precision@20, Recall@1000, and MAP than dcu_ADAPT_run2, but again a substantially lower Bpref. We will investigate the poor Bpref in the future. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusions And Future Work</head><p>In this paper, we applied three different retrieval techniques, namely Okapi BM25, cosine similarity, and a language model with Jelinek-Mercer smoothing, to the extraction task. Our results show that the BM25 model outperforms the other methods in terms of Bpref, Precision@20, Recall@1000, and mean average precision (MAP). We also found that our system reported a poor Bpref score at both levels, which will be investigated further. We note that the topics read more like questions, so text features such as named entities and verb phrases or relations should be considered in the ranking score in addition to the raw tweet text. Further on, ranking systems based on deep neural networks or logistic regression could be explored for better results.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>&lt;top&gt; &lt;num&gt; Number: SMERP-T1 &lt;title&gt; WHAT RESOURCES ARE AVAILABLE &lt;desc&gt; Description: Identify the messages which describe the availability of some resources.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>Task-1 (extraction) results, level-1</figDesc><table><row><cell>Sr_no</cell><cell>Run-id</cell><cell>Run type</cell><cell>Bpref</cell><cell>Precision@20</cell><cell>Recall@1000</cell><cell>MAP</cell></row><row><cell>1</cell><cell>daiict_irlab_2</cell><cell>Semi-automatic</cell><cell>0.3171</cell><cell>0.2250</cell><cell>0.3171</cell><cell>0.0417</cell></row><row><cell>2</cell><cell>daiict_irlab_1</cell><cell>Semi-automatic</cell><cell>0.3074</cell><cell>0.2125</cell><cell>0.3015</cell><cell>0.0391</cell></row><row><cell>3</cell><cell>dcu_ADAPT_run2 (top run)</cell><cell>Fully-automatic</cell><cell>0.6170</cell><cell>0.4125</cell><cell>0.1794</cell><cell>0.0517</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://en.wikipedia.org/wiki/Twitter</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://trec.nist.gov/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://nlp.stanford.edu:8080/parser/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://wordnet.princeton.edu/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://lucene.apache.org/core/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<ptr target="http://www.computing.dcu.ie/~dganguly/smerp2017/" />
		<title level="m">SMERP ECIR 2017 guidelines</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">CLIP at TREC 2015: Microblog and LiveQA</title>
		<author>
			<persName><forename type="first">M</forename><surname>Bagdouri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Oard</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>TREC</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">University of Waterloo at TREC 2015 Microblog Track</title>
		<author>
			<persName><forename type="first">L</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roegiest</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Clarke</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>TREC</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Simple dynamic emission strategies for microblog filtering</title>
		<author>
			<persName><forename type="first">L</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roegiest</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 39th International ACM SIGIR conference on Research and Development in Information Retrieval</title>
				<meeting>39th International ACM SIGIR conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1009" to="1012" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Earthquake shakes Twitter users: real-time event detection by social sensors</title>
		<author>
			<persName><forename type="first">T</forename><surname>Sakaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Okazaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matsuo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 19th international conference on World wide web</title>
				<meeting>19th international conference on World wide web</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="851" to="860" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
