<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Sensing Microblog for Effective Information Extractions</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Sindur</forename><surname>Patel</surname></persName>
							<email>sindurpatel@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Information Technology</orgName>
								<orgName type="institution">Charotar University of Science &amp; Technology</orgName>
								<address>
									<settlement>Changa</settlement>
									<country>Gujarat India</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Department of Information Technology</orgName>
								<orgName type="institution">Charotar University of Science &amp; Technology</orgName>
								<address>
									<settlement>Changa</settlement>
									<country>Gujarat India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nirav</forename><surname>Bhatt</surname></persName>
							<email>niravbhatt.it@charusat.ac.in</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Information Technology</orgName>
								<orgName type="institution">Charotar University of Science &amp; Technology</orgName>
								<address>
									<settlement>Changa</settlement>
									<country>Gujarat India</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Department of Information Technology</orgName>
								<orgName type="institution">Charotar University of Science &amp; Technology</orgName>
								<address>
									<settlement>Changa</settlement>
									<country>Gujarat India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chandni</forename><surname>Shah</surname></persName>
							<email>chandnishah.it@charusat.ac.in</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Information Technology</orgName>
								<orgName type="institution">Charotar University of Science &amp; Technology</orgName>
								<address>
									<settlement>Changa</settlement>
									<country>Gujarat India</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Department of Information Technology</orgName>
								<orgName type="institution">Charotar University of Science &amp; Technology</orgName>
								<address>
									<settlement>Changa</settlement>
									<country>Gujarat India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rutvika</forename><surname>Nanecha</surname></persName>
						</author>
						<title level="a" type="main">Sensing Microblog for Effective Information Extractions</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">4DEA4CF8B00772CB30E330EA7D857136</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:26+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Real-time data</term>
					<term>relevance information</term>
					<term>microblog</term>
					<term>twitter stream</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The SMERP 2017 data challenge track provided a set of tweets posted during the Italy earthquake. In this paper we apply BM25 and word2vec to retrieve information from the Twitter stream that is relevant to a user interest profile. The aim of these techniques is to find the real-world information most relevant to the query. We use query expansion to retrieve more relevant information, and the BM25 ranking function to identify important tweets and assign each a final score with respect to the user interest profile. The results of our method on this task show that it is effective.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>A microblog is a broadcast medium that allows users to post short, frequent messages <ref type="bibr" target="#b4">[5]</ref>. Compared with traditional information channels, microblogging has gained increasing attention among individuals, organizations, and research scholars in distinct disciplines.</p><p>Twitter is currently one of the fastest-growing microblogging services, with roughly 140 to 150 million users producing over 400 to 500 million tweets per day <ref type="bibr" target="#b4">[5]</ref>. It enables users to post status updates, or tweets, of no more than 140 characters to a network of followers using various communication services. Although tweet size is limited, Twitter is updated millions of times a day by users all over the world <ref type="bibr" target="#b4">[5]</ref>, and its data vary hugely with user interests and behaviors. Twitter data therefore contain a huge amount of information, ranging from news to events.</p><p>Twitter provides timely, real-time information about events. Monitoring, storing, and analyzing this user-generated content can yield important new information that is not available from traditional media <ref type="bibr" target="#b4">[5]</ref>. Tweets serve as live reporting of events <ref type="bibr" target="#b5">[6]</ref>, which makes it possible to find out what people are saying about conferences, debates, sporting events, and so on.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Challenges</head><p>A major problem with Twitter is that there are no rules governing what is posted, so some people provide false or incorrect information about events. Tweets also contain many spelling and grammar errors, improper sentence structure, and mixed languages, which makes it hard to distinguish important data from useless data. Moreover, not all tweets are relevant to the user query or interest profile.</p><p>One-way communication. Twitter often acts as a one-way communication platform: celebrities, TV shows, companies, and websites use it simply to get the word out rather than to build relationships.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Information Extraction System</head><p>This section introduces the system architecture for retrieving tweets and scoring them against the query. The system contains four components <ref type="bibr" target="#b1">[2]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Feature Extraction Components</head><p>This component extracts features from Twitter through the TREC-API (Streaming API and REST API). After obtaining the Twitter stream, we apply preprocessing and filtering to reduce the number of tweets that need to be processed.</p></div>
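<div xmlns="http://www.tei-c.org/ns/1.0"><p>The preprocessing and filtering step can be sketched as follows. This is only a minimal illustration under stated assumptions, not necessarily the exact pipeline of the submitted runs: the cleaning rules (lowercasing, removing URLs, user mentions, and the retweet marker), the function names preprocess and keep_tweet, and the min_tokens threshold are all hypothetical choices made for the example.</p>
<eg xml:space="preserve"><![CDATA[
```python
import re

# Patterns for tweet noise; these cleaning rules are assumptions for illustration.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z0-9]+")


def preprocess(text):
    """Lowercase a tweet, drop URLs, mentions and the RT marker, then tokenize."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t != "rt"]


def keep_tweet(tokens, profile_terms, min_tokens=3):
    """Keep a tweet only if it is long enough and shares a term with the profile."""
    if len(tokens) < min_tokens:
        return False
    return any(t in profile_terms for t in tokens)


if __name__ == "__main__":
    tweet = "RT @user: Earthquake in Italy, rescue teams needed http://t.co/abc"
    tokens = preprocess(tweet)
    print(tokens)
    print(keep_tweet(tokens, {"earthquake", "rescue"}))
```
]]></eg>
<p>Filtering on a shared profile term is a cheap way to discard most of the stream before any scoring is done; a stricter or looser rule may suit other data.</p></div>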
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Feature Representation Components</head><p>This component represents and expands semantic features using different expansion techniques. After extracting a tweet, we represent its features in a format suitable for calculating the relevance score between the tweet and a profile.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Candidate Generation Components</head><p>We classify each tweet into its most relevant profile, or discard it directly if it does not match any profile.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Scoring and Pushing Components</head><p>From the semantic features (considering only the verbs and nouns in the tweet text) and the social media attributes, we obtain a semantic score C_i and a quality score Q_i, so the final score is S_i = C_i Q_i.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Query Expansion Entries</head><p>The query provided by the user is unstructured and incomplete, so we need to expand and correct it to retrieve more relevant information.</p><p>The main problem in retrieval is that the query is short and cannot accurately describe the user's information need. Query expansion is a solution to this problem <ref type="bibr" target="#b2">[3]</ref>, <ref type="bibr" target="#b3">[4]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Word2Vec</head><p>To retrieve better results we use the word2vec model, which produces word embeddings <ref type="bibr" target="#b7">[8]</ref>. The model predicts the words surrounding each word: it is trained on a text corpus to maximize the conditional probability of the context given the word. It takes a large body of text as input and produces a vector space. We expand the query using this model and then retrieve the results.</p></div>
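<div xmlns="http://www.tei-c.org/ns/1.0"><p>Query expansion in an embedding space can be sketched as below. Training word2vec is out of scope for a short example, so the three-dimensional vectors in TOY_EMBEDDINGS are made up by hand purely for illustration; in practice they would come from a trained model. The helper names cosine and expand_query and the parameter top_k are likewise assumptions, and nearest-neighbour expansion by cosine similarity is one common choice rather than a description of the exact procedure used in the runs.</p>
<eg xml:space="preserve"><![CDATA[
```python
import math

# Hand-made toy vectors (NOT trained word2vec output), for illustration only.
TOY_EMBEDDINGS = {
    "earthquake": [0.9, 0.1, 0.0],
    "quake": [0.8, 0.2, 0.0],
    "tremor": [0.7, 0.3, 0.1],
    "rescue": [0.1, 0.9, 0.1],
    "relief": [0.2, 0.8, 0.2],
    "football": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_query(terms, embeddings, top_k=2):
    """Append the top_k nearest vocabulary words of each query term."""
    expanded = list(terms)
    for term in terms:
        if term not in embeddings:
            continue  # out-of-vocabulary terms are kept but not expanded
        neighbours = sorted(
            (w for w in embeddings if w != term),
            key=lambda w: cosine(embeddings[term], embeddings[w]),
            reverse=True,
        )
        for w in neighbours[:top_k]:
            if w not in expanded:
                expanded.append(w)
    return expanded


if __name__ == "__main__":
    print(expand_query(["earthquake", "rescue"], TOY_EMBEDDINGS, top_k=1))
```
]]></eg>
<p>The original terms are kept at the front of the expanded query, so expansion can only add candidate matches and never removes what the user asked for.</p></div>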
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Query Relevance Model</head><p>The query provided by the user is unstructured and incomplete, so we need to expand and correct it to retrieve more relevant information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">BM25</head><p>BM25 ("best matching") is a bag-of-words retrieval ranking function <ref type="bibr" target="#b5">[6]</ref> that ranks documents based on the user interest profile or query words appearing in each document <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. It was developed as part of the Okapi system at City University, London. The BM25 formula contains several parameters that need to be tuned from relevance assessments <ref type="bibr" target="#b8">[9]</ref>. Given a user interest profile P containing keywords p_1, ..., p_n, the BM25 score of document D is</p><formula xml:id="formula_0">\mathrm{Score}(D, P) = \sum_{i=1}^{n} \mathrm{IDF}(p_i) \cdot \frac{f(p_i, D)\,(k_1 + 1)}{f(p_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}<label>(1)</label></formula><p>Here f(p_i, D) is the term frequency of p_i in document D, |D| is the length of document D in words, and avgdl is the average document length in the text collection from which documents are drawn <ref type="bibr" target="#b6">[7]</ref>. k_1 and b are free parameters, usually chosen, in the absence of an advanced optimization, as k_1 ∈ [1.2, 2.0] and b ∈ [0.5, 0.8] <ref type="bibr" target="#b6">[7]</ref>. In our case, we used k_1 = 1.2 and b = 0.5. IDF(p_i) is the IDF weight of the profile keyword p_i <ref type="bibr" target="#b6">[7]</ref>.</p></div>
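<div xmlns="http://www.tei-c.org/ns/1.0"><p>The BM25 score defined in this section can be sketched in a few lines. The defaults k1 = 1.2 and b = 0.5 are the values reported above. The IDF formula is not specified in the text, so the smoothed, non-negative form used in idf below is an assumption, as are the function names and the toy documents.</p>
<eg xml:space="preserve"><![CDATA[
```python
import math


def idf(term, docs):
    """Smoothed, non-negative IDF; this particular form is an assumption."""
    n_docs = len(docs)
    n_with_term = sum(1 for d in docs if term in d)
    return math.log(1.0 + (n_docs - n_with_term + 0.5) / (n_with_term + 0.5))


def bm25_score(doc, profile, docs, k1=1.2, b=0.5):
    """Score one tokenized document against the profile keywords."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for term in profile:
        f = doc.count(term)  # term frequency f(p_i, D)
        if f == 0:
            continue  # a term absent from the document contributes nothing
        denom = f + k1 * (1.0 - b + b * len(doc) / avgdl)
        score += idf(term, docs) * f * (k1 + 1.0) / denom
    return score


if __name__ == "__main__":
    docs = [
        ["earthquake", "hits", "central", "italy"],
        ["rescue", "teams", "reach", "earthquake", "zone"],
        ["football", "match", "tonight"],
    ]
    profile = ["earthquake", "rescue"]
    ranked = sorted(docs, key=lambda d: bm25_score(d, profile, docs), reverse=True)
    for d in ranked:
        print(round(bm25_score(d, profile, docs), 4), " ".join(d))
```
]]></eg>
<p>In this toy collection the tweet containing both profile keywords ranks first and the unrelated tweet scores zero. With b = 0.5 the length normalization is only partial, which seems reasonable for tweets because their lengths vary little.</p></div>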
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">System Evaluation Result</head><p>Our results were evaluated by the SMERP 2017 data challenge track. SMERP reported bpref, Precision@20, Recall@1000, and MAP scores of 0.2021, 0.1625, 0.1830, and 0.0180, respectively. Without query expansion, the system scored 0.0218, 0.0875, 0.0218, and 0.0072, respectively. Table 1 shows the results; our run ID is charusat_smerp17_1.</p></div>
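<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers unfamiliar with the reported measures, three of them can be sketched as follows. This follows the standard textbook definitions of precision at k, recall at k, and average precision (whose mean over all queries gives MAP); it is not the official SMERP evaluation script, and the function names and tweet identifiers are hypothetical.</p>
<eg xml:space="preserve"><![CDATA[
```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant item occurs."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


if __name__ == "__main__":
    ranked = ["t1", "t2", "t3", "t4"]   # system output, best first
    relevant = {"t1", "t3", "t9"}       # judged relevant tweets
    print(precision_at_k(ranked, relevant, 2))
    print(recall_at_k(ranked, relevant, 4))
    print(average_precision(ranked, relevant))
```
]]></eg>
<p>Relevant tweets that are never retrieved (t9 in the toy data) still count in the denominator of recall and average precision, which is why a run can have reasonable precision at 20 and still a low MAP.</p></div>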
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>This paper presents our research on information retrieval from microblogs. We worked on the Italy earthquake data provided by the SMERP 2017 data challenge track and submitted two runs, one without word2vec and one with it. We observed that query expansion with word2vec gave better results. Training word2vec on large data is what brings the improvement in the result.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Different System Components</figDesc><graphic coords="3,126.25,147.40,251.70,335.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>SMERP Level-1 Evaluation Result Table</figDesc><table><row><cell>SL No</cell><cell>Team-id</cell><cell>Run-id</cell><cell>Run type</cell><cell>bpref</cell><cell>Precision@20</cell><cell>Recall@1000</cell><cell>MAP</cell></row><row><cell>1</cell><cell>DCU</cell><cell>dcu_ADAPT_run3</cell><cell>Semi-automatic</cell><cell>0.4407</cell><cell>0.1750</cell><cell>0.1256</cell><cell>0.0338</cell></row><row><cell>2</cell><cell>USI</cell><cell>USI_1</cell><cell>Semi-automatic</cell><cell>0.3286</cell><cell>0.5375</cell><cell>0.3183</cell><cell>0.1403</cell></row><row><cell>3</cell><cell>DAIICT</cell><cell>daiict_irlab_2</cell><cell>Semi-automatic</cell><cell>0.3171</cell><cell>0.2250</cell><cell>0.3171</cell><cell>0.0417</cell></row><row><cell>4</cell><cell>RU</cell><cell>rel_ru_nl_lang_analy</cell><cell>Semi-automatic</cell><cell>0.3153</cell><cell>0.2125</cell><cell>0.1913</cell><cell>0.0678</cell></row><row><cell>5</cell><cell>DAIICT</cell><cell>daiict_irlab_1</cell><cell>Semi-automatic</cell><cell>0.3074</cell><cell>0.2125</cell><cell>0.3015</cell><cell>0.0391</cell></row><row><cell>6</cell><cell>CSPIT</cell><cell>charusat_smerp17_1</cell><cell>Semi-automatic</cell><cell>0.2021</cell><cell>0.1625</cell><cell>0.1830</cell><cell>0.0180</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">NUDTSNA at TREC 2015 Microblog Track: A Live Retrieval System Framework for Social Network based on Semantic Expansion and Quality Model</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>TREC</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">CLIP at TREC 2015: Microblog and LiveQA</title>
		<author>
			<persName><forename type="first">M</forename><surname>Bagdouri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Oard</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>TREC</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Knowledge-based Query Expansion in Real-Time Microblog Search</title>
		<author>
			<persName><forename type="first">R</forename><surname>Qiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1503.03961</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Microblog retrieval using topical features &amp; query expansion</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>Lau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tjondronegoro</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2011">2011</date>
			<publisher>TREC</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A survey of techniques for event detection in twitter</title>
		<author>
			<persName><forename type="first">F</forename><surname>Atefeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Khreich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Intelligence</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="132" to="164" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<ptr target="http://webtrends.about.com/od/twitter/a/why_twitter_uses_for_twitter.htm" />
		<title level="m">Why Twitter</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<ptr target="https://en.wikipedia.org/wiki/Okapi_BM25" />
		<title level="m">Okapi BM25</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A machine learning approach for improved BM25 retrieval</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Svore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Burges</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 18th ACM conference on Information and knowledge management</title>
				<meeting>18th ACM conference on Information and knowledge management</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="1811" to="1814" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
