<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Multilingual Microblog Summarization</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Sindur</forename><surname>Patel</surname></persName>
							<email>sindurpatel@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Information Technology</orgName>
								<orgName type="institution">Charotar University of Science &amp; Technology</orgName>
								<address>
									<settlement>Changa</settlement>
									<region>Gujarat</region>
									<country>India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nirav</forename><surname>Bhatt</surname></persName>
							<email>niravbhatt.it@charusat.ac.in</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Information Technology</orgName>
								<orgName type="institution">Charotar University of Science &amp; Technology</orgName>
								<address>
									<settlement>Changa</settlement>
									<region>Gujarat</region>
									<country>India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chandni</forename><surname>Shah</surname></persName>
							<email>chandnishah.it@charusat.ac.in</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Information Technology</orgName>
								<orgName type="institution">Charotar University of Science &amp; Technology</orgName>
								<address>
									<settlement>Changa</settlement>
									<region>Gujarat</region>
									<country>India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rutvika</forename><surname>Nanecha</surname></persName>
						</author>
						<title level="a" type="main">Multilingual Microblog Summarization</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">58E494A340111C52ED384D53C21AC798</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Real-time data</term>
					<term>Social media</term>
					<term>clustering</term>
					<term>Multi-document summarization</term>
					<term>Information Search and Retrieval</term>
					<term>Web-based services</term>
					<term>Microblog</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Microblogging is a prominent e-communication medium on which users post short updates about personal matters and about events that are happening or about to happen. The volume of information is large, and because of the medium's popularity much of it is redundant or irrelevant. This paper presents techniques for summarizing the content of microblog sites such as Twitter, where users circulate an enormous number of short posts related to ongoing situations or events. The technique focuses on finding the factual information most similar to a query and uses a ranking function to retrieve the top-ranked tweets related to that query. A similarity measure is then applied to the top-ranked relevant tweets to detect novel tweets, minimizing similarity and maximizing dissimilarity among the selected tweets. Finally, a threshold-based decision is used to build a summary of novel tweets.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Microblogging is a popular e-communication medium on which users circulate short posts about incidents related to their personal lives or surrounding events. It is simpler and faster than traditional forms of communication and continues to grow in popularity in every area.</p><p>Twitter is one of the most prominent microblog sites at present. It allows users to post short status updates of no more than 140 characters, known as tweets. Every day people post hundreds of millions of tweets from different parts of the world, socializing and interacting with each other on a day-to-day basis.</p><p>The information in Twitter posts depends on users' attention and changes according to their interests. Therefore, Twitter streams contain a large and diverse amount of information ranging from daily-life stories to the latest local and worldwide news and events <ref type="bibr" target="#b0">[1]</ref>.</p><p>In addition, the sheer volume of posts makes it nearly impossible to control and regulate the system. Twitter suffers from spam and irrelevant posts that reduce its utility to some extent, and most of the content is unstructured, containing duplicates and errors <ref type="bibr" target="#b1">[2]</ref>. Millions of tweets are posted, so people have no time to read them all. There is a need for effective algorithms for search, extraction, and summarization of this information, which could create a coherent and comprehensive overview of a topic presented from several points of view <ref type="bibr" target="#b2">[3]</ref>. This paper therefore finds the real-world information most similar to a query and uses a ranking function to retrieve the top-ranked tweets related to the query. A similarity measure is applied to the top-ranked relevant tweets to detect novel tweets, minimizing similarity and maximizing dissimilarity among them, and a threshold-based decision is used to build a summary of novel tweets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Challenges</head><list><item>Limited content of a single post;</item><item>Huge number of posts (above 400 million updates circulate every day on Twitter);</item><item>Many posts do not give significant, valid, and useful information;</item><item>Users search for information based on named entities such as organizations, people, places, and events;</item><item>Many posts contain opinions and sentiments;</item><item>Diverse people belonging to different regions post tweets on the same event.</item></list></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Objectives</head><list><item>Design and implement a system to retrieve the most relevant information from Twitter;</item><item>Cluster the data and construct a tweet summary of up to 100 novel tweets from the set of relevant tweets for a given interest profile.</item></list></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Problem Statement</head><p>Given a set of tweets T = {T1, T2, ..., Tn} and a set of queries Q = {Q1, Q2, ..., Qn}, let F: T → S be the summarization function. The summary S = {s1, s2, ..., sn} is formed from the relevant tweets RT = {rt1, rt2, ..., rtn}, where rti is a relevant tweet for a particular interest profile. The summary is a batch of the top 100 ranked tweets per day per interest profile in which any two tweets have a similarity below the threshold, sim(t1, t2) &lt; Ts. Here dissim denotes the dissimilarity of a set of tweets and sim denotes its similarity. The objective is:</p><formula>\max \sum \mathrm{dissim}(T), \qquad \min \sum \mathrm{sim}(T) <label>(1)</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">System Architecture</head><p>In this section, we identify a batch of the top 100 ranked tweets per interest profile.</p><p>At a high level, the results provide relevant and novel information for summarization. Our system architecture mainly contains four modules.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data Cleaning Module</head><p>We pre-process all raw tweets by lowercasing them and removing hashtags, hyperlinks, and punctuation. We then filter out the tweets that do not contain any keyword of an interest profile; the remaining tweets form the candidate collection of possibly relevant tweets for that profile.</p></div>
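<div xmlns="http://www.tei-c.org/ns/1.0"><p>The cleaning and filtering steps described above can be sketched as follows. This is a minimal illustration rather than the system's actual code: the function names, the regular expressions, and the choice to drop a hashtag together with its word (instead of only the "#" sign) are assumptions.</p>
<p><code lang="python"><![CDATA[
```python
import re
import string

# Assumed patterns: the paper does not specify how links and hashtags are matched.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Lowercase a tweet and strip hyperlinks, hashtags, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())  # collapse repeated whitespace


def candidate_tweets(tweets, keywords):
    """Keep only cleaned tweets that contain at least one profile keyword."""
    keyword_set = {k.lower() for k in keywords}
    candidates = []
    for raw in tweets:
        cleaned = clean_tweet(raw)
        if keyword_set.intersection(cleaned.split()):
            candidates.append(cleaned)
    return candidates
```
]]></code></p>
<p>For instance, calling candidate_tweets with the keyword "flood" keeps a tweet that mentions a flood and drops an unrelated greeting.</p></div>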
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Query Expansion Module</head><p>The query provided by the user is unstructured and incomplete. We therefore expand and correct the query in order to retrieve more relevant information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Relevance Ranking Module</head><p>We use a ranking function to measure the relevance between the query and the tweets.</p><p>All tweets are then ranked by their relevance score, and the top-ranked tweets for each interest profile are selected.</p></div>
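<div xmlns="http://www.tei-c.org/ns/1.0"><p>The ranking step can be illustrated with a small sketch. The paper does not state which ranking function is used, so the score below (the fraction of query terms that occur in a tweet) is a stand-in chosen only for illustration; relevance_score, rank_tweets, and the top_k parameter are hypothetical names.</p>
<p><code lang="python"><![CDATA[
```python
def relevance_score(query_terms, tweet):
    """Assumed score: fraction of the query terms that appear in the tweet."""
    terms = set(query_terms)
    if not terms:
        return 0.0
    tokens = set(tweet.split())
    return len(terms.intersection(tokens)) / len(terms)


def rank_tweets(query_terms, tweets, top_k=100):
    """Return the top_k (score, tweet) pairs, highest score first."""
    scored = [(relevance_score(query_terms, t), t) for t in tweets]
    # list.sort is stable, so tweets with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```
]]></code></p>
<p>Any other scoring function with the same signature could replace relevance_score without changing rank_tweets.</p></div>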
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Novelty Detection Module</head><p>Once we obtain the top-ranked tweet list from relevance ranking, we detect novelty for each tweet in the list until we have collected enough tweets to push into the summary. For novelty, we compare tweets using the cosine similarity function. The module makes threshold-based decisions: it considers only tweets with a similarity score above the relevance threshold. A tweet is considered novel if its similarity score does not exceed a novelty threshold Tr compared to any of the already pushed tweets; otherwise, the system ignores it. All tweets whose similarity score is below the threshold are added to the pushed-tweet pool, from which the summary is made.</p></div>
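<div xmlns="http://www.tei-c.org/ns/1.0"><p>The novelty decision can be sketched as a greedy pass over the ranked list. This is an illustrative sketch under stated assumptions: tweets are compared as bags of whitespace-separated words, the threshold value 0.6 is an arbitrary placeholder for Tr, and the function names are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two texts using raw term counts (assumed weighting)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def select_novel(ranked_tweets, novelty_threshold=0.6, max_summary=100):
    """Push a tweet only if it is not too similar to any tweet already pushed."""
    pushed = []
    for tweet in ranked_tweets:
        if len(pushed) >= max_summary:
            break
        # Novel: similarity to every pushed tweet does not exceed the threshold.
        if all(cosine(tweet, p) <= novelty_threshold for p in pushed):
            pushed.append(tweet)
    return pushed
```
]]></code></p>
<p>Because the input is already ordered by relevance, the first tweet of each group of near-duplicates is the one that reaches the summary.</p></div>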
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Approach</head><p>In this section, we describe the strategies used for summarization. The top-ranked relevant tweets serve as input. To minimize similarity and maximize dissimilarity among tweets, we apply the proposed algorithms, together with a decision-making function, to produce a summary of relevant tweets as output.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Cosine Similarity</head><p>Cosine similarity is a measure of similarity between two nonzero vectors of an inner product space that measures the cosine of the angle between them.</p><p>It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two diametrically opposed vectors have a similarity of −1, independent of their magnitude.</p><p>The cosine of two nonzero vectors can be derived by using the Euclidean dot product formula:</p><formula xml:id="formula_0">\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} <label>(2)</label></formula><p>The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 indicating orthogonality (decorrelation) and in-between values indicating intermediate similarity or dissimilarity.</p></div>
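<div xmlns="http://www.tei-c.org/ns/1.0"><p>Equation (2) and the three reference cases (same orientation, orthogonal, opposite) can be checked numerically with a short sketch. The function name cosine_similarity and the use of plain Python lists as vectors are assumptions made for illustration.</p>
<p><code lang="python"><![CDATA[
```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, nonzero numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1, 2], [2, 4])  # close to 1: magnitude is ignored
orthogonal = cosine_similarity([1, 0], [0, 1])      # 0: vectors at 90 degrees
opposite = cosine_similarity([1, 0], [-1, 0])       # -1: diametrically opposed
```
]]></code></p>
<p>Term-count vectors built from tweets have no negative components, so for such vectors the similarity lies between 0 and 1.</p></div>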
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Jaccard Similarity</head><p>The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets:</p><formula xml:id="formula_2">J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1 <label>(3)</label></formula><p>If A and B are both empty, J(A, B) = 1.</p><p>The Jaccard distance measures dissimilarity between sample sets:</p><formula xml:id="formula_4">J_\delta(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} = 1 - J(A, B) <label>(4)</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">System Evaluation Result</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>This paper presents a system architecture for real-time microblog summarization based on two similarity measures, cosine similarity and Jaccard similarity. A relevance ranking model ranks the candidate tweets, and similarity-based strategies then measure novelty between tweets. A threshold-based decision is used to build the summary, which gives a better result. In future work, we will try to obtain more accurate results using the proposed algorithms and by providing more training to the system.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Different System Components</figDesc><graphic coords="4,137.60,158.90,345.80,359.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>SMERP Level-1 Summarization Evaluation Result Table</figDesc><table><row><cell>Our system has been evaluated by the SMERP 2017 data challenge track. The</cell></row><row><cell>evaluation score in terms of Recall (ROUGE-1), Recall (ROUGE-2), Recall</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A survey of techniques for event detection in twitter</title>
		<author>
			<persName><forename type="first">F</forename><surname>Atefeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Khreich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Intelligence</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="132" to="164" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Raghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
		<title level="m">Introduction to information retrieval</title>
				<meeting><address><addrLine>Cambridge</addrLine></address></meeting>
		<imprint>
			<publisher>Cambridge university press</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page">496</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A study of global inference algorithms in multi-document summarization</title>
		<author>
			<persName><forename type="first">R</forename><surname>McDonald</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. European Conference on Information Retrieval</title>
				<meeting>European Conference on Information Retrieval<address><addrLine>Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="557" to="564" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
