<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual Microblog Summarization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sindur Patel</string-name>
          <email>sindurpatel@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nirav Bhatt</string-name>
          <email>niravbhatt.it@charusat.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chandni Shah</string-name>
          <email>chandnishah.it@charusat.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rutvika Nanecha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Technology, Charotar University of Science &amp; Technology</institution>
          ,
          <addr-line>Changa, Gujarat</addr-line>
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Microblogging is a prominent e-communication medium on which users post short updates about personal matters and about events that are happening or about to happen. The volume of information is large, and because of the medium's popularity much of it is redundant or irrelevant. This paper presents effective techniques for summarizing the content of microblog sites such as Twitter. Twitter data consists of a huge number of short posts circulated by users about ongoing situations or events. Our technique focuses on finding the factual information most similar to a query and uses a ranking function to retrieve the top-ranked tweets related to that query. A similarity measure is then applied to the top-ranked relevant tweets to detect novel tweets, minimizing similarity and maximizing dissimilarity among the selected tweets. Finally, a threshold-based decision is used to produce a summary of novel tweets.</p>
      </abstract>
      <kwd-group>
        <kwd>Real-time data</kwd>
        <kwd>Social media</kwd>
        <kwd>Clustering</kwd>
        <kwd>Multi-document summarization</kwd>
        <kwd>Information Search and Retrieval</kwd>
        <kwd>Web-based services</kwd>
        <kwd>Microblog</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Microblogging is a popular e-communication medium on which users circulate
short posts about incidents in their personal lives or surrounding events. It is
simpler and faster than traditional forms of communication and continues to gain
popularity in every area.</p>
      <p>Twitter is one of the most prominent microblogging sites at present. It allows
users to post short, persistent status updates of no more than 140 characters, known
as tweets. Every day, people around the world post hundreds of millions of tweets.
People can socialize and interact with each other on a day-to-day basis.</p>
      <p>
        The content of Twitter posts depends on what users pay attention to and changes
with their interests. Therefore, Twitter streams contain a large and diverse amount
of information, ranging from daily-life stories to the latest local and worldwide news
and events [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In addition, the sheer volume of posts makes it nearly impossible to control and
regulate the system. Twitter suffers from spam and irrelevant posts that reduce its
utility to some extent, and most of the content is unstructured, containing duplicates
and errors [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Millions of tweets are posted, so people have no time to read them all.
There is a need for effective algorithms for the search, extraction, and
summarization of this information, which could create a coherent and comprehensive
overview of a topic presented from several points of view [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This paper therefore finds the real-world information most similar to a
query and uses a ranking function to retrieve the top-ranked tweets related to that
query. A similarity measure is then applied to the top-ranked relevant tweets to
detect novel tweets, minimizing similarity and maximizing dissimilarity among them.
Finally, a threshold-based decision is used to produce a summary of novel tweets.
      </p>
      <sec id="sec-1-1">
        <title>1.1 Challenges</title>
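        <list list-type="bullet">
          <list-item><p>The content of a single post is limited.</p></list-item>
          <list-item><p>The number of posts is huge (more than 400 million updates circulate every day on Twitter).</p></list-item>
          <list-item><p>Many posts do not give significant, valid, and useful information.</p></list-item>
          <list-item><p>Users search for information based on named entities such as organizations, people, places, and events.</p></list-item>
          <list-item><p>Many posts contain opinions and sentiments.</p></list-item>
          <list-item><p>Diverse people from different regions post tweets about the same event.</p></list-item>
        </list>
      </sec>
      <sec id="sec-1-2">
        <title>1.2 Objectives</title>
        <list list-type="bullet">
          <list-item><p>Design and implement a system to retrieve the most relevant information from Twitter.</p></list-item>
          <list-item><p>Cluster the data and construct a tweet summary of up to 100 novel tweets from the set of relevant tweets for a given interest profile.</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 Problem Statement</title>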
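      <p>We are given a set of tweets <inline-formula><tex-math>T = \{T_1, T_2, T_3, \ldots, T_n\}</tex-math></inline-formula> and a set of queries <inline-formula><tex-math>Q = \{Q_1, Q_2, Q_3, \ldots, Q_n\}</tex-math></inline-formula>.</p>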
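      <p>The summarization function is <inline-formula><tex-math>F : T \to S</tex-math></inline-formula>. The summary <inline-formula><tex-math>S = \{s_1, s_2, \ldots, s_n\}</tex-math></inline-formula> is formed from the relevant tweets <inline-formula><tex-math>RT = \{rt_1, rt_2, \ldots, rt_n\}</tex-math></inline-formula>, where <inline-formula><tex-math>rt_i</tex-math></inline-formula> denotes a tweet relevant to a particular interest profile.</p>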
      <p>The summary is drawn from a batch of the top 100 ranked tweets per day per
interest profile, in which any two tweets have a similarity below a threshold, that
is, <inline-formula><tex-math>\mathrm{sim}(t_1, t_2) &lt; T_s</tex-math></inline-formula>. Here dissim is the dissimilarity of a set of tweets and sim is the
similarity of a set of tweets, and the objective is:</p>
      <p><disp-formula><tex-math>\max \sum \mathrm{dissim}(T), \qquad \min \sum \mathrm{sim}(T)</tex-math></disp-formula></p>
      <p>In this section, we identify a batch of the top 100 ranked tweets per interest
profile. At a high level, the results provide relevant and novel information for
summarization. Our system architecture mainly contains four modules.</p>
      <sec id="sec-2-1">
        <title>3.1 Data Cleaning Module</title>
        <p>We pre-process all raw tweets by lowercasing them and removing hashtags,
hyperlinks, and punctuation. We then filter out tweets that contain none of the
keywords of an interest profile; the remaining tweets form the candidate tweet
collection from which possibly relevant tweets are identified for each profile.</p>
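        <p>The cleaning and keyword-filtering steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluated implementation: the function names <monospace>clean_tweet</monospace> and <monospace>candidate_tweets</monospace> are hypothetical, a hashtag is assumed to be removed as a whole token, and keyword matching is assumed to be exact word matching on the cleaned text.</p>
        <preformat><![CDATA[
```python
import re
import string

# Assumed patterns for hyperlinks and hashtags (illustrative, not from the paper).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Lowercase a tweet and strip hyperlinks, hashtags, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)      # assumption: drop the whole hashtag token
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())         # collapse repeated whitespace


def candidate_tweets(raw_tweets, profile_keywords):
    """Keep cleaned tweets that share at least one word with the profile keywords."""
    keywords = {k.lower() for k in profile_keywords}
    candidates = []
    for raw in raw_tweets:
        cleaned = clean_tweet(raw)
        if keywords & set(cleaned.split()):
            candidates.append(cleaned)
    return candidates
```
]]></preformat>
        <p>Because hashtags are removed before filtering in this sketch, a keyword that occurs only inside a hashtag would not match; stripping just the leading symbol instead is an equally plausible reading of the cleaning step.</p>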
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Query Expansion Module</title>
        <p>The query provided by the user is typically unstructured and incomplete. We
therefore expand and correct the query to retrieve more relevant information.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.3 Relevance Ranking Module</title>
        <p>We use a ranking function to measure the relevance between the query and each
tweet. All tweets are then ranked by their relevance score, and the top-ranked tweets
for each interest profile are selected.</p>
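        <p>The ranking function is not specified further here, so the following is only a sketch of how such a module might look. It assumes a simple TF-IDF weighted term-overlap score; the function name <monospace>rank_tweets</monospace>, the smoothed IDF formula, and the default of 100 results are illustrative assumptions.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def rank_tweets(query, tweets, top_k=100):
    """Rank cleaned tweets against a query with an assumed TF-IDF overlap score.

    Returns up to top_k (tweet, score) pairs, highest score first,
    omitting tweets that share no term with the query.
    """
    docs = [t.split() for t in tweets]
    n = len(docs)

    # Document frequency: in how many tweets each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))

    def idf(term):
        # Smoothed inverse document frequency (an assumption of this sketch).
        return math.log((n + 1) / (df[term] + 1)) + 1.0

    q_terms = set(query.split())
    scored = []
    for tweet, d in zip(tweets, docs):
        tf = Counter(d)
        score = sum(tf[t] * idf(t) for t in q_terms)
        scored.append((score, tweet))

    # Stable sort: tweets with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(tweet, score) for score, tweet in scored[:top_k] if score > 0]
```
]]></preformat>
        <p>Any other relevance model could be substituted for the scoring line without changing the rest of the pipeline, since later modules only consume the ranked list.</p>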
      </sec>
      <sec id="sec-2-4">
        <title>3.4 Novelty Detection Module</title>
        <p>Once we obtain the top-ranked tweet list from relevance ranking, we test each
tweet in the list for novelty until we have collected enough tweets to push into the
summary. To measure novelty, we compare tweets using the cosine similarity function.
The module makes a threshold-based decision: it considers a tweet only if its
similarity score is above a relevance threshold. A tweet is considered novel if its
similarity score does not exceed a novelty threshold <inline-formula><tex-math>T_r</tex-math></inline-formula> when compared with any of the
previously pushed tweets; otherwise, the system ignores it. All tweets whose
similarity score stays below the threshold are added to the pushed-tweet pool, from
which the summary is built.</p>
        <p>In this section, we present strategies for summarization. They take the
top-ranked relevant tweets as input. To minimize similarity and maximize
dissimilarity among tweets, we apply the proposed algorithms, together with a
decision-making function, to produce a summary of relevant tweets as output.</p>
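        <p>The threshold-based novelty decision of Section 3.4 can be sketched as a greedy pass over the ranked list. This is an illustration under assumptions rather than the evaluated code: the names <monospace>select_novel</monospace> and <monospace>cosine</monospace> are hypothetical, tweets are compared as bag-of-words term-count vectors, and the threshold value 0.6 is a placeholder, not a value reported in this paper.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two tweets as bag-of-words count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def select_novel(ranked_tweets, novelty_threshold=0.6, max_summary=100, sim=cosine):
    """Greedily push tweets whose similarity to every pushed tweet stays
    at or below the novelty threshold (placeholder value, an assumption)."""
    pushed = []
    for tweet in ranked_tweets:
        if len(pushed) >= max_summary:
            break
        # Novel only if it does not exceed the threshold against any pushed tweet.
        if all(sim(tweet, p) <= novelty_threshold for p in pushed):
            pushed.append(tweet)
    return pushed
```
]]></preformat>
        <p>Because the input is already ordered by relevance, the greedy pass keeps the more relevant member of any near-duplicate pair. Passing a different function as <monospace>sim</monospace> lets the same loop run with another similarity measure.</p>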
      </sec>
      <sec id="sec-2-5">
        <title>4.1 Cosine Similarity</title>
        <p>Cosine similarity is a measure of similarity between two nonzero vectors of an inner
product space that measures the cosine of the angle between them.</p>
        <p>It is thus a judgment of orientation and not magnitude: two vectors with the same
orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and
two vectors diametrically opposed have a similarity of -1, independent of their
magnitude.</p>
        <p>The cosine of two nonzero vectors can be derived from the Euclidean dot
product formula:</p>
        <p><disp-formula><tex-math>\mathrm{similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}</tex-math></disp-formula></p>
        <p>The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning
exactly the same, with 0 indicating orthogonality (decorrelation) and in-between
values indicating intermediate similarity or dissimilarity.</p>
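        <p>As a worked illustration of the formula, the sketch below computes cosine similarity for plain numeric vectors and builds term-count vectors for two tweets. The function names are hypothetical, and representing tweets by raw term counts is an assumption of this example. Since term counts are non-negative, the similarity of two tweets represented this way falls between 0 and 1.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def tweet_vectors(t1, t2):
    """Term-count vectors for two tweets over their shared vocabulary."""
    c1, c2 = Counter(t1.split()), Counter(t2.split())
    vocab = sorted(set(c1) | set(c2))
    return [c1[w] for w in vocab], [c2[w] for w in vocab]


v1, v2 = tweet_vectors("flood relief", "flood warning")
print(cosine_similarity(v1, v2))
```
]]></preformat>
        <p>In the usage lines, the two tweets share one of their two words, so the vectors are (1, 1, 0) and (1, 0, 1) and the similarity is approximately 0.5.</p>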
      </sec>
      <sec id="sec-2-6">
        <title>4.2 Jaccard Similarity</title>
        <p>The Jaccard index, also known as the Jaccard similarity coefficient, is a
statistic used for comparing the similarity and diversity of sample sets. The Jaccard
coefficient measures similarity between finite sample sets and is defined as the size
of the intersection divided by the size of the union of the sample sets:</p>
        <p><disp-formula><tex-math>J(A, B) = \frac{|A \cap B|}{|A \cup B|}</tex-math></disp-formula></p>
        <p><disp-formula><tex-math>0 \le J(A, B) \le 1</tex-math></disp-formula></p>
        <p>If A and B are both empty, we define J(A, B) = 1.</p>
        <p>The Jaccard distance measures dissimilarity between sample sets:</p>
        <p><disp-formula><tex-math>J_{\delta}(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} = 1 - J(A, B)</tex-math></disp-formula></p>
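        <p>The Jaccard coefficient and distance translate directly into set operations. The sketch below is illustrative: the function names are hypothetical, and treating each tweet as the set of its whitespace-separated words is an assumption of the example.</p>
        <preformat><![CDATA[
```python
def jaccard_similarity(a, b):
    """J(A, B) = |A intersect B| / |A union B|, with J = 1 when both sets are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Jaccard distance, 1 - J(A, B)."""
    return 1.0 - jaccard_similarity(a, b)


t1 = "flood relief camp".split()
t2 = "flood relief today".split()
print(jaccard_similarity(t1, t2), jaccard_distance(t1, t2))
```
]]></preformat>
        <p>In the usage lines, the two tweets share two of four distinct words, so both the similarity and the distance are 0.5.</p>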
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5 System Evaluation Result</title>
      <p>Our system was evaluated in the SMERP 2017 data challenge track. The
evaluation scores reported by SMERP in terms of Recall (ROUGE-1), Recall (ROUGE-2),
Recall (ROUGE-L), and Recall (ROUGE-SU4) are 0.3471, 0.0622, 0.3233, and 0.1220,
respectively, and the run type is semi-automatic.</p>
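      <table-wrap>
        <table>
          <thead>
            <tr>
              <th>Run type</th>
              <th>Recall (ROUGE-1)</th>
              <th>Recall (ROUGE-2)</th>
              <th>Recall (ROUGE-L)</th>
              <th>Recall (ROUGE-SU4)</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Semi-automatic</td>
              <td>0.5540</td>
              <td>0.2436</td>
              <td>0.5142</td>
              <td>0.2864</td>
            </tr>
            <tr>
              <td>Semi-automatic</td>
              <td>0.5187</td>
              <td>0.2512</td>
              <td>0.4796</td>
              <td>0.2505</td>
            </tr>
            <tr>
              <td>Semi-automatic</td>
              <td>0.3515</td>
              <td>0.1297</td>
              <td>0.3254</td>
              <td>0.1194</td>
            </tr>
            <tr>
              <td>Semi-automatic</td>
              <td>0.3471</td>
              <td>0.0622</td>
              <td>0.3233</td>
              <td>0.1220</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>6 Conclusion</title>
      <p>This paper presents a system architecture for real-time microblog
summarization, together with two similarity techniques, cosine similarity and
Jaccard similarity. We apply a relevance ranking model to rank candidate tweets and
then use these strategies to measure novelty between tweets. A threshold-based
decision is then used to build the summary, which gives a better result. In future
work, we will try to obtain more accurate results using the proposed algorithms and
by providing more training to the system.</p>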
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Atefeh</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khreich</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>A survey of techniques for event detection in twitter</article-title>
          .
          <source>Computational Intelligence</source>
          ,
          <volume>31</volume>
          (
          <issue>1</issue>
          ),
          <fpage>132</fpage>
          -
          <lpage>164</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Introduction to information retrieval</article-title>
          , Vol.
          <volume>1</volume>
          , No.
          <issue>1</issue>
          , p.
          <fpage>496</fpage>
          . Cambridge University Press, Cambridge (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A study of global inference algorithms in multi-document summarization</article-title>
          .
          <source>In: Proc. European Conference on Information Retrieval</source>
          , pp.
          <fpage>557</fpage>
          -
          <lpage>564</lpage>
          . Springer, Berlin Heidelberg (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>