<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Microblog Contextualization using Continuous Space Vectors: Multi-Sentence Compression of Cultural Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elvys Linhares Pontes</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephane Huet</string-name>
          <email>stephane.huetg@univ-avignon.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan-Manuel Torres-Moreno</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Carneiro Linhares</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ecole Polytechnique de Montreal</institution>
          ,
          <addr-line>Montreal</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIA, Universite d'Avignon et des Pays de Vaucluse</institution>
          ,
          <addr-line>Avignon</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidade Federal do Ceara</institution>
          ,
          <addr-line>Sobral-CE</addr-line>
          ,
          <country country="BR">Brasil</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe our work for the MC2 CLEF 2017 lab. We participated in the content analysis task, which involves filtering, language recognition and summarization. We combine Information Retrieval with Multi-Sentence Compression methods to contextualize microblogs using Wikipedia's pages.</p>
      </abstract>
      <kwd-group>
        <kwd>Microblog Contextualization</kwd>
        <kwd>Multi-Sentence Compression</kwd>
        <kwd>Word Embedding</kwd>
        <kwd>Wikipedia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Many newspapers use microblogs (Twitter, Facebook, Instagram, etc.) to disseminate news quickly. These microblogs have a limited length (e.g. a tweet is limited to 140 characters) and contain little information about an event. It is therefore difficult to describe an event completely in a single microblog. A way to overcome this limitation is to gather additional information from another source to better explain the microblog.</p>
      <p>
        Several studies have been carried out on tweet contextualization. To name just a few, Liu et al. introduced a graph-based multi-tweet summarization system [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Their graph integrates features of social networks, partially compensating for the lack of information contained in tweets. Chakrabarti and Punera used a Hidden Markov Model to model the temporal events of sets of tweets [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Linhares Pontes et al. used word embeddings to reduce the vocabulary size and to improve the results of Automatic Text Summarization (ATS) systems [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. (This work was partially financed by the French ANR project GAFES of the Universite d'Avignon et des Pays de Vaucluse, France.)
      </p>
      <p>
        The MC2 CLEF 2017 lab [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] analyzes the context and the social impact of a microblog at large. This lab is composed of three main tasks: Content Analysis, Microblog Search and Time Line Illustration. We participated in the Content Analysis task, which involves classification, filtering, language recognition, localization, entity extraction, linking open data, and summarization of Wikipedia's pages and microblogs. Specifically, we worked on the following subtasks: filtering, language recognition and automatic summarization.
      </p>
      <p>The filtering subtask analyzes whether a tweet describes an existing festival or not (scores range between 0 and 1, with 1 for the positive case and 0 otherwise). The language recognition subtask consists in identifying the language of a microblog. Finally, the summarization subtask is to generate a summary (of at most 120 words), in four languages (English, French, Portuguese and Spanish), of the Wikipedia's pages describing a microblog.</p>
      <p>This paper is organized as follows. In Section 2 we describe the architecture of our system for the tasks of the MC2 CLEF 2017 lab. Then, we present the process of document retrieval on Wikipedia and the summarization system in Sections 3 and 4, respectively. Finally, we conclude in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>System Architecture</title>
      <p>The CLEF's organizers selected a set of microblogs (tweets) containing the keyword "festival" to be contextualized by the participants using four versions of Wikipedia (English, French, Portuguese, and Spanish).</p>
      <p>
        For the language identification task, we pre-processed the microblogs to remove all punctuation and emoticons. Then, we used the langdetect library [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to detect the language of each microblog.
      </p>
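      <p>As a concrete illustration of this preprocessing step (a minimal sketch, not our exact cleaning rules), punctuation, digits and ASCII emoticons can be stripped by keeping only letter sequences:</p>

```python
import re

# Minimal sketch of the microblog cleaning step: keep only letter sequences,
# which strips punctuation, digits and ASCII emoticons before the text is
# passed to the language detector.
def clean_microblog(text):
    return " ".join(re.findall(r"[^\W\d_]+", text, re.UNICODE))
```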
      <p>For the filtering and summarization tasks, we divided our system into two parts (Fig. 1). The first part aims at retrieving the Wikipedia's pages that best describe the festival mentioned in a microblog (Section 3). We scored the Wikipedia's pages according to their relevance with respect to a microblog, which corresponds to the filtering subtask.</p>
      <p>The second part of our system analyzes the 3 best scored pages and creates clusters of similar sentences containing relevant information. Then, we use an Automatic Text Compression (ATC) system (Section 4) to compress the clusters and to generate summaries, in four languages, describing the festival mentioned in a microblog (summarization subtask). Algorithm 1 describes how our method analyzes the microblog, selects the 3 best Wikipedia's pages and generates the summaries.</p>
      <p>[Fig. 1. System architecture. The language of each microblog is identified with the langdetect library (language recognition task); after preprocessing, Indri retrieves the 50 Wikipedia's pages most related to the microblog, which are scored with word occurrences and phrase embeddings (FastText) to rank the pages related to the microblog (filtering task); the title, summary and text of the 3 most similar pages are compressed with Multi-Sentence Compression, and the summaries generated in the four languages (en, es, fr, pt) are compared so that the most related version is selected and translated (summarization task).]</p>
      <p>Algorithm 1 Automatic Summarization
for each tweet do
  for lang in {English, French, Portuguese, Spanish} do
    Analyze the 3 lang Wikipedia's pages with the highest scores (Equation 4) using the LEMUR system (lang version of Wikipedia)
    for each sentence of the abstract of the first Wikipedia's page (highest score) do
      Create the cluster of similar sentences by analyzing the 3 highest scored pages
    end for
    for each cluster do
      Create the Word Graph (Section 4.1)
      Generate compressed sentences (Section 4.1)
    end for
    Generate the summary (lang language) with the compressed sentences that are the most similar to the tweet
  end for
  Select the best version of the summaries (most similar to the tweet)
  Translate the best summary version with the Yandex translator to the other languages
end for</p>
    </sec>
    <sec id="sec-retrieval">
      <title>Wikipedia's Document Retrieval</title>
      <p>The set of CLEF's microblogs is composed of tweets in different languages, related to festivals all over the world. Wikipedia provides a more or less thorough description of a given festival depending on the selected language (e.g. the Festival of Avignon is better described in the French Wikipedia). We independently analyze the four versions of Wikipedia (en, es, fr, and pt) for each microblog, repeating the whole process to first retrieve the best Wikipedia's pages and then to summarize these pages, for each of the four versions of Wikipedia.</p>
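      <p>The per-tweet loop of Algorithm 1 can be sketched as follows; every helper passed in is a hypothetical stand-in for a component of the pipeline described in this paper, not a real API:</p>

```python
# Schematic sketch of Algorithm 1. All helper callables (retrieve_pages,
# cluster, compress, similarity, translate) are hypothetical stand-ins.
def contextualize(tweet, retrieve_pages, cluster, compress, similarity, translate):
    summaries = {}
    for lang in ("en", "es", "fr", "pt"):
        pages = retrieve_pages(tweet, lang)[:3]           # 3 best-scored pages (Equation 4)
        clusters = cluster(pages)                         # one cluster per abstract sentence
        compressions = [compress(c) for c in clusters]    # MSC on each cluster (Section 4.1)
        ranked = sorted(compressions, key=lambda s: similarity(tweet, s), reverse=True)
        summaries[lang] = " ".join(" ".join(ranked).split()[:120])  # 120-word limit
    best = max(summaries, key=lambda l: similarity(tweet, summaries[l]))
    return {l: summaries[best] if l == best else translate(summaries[best], l)
            for l in summaries}
```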
      <p>
        Our system retrieves the Wikipedia's pages most related to a microblog using a method similar to our previous work [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We assume that the hashtags and usernames represent the keywords of a tweet and are independent of the language. From hashtags, usernames, and the plain text (i.e. the tweet without hashtags, usernames and punctuation), we create Indri queries to retrieve 50 Wikipedia's documents for each tweet. For each of these documents, we analyze the title and the summary in relation to the tweet's elements (hashtags, usernames and words). Normally, the title of the Wikipedia's document has few words and contains the core information, while the summary of the document, which is made of the first paragraphs of the article before the start of the first section, is larger and provides more information. Therefore, we consider Equation 4 to compute the relevance score of the Wikipedia's document D with respect to the microblog T.
      </p>
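      <p>The decomposition of a tweet into hashtags, usernames and plain text can be sketched as follows (an illustrative regex-based version, not our exact implementation):</p>

```python
import re

# Illustrative decomposition of a tweet into hashtags (ht), usernames (un)
# and normal words (nw), as used to build the retrieval queries.
def tweet_elements(tweet):
    ht = re.findall(r"#(\w+)", tweet)          # hashtags
    un = re.findall(r"@(\w+)", tweet)          # usernames
    plain = re.sub(r"[#@]\w+", " ", tweet)     # drop hashtags and usernames
    nw = re.findall(r"\w+", plain)             # plain text without punctuation
    return ht, un, nw
```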
      <p>score_title(D) = α1 · sim(ht, title) + α2 · sim(un, title) + α3 · sim(nw, title)  (1)
score_sum(D) = β1 · sim(ht, sum) + β2 · sim(un, sum) + β3 · sim(nw, sum)  (2)
sim(x, y) = γ1 · cosine(x, y) + γ2 · occur(x, y)  (3)
score_doc(D) = score_title(D) + score_sum(D)  (4)</p>
      <p>
        where ht are the hashtags of the tweet T, un the usernames of T, nw the normal words of T, and sum the summary of D. occur(x, y) represents the number of occurrences of x in y, while cosine(x, y) is the cosine similarity between x and y using Continuous Space Vectors [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
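      <p>A minimal sketch of Equation 3, in which a bag-of-words cosine stands in for the embedding-based similarity actually used:</p>

```python
from collections import Counter
from math import sqrt

# Sketch of Equation 3: sim(x, y) = gamma1 * cosine(x, y) + gamma2 * occur(x, y).
# The bag-of-words cosine below is a stand-in for the FastText embedding cosine.
def cosine(x_words, y_words):
    cx, cy = Counter(x_words), Counter(y_words)
    dot = sum(cx[w] * cy[w] for w in cx)
    nx = sqrt(sum(v * v for v in cx.values()))
    ny = sqrt(sum(v * v for v in cy.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def occur(x_words, y_words):
    return sum(1 for w in x_words if w in y_words)  # occurrences of x in y

def sim(x_words, y_words, gamma1=1.0, gamma2=0.5):
    return gamma1 * cosine(x_words, y_words) + gamma2 * occur(x_words, y_words)
```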
      <p>
        We set the parameters empirically as follows: α1 = α2 = 0.1, α3 = 0.01, β1 = β2 = 0.05, β3 = 0.005, γ1 = 1 and γ2 = 0.5. These coefficients give more weight to hashtags than to usernames and the tweet text, and compensate for the shorter length of Wikipedia's article titles with respect to their summaries. For each tweet, we finally keep in each language the 3 Wikipedia's documents with the highest scores to be analyzed by the ATC system.
      </p>
      <p>
        Note that the langdetect library is only used for the language recognition subtask. We did not consider the whole text of the Wikipedia's pages because it is sometimes huge; we preferred to rely on the work of the contributors who wrote the summary of the article. For the cosine similarity, we used the pre-trained word embeddings (en, es, fr, and pt) of the FastText system [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], available at https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md.
      </p>
      <p>
        The summary provided at the start of Wikipedia's pages is assumed to be coherent and to provide the basic information. However, relying only on this part of the article may miss relevant information about the festival that could be obtained from other sections or even other pages of Wikipedia. For this reason, we preferred to use the summary of the top article as a basic abstract and to improve its quality with relevant information using Multi-Sentence Compression (MSC), i.e. by generating sentences that are shorter and more informative than the original sentences of the summary. Therefore, we consider the sentences of the summary of the best scored page as key sentences. Then, for each of these sentences, we create a cluster made of the similar sentences found in the complete text of the 3 retrieved Wikipedia's pages; to do this, the cosine similarity is used as metric and we empirically set a threshold of 0.4 to consider two sentences as similar.
      </p>
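      <p>The cluster construction described above can be sketched as follows, with a hypothetical sentence_sim callable standing in for the embedding-based cosine similarity:</p>

```python
# Sketch of the cluster construction: each key sentence of the top abstract
# gathers the sentences of the 3 retrieved pages whose similarity reaches the
# threshold (0.4 in our experiments). `sentence_sim` is a hypothetical
# stand-in for the embedding-based cosine similarity.
def build_clusters(key_sentences, all_sentences, sentence_sim, threshold=0.4):
    clusters = []
    for key in key_sentences:
        cluster = [key] + [s for s in all_sentences
                           if s != key and sentence_sim(key, s) >= threshold]
        clusters.append(cluster)
    return clusters
```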
      <p>Then, for each cluster, MSC generates a shorter and hopefully more informative compression (Section 4.1). Next, we generate the summary, composed of the first 120 words, by concatenating the compressions that are the most similar to the microblog.</p>
      <p>Some language versions of Wikipedia do not have a page, or have only a short description, for a specific festival. Therefore, we analyzed the summaries obtained for each microblog in the four studied languages and only retained the one that contains the best description of the microblog, estimated through the similarity between each summary and the microblog. We then used the Yandex translation API (https://tech.yandex.com/translate/) to translate the kept summary into the other languages (en, es, fr, and pt).</p>
    </sec>
    <sec id="sec-msc">
      <title>Word Graph and Optimization</title>
      <p>
        Our MSC system adopts the approach proposed by Filippova [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to model a document D as a Word Graph (WG), where the vertices represent the words and the arcs represent the cohesion between words (more details in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). The weight of an arc represents the level of cohesion between the words of its two vertices, based on the frequency and the position of these words in the sentences (Equations 5-7):
      </p>
      <p>w(e_{i,j}) = cohesion(e_{i,j}) / (freq(i) × freq(j))  (5)
cohesion(e_{i,j}) = (freq(i) + freq(j)) / Σ_{f∈D} dist(f,i,j)^(-1)  (6)
dist(f,i,j) = pos(f,j) − pos(f,i) if pos(f,i) &lt; pos(f,j), and 0 otherwise  (7)</p>
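      <p>Equations 5-7 can be transcribed in a few lines; the sketch below is a simplified illustration that assumes each word occurs at most once per sentence, so that pos(f, w) is simply the index of w in f:</p>

```python
from collections import Counter
from math import inf

# Illustrative transcription of Equations 5-7 (not the authors' code): arc
# weight between words i and j over a cluster of tokenized sentences,
# assuming at most one occurrence of each word per sentence.
def arc_weight(sentences, i, j):
    freq = Counter(w for s in sentences for w in s)        # freq(i), freq(j)
    inv_dist = 0.0
    for f in sentences:
        if i in f and j in f and f.index(i) < f.index(j):
            inv_dist += 1.0 / (f.index(j) - f.index(i))    # dist(f,i,j)^(-1)
    if inv_dist == 0.0:
        return inf  # the words never co-occur in order: weakest possible arc
    cohesion = (freq[i] + freq[j]) / inv_dist              # Equation 6
    return cohesion / (freq[i] * freq[j])                  # Equation 5
```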
      <p>
        In a previous study, we proposed to extend this approach with the analysis of the keywords and the 3-grams of the document in order to generate a more informative compression. Since each cluster to compress is composed of similar sentences, we consider that it deals with a single topic; the Latent Dirichlet Allocation (LDA) method is used to identify the keywords of this topic [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>From the weight of the 2-grams (Equation 5), the relevance of a 3-gram is based on the relevance of its two component 2-grams, as described in Equation 8:</p>
      <p>3-gram(i,j,k) = qt3(i,j,k) / max_{a,b,c∈GP} qt3(a,b,c) × (w(e_{i,j}) + w(e_{j,k})) / 2  (8)</p>
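      <p>Equation 8 amounts to scaling the average weight of the two component 2-grams by the normalized frequency of the 3-gram, as in this direct transcription (the qt3 values are assumed to be 3-gram counts in the cluster):</p>

```python
# Equation 8: relevance of the 3-gram (i, j, k) from its frequency qt3_ijk,
# the maximum 3-gram frequency in the graph qt3_max, and the weights of its
# two component 2-grams (Equation 5).
def trigram_relevance(qt3_ijk, qt3_max, w_ij, w_jk):
    return qt3_ijk / qt3_max * (w_ij + w_jk) / 2
```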
      <p>In order to generate a better compression, the objective function expressed in Equation 9 is minimized so as to jointly improve informativeness and grammaticality:</p>
      <p>
        Minimize  α · Σ_{(i,j)∈A} w(i,j) · x_{i,j} − β · σ · Σ_{k∈K} c_k − γ · σ · Σ_{t∈T} d_t · z_t  (9)
where x_{i,j} indicates the existence of the arc (i,j) in the solution, w(i,j) is the cohesion of the words i and j (Equation 5), z_t indicates the existence of the 3-gram t in the solution, d_t is the relevance of the 3-gram t (Equation 8), c_k indicates the existence of a word with color (keyword) k in the solution, and σ is the geometric average of the arc weights in the graph (more details in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). Finally, we calculate the 50 best solutions according to the objective (9) and select the sentence with the lowest final score (Equation 10) as the best compression:
      </p>
      <p>
        score_norm(f) = e^(score_opt(f)) / ||f||  (10)
where score_opt(f) is the value of the path generating the compression f according to Equation 9. As Linhares Pontes et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we set the parameters to α = 1.0, β = 0.9 and γ = 0.1.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this paper, we presented our contributions to the MC2 CLEF 2017 lab in the Content Analysis task. We considered different scores for each microblog element (hashtags, usernames, and text) to retrieve, in four languages (en, es, fr, and pt), the Wikipedia's pages most related to a microblog. Then, we generated summaries using MSC from clusters initially made of the abstract of the top retrieved article and extended with similar sentences from the 3 top retrieved articles per language. Finally, we analyzed the summaries of each microblog obtained in the four languages to select the one most similar to the microblog; the kept summary is translated into the other languages (en, es, fr, and pt).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Lecture Notes in Computer Science</source>
          , vol.
          <volume>10456</volume>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research 3, 993-1022 (Mar</source>
          <year>2003</year>
          ), http://dl.acm.org/citation.cfm?id=944919.944937
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>arXiv preprint arXiv:1607.04606</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Punera</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Event Summarization using Tweets</article-title>
          .
          <source>In: 5th AAAI International Conference on Weblogs and Social Media (ICWSM)</source>
          .
          <source>Association for the Advancement of Artificial Intelligence</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Filippova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>: Multi-sentence compression: Finding shortest paths in word graphs</article-title>
          .
          <source>In: COLING</source>
          . pp.
          <volume>322</volume>
          -
          <issue>330</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Linhares</given-names>
            <surname>Pontes</surname>
          </string-name>
          , E., da Silva, T.G.,
          <string-name>
            <surname>Linhares</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torres-Moreno</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huet</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : Métodos de Otimização Combinatória Aplicados ao Problema de Compressão
          <source>MultiFrases</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Linhares</given-names>
            <surname>Pontes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Torres-Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.M.</given-names>
            ,
            <surname>Huet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Linhares</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.C.</surname>
          </string-name>
          :
          <article-title>Tweet contextualization using continuous space vectors: Automatic summarization of cultural documents</article-title>
          .
          <source>In: CLEF Workshop on Cultural Microblog Contextualization</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Graph-Based Multi-Tweet Summarization using Social Signals</article-title>
          . In: COLING. pp.
          <volume>1699</volume>
          -
          <issue>1714</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Shuyo</surname>
          </string-name>
          , N.:
          <article-title>Language detection library for Java</article-title>
          (
          <year>2010</year>
          ), http://code.google.com/p/language-detection/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>