<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLEF 2017 Microblog Cultural Contextualization Content Analysis Task Overview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Liana Ermakova</string-name>
          <email>liana.ermakova@univ-lorraine.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josiane Mothe</string-name>
          <email>josiane.mothe@irit.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric SanJuan</string-name>
          <email>eric.sanjuan@univ-avignon.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IRIT, UMR5505 CNRS, ESPE, Université de Toulouse</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIA</institution>
          ,
          <addr-line>Université d'Avignon</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LISIS (UPEM, INRA, ESIEE, CNRS), Université de Lorraine</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The MC2 CLEF 2017 Content Analysis task deals with classification, filtering, language recognition, localization, entity extraction, linking open data, and summarization. Festivals have a large presence on social media. The resulting microblog stream and related URLs are well suited to experiments on advanced social media search and mining methods. For content analysis, topics were in any language and results were expected in four languages: English, Spanish, French, and Portuguese.</p>
      </abstract>
      <kwd-group>
        <kwd>Information retrieval</kwd>
        <kwd>Tweet contextualization</kwd>
        <kwd>Microblog analysis</kwd>
        <kwd>CLEF evaluation forum</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Microblog Contextualization was introduced as a Question Answering task at
INEX 2011 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The main idea was to help Twitter users understand a tweet by
providing some context associated with it. It has since evolved into a Focused IR
(Information Retrieval) task over Wikipedia [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The CLEF 2016 Cultural Microblog Contextualization Workshop considered
specific cultural Twitter feeds [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this restricted context, implicit localization
and language identification appeared to be important issues. It also required
identifying implicit timelines over long periods. The MC2 CLEF 2017 lab was
centered on Cultural Contextualization based on microblog feeds. It dealt
with how the cultural context of a microblog affects its social impact at large [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
This involved microblog search, classification, filtering, language recognition,
localization, entity extraction, linking open data, and summarization.
      </p>
      <p>Given a stream of microblogs, the task consists of:
1. filtering microblogs dealing with festivals;
2. language identification;
3. event localization;
4. author categorization (official account, participant, follower, or scam);
5. Wikipedia entity recognition and translation into four target languages:
English, Spanish, Portuguese, and French;
6. automatic summarization of linked Wikipedia pages in the four target
languages.</p>
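      <p>As an illustration only, the six outputs listed above could be gathered in one record per microblog. The sketch below is not part of the lab infrastructure: the class name, the field names, and the category labels are hypothetical, and only the four target languages and the four author categories come from the task description.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The four target languages and the author categories named in the task list.
TARGET_LANGUAGES = ("en", "es", "fr", "pt")
AUTHOR_CATEGORIES = ("official", "participant", "follower", "scam")


@dataclass
class MicroblogAnalysis:
    """Hypothetical container for the six subtask outputs of one microblog."""

    microblog_id: str
    about_festival: bool = False            # 1. filtering
    language: Optional[str] = None          # 2. language identification
    location: Optional[str] = None          # 3. event localization
    author_category: Optional[str] = None   # 4. author categorization
    # 5. Wikipedia page titles, keyed by target language code
    entities: Dict[str, List[str]] = field(default_factory=dict)
    # 6. one summary per target language code
    summaries: Dict[str, str] = field(default_factory=dict)

    def missing_languages(self) -> List[str]:
        """Target languages for which no summary has been produced yet."""
        return [lang for lang in TARGET_LANGUAGES if lang not in self.summaries]
```
]]></preformat>
      <p>Keeping each subtask in its own field mirrors the independent evaluation of the items, while the per-language dictionaries make it easy to see which target languages are still uncovered.</p>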
      <p>Each item is evaluated independently; however, language identification could
impact Wikipedia linking and the resulting summaries.</p>
      <p>In this paper, Section 2 describes the data used. Section 3 describes the
baselines and the state-of-the-art system. Section 4 describes participant approaches.
Finally, Section 5 draws some conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Data</title>
      <p>The MC2 Content Analysis 2017 task provides a set of 1,100 microblogs in 20
languages to be mapped to textual extracts from the English, Spanish, French,
and Portuguese Wikipedia.</p>
      <sec id="sec-2-1">
        <title>Wikipedia XML Corpus</title>
        <p>Wikipedia is under a Creative Commons license and its content can be used to
contextualize tweets or to build complex queries referring to Wikipedia entities.</p>
        <p>We have extracted an average of 10 million XML documents from Wikipedia
per year since 2012 in the four main Twitter languages: English (en), Spanish
(es), French (fr), and Portuguese (pt). The corpus and the tools to process it are
available on the Tweet Contextualization website (http://tc.talne.eu).</p>
        <p>These documents reproduce, in an easy-to-use XML structure, the content
of the main Wikipedia pages: title, abstract, sections, and subsections, as well as
Wikipedia internal links. Other content such as images, footnotes, and external
links is stripped out in order to obtain a corpus that is easier to process using
standard NLP (Natural Language Processing) tools.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Queries</title>
        <p>
          The query collection is a pool of 1,100 microblogs extracted from the microblog
stream presented at the CLEF 2016 workshop [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] (see also [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]). These microblogs
have more than 80 characters, do not contain URLs, and are written in
more than 20 different languages. The main languages are: en (60%), es (14%),
fr (5%), pt (4%), and it (2%). Other languages are: ja, de, nl, tr, id, ca, eu, zh, ru,
sr, pl, ko, and ar.
        </p>
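        <p>The two selection criteria stated above (more than 80 characters, no URL) can be sketched as a simple filter. This is a minimal illustration, not the script used to build the pool: the function names and the URL pattern are assumptions.</p>
        <preformat><![CDATA[
```python
import re

# Assumed URL pattern; the paper only states that selected microblogs
# "do not contain URLs", not how URLs were detected.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
MIN_LENGTH = 80  # the pool keeps microblogs with more than 80 characters


def is_candidate_query(text: str) -> bool:
    """Return True if the microblog is long enough and contains no URL."""
    return len(text) > MIN_LENGTH and URL_PATTERN.search(text) is None


def select_queries(stream):
    """Keep only the microblogs of an iterable that satisfy both criteria."""
    return [text for text in stream if is_candidate_query(text)]
```
]]></preformat>
        <p>Applied to a stream of raw microblog texts, select_queries returns the subset that would qualify as queries under these assumptions.</p>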
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Baselines and State-of-the-art System</title>
      <p>For each Wikipedia we provided an XML retrieval system powered by Indri, a
Perl API for the XML retrieval system using standard LWP (short for "Library
for WWW in Perl"), the corpus in a single XML file (gzip compression), and
the corpus split into 1,000 folders, one file per page (tgz archive). However, these
baselines provided neither text segmentation into sentences nor an automatic
summarization tool. They only allowed retrieval of XML elements based on nested
language models.</p>
      <p>Based on these resources, the available baselines are:
- filtering: based on the word "festival";
- language: based on Twitter local code;
- entity extraction: top-ranked Wikipedia page titles based on a document
language model;
- summarization: based on Wikipedia page abstracts.</p>
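      <p>The first two baselines are simple enough to sketch in a few lines. The code below is an assumption-laden illustration rather than the distributed baseline: it treats the keyword test as a case-insensitive substring match, and it assumes each microblog is a dictionary whose language code is stored under a "lang" key with "und" as the fallback for an undetermined language.</p>
      <preformat><![CDATA[
```python
def festival_filter_baseline(text: str) -> bool:
    """Keyword baseline: keep a microblog if it contains the word 'festival'.

    Assumption: a case-insensitive substring match, so that inflected or
    translated forms such as 'festivals' or 'festivales' are also kept.
    """
    return "festival" in text.casefold()


def language_baseline(microblog: dict, default: str = "und") -> str:
    """Language baseline: trust the language code attached to the microblog.

    Assumption: the code is stored under the hypothetical key 'lang'.
    """
    return microblog.get("lang", default)
```
]]></preformat>
      <p>Both functions ignore the content beyond a single cue, which is why they serve as lower bounds for the filtering and language identification subtasks.</p>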
      <p>
        A state-of-the-art contextualization system has also been used to generate
a complete run available to active participants. This reference system is based
on the Terrier platform (http://terrier.org/). Wikipedia pages in English, French, Spanish, and
Portuguese were stemmed with the Snowball stemmer. The pages retrieved by the
InL2 model with the Bo1 query expansion technique served as a baseline
for the entity recognition subtask. Then, documents were parsed with Stanford
CoreNLP (https://stanfordnlp.github.io/CoreNLP/) in order to perform sentence chunking and lemmatization. For the
automatic summarization subtask we used the following baselines:
- the first passage from the top-scored Wikipedia page;
- the cosine similarity between a tweet and a candidate sentence;
- word2vec similarity between a tweet and a candidate sentence [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ];
- the system based on local context analysis presented at CLEF 2015 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
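      <p>The cosine similarity baseline can be illustrated with plain term-frequency vectors. This is a minimal sketch under stated assumptions: the tokenizer, the absence of stemming or term weighting, and the function names are all choices made for the example, and the reference system may have represented tweets and sentences differently.</p>
      <preformat><![CDATA[
```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\w+")


def bag_of_words(text: str) -> Counter:
    """Lowercased term-frequency vector (no stemming, no stop-word removal)."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def rank_sentences(tweet: str, sentences):
    """Return (score, sentence) pairs sorted by decreasing similarity."""
    tweet_vector = bag_of_words(tweet)
    scored = [(cosine_similarity(tweet_vector, bag_of_words(s)), s)
              for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```
]]></preformat>
      <p>A summary can then be assembled from the top-ranked candidate sentences; swapping bag_of_words for averaged word embeddings would give a variant in the spirit of the word2vec baseline.</p>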
    </sec>
    <sec id="sec-4">
      <title>Participant Approaches</title>
      <p>
        Each item was evaluated independently; however, language identification
could impact Wikipedia linking and the resulting summaries. The filtering and
author categorization subtasks were inspired by the filtering and priority tasks
at RepLab 2014 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <sec id="sec-4-1">
        <title>Filtering and Opinion Mining</title>
        <p>One participant (LIA-FR) scored all microblogs by proximity to a festival
topic [10]. Opinion mining was not initially considered; however, one
participant (ISAMM-TN) did apply binary opinion classifiers [11]. It appeared that
microblog interestingness about festivals, as assessed by the organizers, mostly relies on
neutral microblogs because they are easier to understand without context.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Language Identification</title>
        <p>Language identification is challenging over short content that tends to mix
several languages. Indeed, festival names in tweets often appear in English but
the rest of the content can be in any other language. Moreover, festival attendees
tend to add terms from various dialects to highlight the local context.</p>
        <p>Using linguistic resources for the main languages, as Syllabs-FR did, made it possible to
reach the best precision scores [12]. However, based on statistical approaches,
the LIA-FR team identified 121 errors in microblog local information among the 1,100
[13]. After evaluation, it appeared that 90 of the 121 were true errors: 30%
about en, 20% about pt, 16% about es, and 10% about id. The remaining true errors
were about it, de, sh, fr, nl, ceb, ca, and sv.</p>
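        <p>The counts reported above can be turned into rates with a few lines of arithmetic. The sketch below only restates the published figures; reading the percentages as shares of the 90 confirmed errors is an assumption, since the text does not say explicitly which total they refer to.</p>
        <preformat><![CDATA[
```python
TOTAL_MICROBLOGS = 1100   # size of the query pool
FLAGGED = 121             # suspected errors in the language information
CONFIRMED = 90            # suspected errors confirmed after evaluation

# Share of flagged microblogs that turned out to be true errors.
flag_precision = CONFIRMED / FLAGGED
# Share of the whole pool whose language information was confirmed wrong.
confirmed_error_rate = CONFIRMED / TOTAL_MICROBLOGS

# Assumption: the reported percentages are shares of the confirmed errors.
reported_shares = {"en": 0.30, "pt": 0.20, "es": 0.16, "id": 0.10}
remaining_share = 1.0 - sum(reported_shares.values())

print(f"precision of the flags: {flag_precision:.1%}")
print(f"confirmed error rate:   {confirmed_error_rate:.1%}")
print(f"share left for the other languages: {remaining_share:.0%}")
```
]]></preformat>
        <p>Under these assumptions, roughly three quarters of the flagged microblogs were confirmed, which corresponds to about 8% of the pool, and about a quarter of the confirmed errors fall outside the four languages listed with a percentage.</p>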
      </sec>
      <sec id="sec-4-3">
        <title>Event Localization</title>
        <p>Event localization requires external resources. For large festivals, Wikipedia
often contains the information, which can be retrieved using state-of-the-art
QA (Question Answering) approaches. However, for small events it is necessary
to query the public web or social networks. The Syllabs-FR team managed to
localize festivals in France using public information [12].</p>
      </sec>
      <sec id="sec-4-4">
        <title>Entity Recognition and Automatic Summarization</title>
        <p>
          The two subtasks, Wikipedia Entity Recognition and Automatic Summarization,
refer to previous experiments around Tweet Contextualization [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The most
efficient methods proceed in two steps: 1) retrieve the most relevant Wikipedia
pages, 2) propose a multidocument summary of them. Wikifying tweets is
complex due to the lexical gap between tweets and Wikipedia pages. Extracting
summaries by aggregating sentences from pages looked easier; however, ensuring
and evaluating readability is an issue, especially for languages that have fewer
resources than English.
        </p>
        <p>The FELTS system managed to identify all Wikipedia page titles that
explicitly appear in the 1,100 microblogs for the four target languages [13]. Multiword
titles are often unambiguous. Of the 1,100 queries, 818 contained
explicit references to unambiguous Wikipedia pages in English, 536 in Spanish,
485 in French, and 459 in Portuguese. By considering the Wikipedia abstracts of
these pages, it was then possible to directly extract high-quality summaries
contextualizing almost half of the topics in the four target languages. This approach
has been scaled to process microblog streams in real time.</p>
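        <p>Spotting page titles that appear explicitly in a microblog can be sketched as a greedy longest-match lookup against a set of known titles. This is not the FELTS implementation, whose internals are not described here: the function name, the tokenizer, and the five-word window are assumptions made for the example.</p>
        <preformat><![CDATA[
```python
import re

TOKEN_PATTERN = re.compile(r"\w+")


def normalize(text: str) -> str:
    """Casefold and keep word tokens only, joined by single spaces."""
    return " ".join(TOKEN_PATTERN.findall(text.casefold()))


def find_titles(text: str, titles, max_words: int = 5):
    """Greedy longest-match lookup of known page titles in a microblog.

    At each position the longest window (up to max_words tokens) that
    equals a known title is kept, so a multiword title wins over the
    shorter titles it contains.
    """
    known = {normalize(title) for title in titles}
    tokens = normalize(text).split()
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in known:
                found.append(candidate)
                i += n  # skip past the matched title
                break
        else:
            i += 1  # no title starts at this token
    return found
```
]]></preformat>
        <p>Preferring the longest match reflects the observation that multiword titles are often unambiguous. With the reported counts, the share of queries containing such an explicit reference ranges from about 42% for Portuguese (459 of 1,100) to about 74% for English (818 of 1,100).</p>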
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Dealing with a massive multilingual, multicultural corpus of microblogs reveals
the limits of both statistical and linguistic approaches. It also requires linguistic
resources for each language or for specific cultural events. Therefore, language
and festival recognition appeared to be the key points of the overall MC2 CLEF
2017 lab official tasks.</p>
      <p>Researchers interested in using MC2 Lab data and infrastructure, but who
did not participate in the 2017 edition, can apply until March 2019 to get access
to the data and baseline system for their academic institution by contacting
eric.sanjuan@talne.eu. Once the application is accepted, they will get a
personal private login to gain access to lab resources for research purposes.
</p>
      <p>10. Linhares Pontes, E., Huet, S., Torres-Moreno, J.M., Carneiro Linhares, A.: . (2017)</p>
      <p>11. Ouertatani, A., Gasmi, G., Latiri, C.: Opinion polarity detection in Twitter data combining sequence mining and topic modeling. (2017)</p>
      <p>12. Hamon, O., Monnin, C., de Loupy, C.: Syllabs Team at CLEF MC2 Task 1: Content Analysis. (2017)</p>
      <p>13. Jourlin, P.: Entity Recognition and Language Identification with FELTS. (2017)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>SanJuan</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>Moriceau</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tannier</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
          </string-name>
          , J.:
          <article-title>Overview of the INEX 2011 question answering track (qa@inex)</article-title>
          . In Geva, S.,
          <string-name>
            <surname>Kamps</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schenkel</surname>
          </string-name>
          , R., eds.:
          <article-title>Focused Retrieval of Content and Structure, 10th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2011</article-title>
          , Saarbrucken, Germany, December 12-
          <issue>14</issue>
          ,
          <year>2011</year>
          ,
          <string-name>
            <given-names>Revised</given-names>
            <surname>Selected</surname>
          </string-name>
          <article-title>Papers</article-title>
          . Volume
          <volume>7424</volume>
          of Lecture Notes in Computer Science., Springer (
          <year>2011</year>
          )
          <volume>188</volume>
          {
          <fpage>206</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bellot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moriceau</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SanJuan</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>Tannier</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>INEX Tweet Contextualization task: Evaluation, results and lesson learned</article-title>
          .
          <source>Information Processing &amp; Management</source>
          <volume>52</volume>
          (
          <issue>5</issue>
          ) (
          <year>2016</year>
          )
          <fpage>801</fpage>
          –
          <lpage>819</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ermakova</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SanJuan</surname>
          </string-name>
          , E.:
          <article-title>Cultural micro-blog Contextualization 2016 Workshop Overview: data and pilot tasks</article-title>
          . In:
          <source>Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum</source>
          , Évora, Portugal, 5-8 September,
          <year>2016</year>
          . (
          <year>2016</year>
          )
          <fpage>1197</fpage>
          –
          <lpage>1200</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Murtagh</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Semantic mapping: Towards contextual and trend analysis of behaviours and practices</article-title>
          . In:
          <source>Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum</source>
          , Évora, Portugal, 5-8 September,
          <year>2016</year>
          . (
          <year>2016</year>
          )
          <fpage>1207</fpage>
          –
          <lpage>1225</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murtagh</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SanJuan</surname>
          </string-name>
          , E.:
          <article-title>Overview of the CLEF 2016 Cultural Micro-blog Contextualization Workshop</article-title>
          . In:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 7th International Conference of the CLEF Association, CLEF 2016</source>
          , Évora, Portugal, September 5-8,
          <year>2016</year>
          , Proceedings. (
          <year>2016</year>
          )
          <fpage>371</fpage>
          –
          <lpage>378</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macdonald</surname>
          </string-name>
          , C., eds.:
          <source>Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum</source>
          , Évora, Portugal, 5-8 September,
          <year>2016</year>
          . Volume
          <volume>1609</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deoras</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Povey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burget</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Černocký</surname>
          </string-name>
          , J.:
          <article-title>Strategies for training large scale neural network language models</article-title>
          .
          <source>2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, Proceedings</source>
          (
          <year>2011</year>
          )
          <fpage>196</fpage>
          –
          <lpage>201</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ermakova</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A Method for Short Message Contextualization: Experiments at CLEF/INEX</article-title>
          . In:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction: 6th International Conference of the CLEF Association, CLEF'15</source>
          , Toulouse, France, September 8-11,
          <year>2015</year>
          , Proceedings. Springer International Publishing, Cham (
          <year>2015</year>
          )
          <fpage>352</fpage>
          –
          <lpage>363</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Amigó</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , Carrillo de Albornoz, J.,
          <string-name>
            <surname>Chugur</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corujo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meij</surname>
            , E., de Rijke,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spina</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Overview of RepLab 2014: Author profiling and reputation dimensions for online reputation management</article-title>
          . In
          <string-name>
            <surname>Kanoulas</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>Lupu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanbury</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toms</surname>
          </string-name>
          , E.G., eds.:
          <source>Information Access Evaluation. Multilinguality, Multimodality, and Interaction - 5th International Conference of the CLEF Initiative, CLEF 2014</source>
          , Sheffield, UK, September 15-18,
          <year>2014</year>
          . Proceedings. Volume
          <volume>8685</volume>
          of Lecture Notes in Computer Science, Springer (
          <year>2014</year>
          )
          <fpage>307</fpage>
          –
          <lpage>322</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>