<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Queries Microblogs Reposts Authors Languages</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>CLEF 2017 MC2 Search and Timeline tasks Overview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorraine Goeuriot</string-name>
          <email>lorraine.goeuriot@imag.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philippe Mulhem</string-name>
          <email>Philipe.mulhem@imag.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric SanJuan</string-name>
          <email>eric.sanjuan@univ-avignon.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIA</institution>
          ,
          <addr-line>Universite d'Avignon</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ. Grenoble Alpes</institution>
          ,
          <addr-line>CNRS, Grenoble INP, LIG, F-38000 Grenoble</addr-line>
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <volume>472</volume>
      <issue>35</issue>
      <abstract>
        <p>The MC2 CLEF 2017 lab investigates the relationship between cultural microblogs and their social context. This involves microblog search, classification, filtering, language recognition, localization, entity extraction, linking open data, and summarization. The goal of the timeline illustration track is to study approaches that better retrieve microblogs issued during a cultural event, in order to get a glimpse of the attendees' perception. Regular Lab participants have access to the private massive multilingual microblog stream of The Festival Galleries project. Festivals have a large presence on social media. The topics were in four languages: Arabic, English, French and Spanish, and results were expected in any language.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The MC2 CLEF 2017 lab investigates the relationship between cultural microblogs
and their social context. This involves microblog search, classification,
filtering, language recognition, localization, entity extraction, linking open data, and
summarization. The goal of the timeline illustration track is to study approaches
that better retrieve microblogs issued during a cultural event in order to get a
glimpse of the attendees' perception.</p>
      <p>
        In 2016, the CLEF MC2 Workshop considered specific cultural Twitter feeds [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
In that setting, restricted context, implicit localization and language
identification appeared to be important issues. It also required identifying implicit
timelines over long periods. The MC2 CLEF 2017 lab has been centered on Cultural
Contextualization based on microblog feeds. It deals with how the cultural
context of a microblog affects its social impact at large [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The overall usage scenario for the lab has been centered on festival attendees:
        <list list-type="bullet">
          <list-item><p>an insider attendee who receives a microblog about a cultural event they
will take part in needs context to understand it (microblogs often
contain implicit information);</p></list-item>
          <list-item><p>a participant in a specific location wants to know what is going on in
surrounding events related to artists, music, or shows they would like to
see. Starting from a list of bookmarks in the Wikipedia app, the participant
looks for a short list of microblogs summarizing the current trends about
related cultural events. We hypothesize that the participant is more interested in
microblogs from insiders than from outsiders or officials.</p></list-item>
        </list>
      </p>
      <p>These scenarios lead to three tasks that lab participants could address: (1) Content
analysis, (2) Microblog search, (3) Timeline illustration.</p>
      <p>This paper describes the Timeline illustration task. The purpose of the task
is to provide a glimpse of the atmosphere of a festival. To do so, participants are
asked to retrieve, for each event within a festival (concerts, plays, etc.), all the
relevant tweets from the dataset.</p>
      <p>Section 2 describes the datasets provided to participants. The baselines are
presented in Section 3, participants' approaches and their evaluation are described
in Section 4, and conclusions are given in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Data</title>
      <p>The lab gave registered participants access to a massive collection of microblogs
and URLs related to cultural festivals around the world.</p>
      <p>A personal login was required to access the data. Once registered on CLEF,
each registered team can obtain up to 4 extra individual logins by writing to
admin@talne.eu. This collection is still accessible on demand. Any use of it
must cite the following paper: "L. Ermakova, L. Goeuriot,
J. Mothe, P. Mulhem, J.-Y. Nie, and E. SanJuan, CLEF 2017 Microblog
Cultural Contextualization Lab Overview, Proceedings of Experimental IR Meets
Multilinguality, Multimodality, and Interaction 8th International Conference of
the CLEF Association, CLEF 2017, LNCS 10439, Dublin, Ireland, September
11-14, 2017". Updates will be frequently posted on the lab website<sup>3</sup>.</p>
      <p>An Indri index with a web interface is available to query the whole set of
microblogs. Online Indri indexes are available in English, Spanish, French, and
Portuguese for Wikipedia search.</p>
      <sec id="sec-2-1">
        <title>Microblog Collection</title>
        <p>
          The document collection is an updated extension of the microblog stream
presented at the CLEF 2016 workshop [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] (see also [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]).
        </p>
        <p>It was provided to registered participants by the ANR GAFES project<sup>4</sup>. It
consists of a pool of more than 50M unique microblogs from different sources with
their meta-information as well as ground truth for the evaluation.</p>
        <p>The microblog collection contains a very large pool of public posts on
Twitter using the keyword "festival" that have been collected since June 2015. These
microblogs were collected using private archive services based on the streaming API.
The average number of unique microblog posts (i.e. without re-tweets) between
June and September is 2,616,008 per month. The total number of microblog posts
collected after one year (from May 2015 to May 2016) is 50,490,815
(24,684,975 without re-posts). These microblog posts are available online in
a relational database with associated fields.</p>
        <p>Because of privacy issues, they cannot be publicly released but can be
analyzed inside the organization that purchased these archives and among
collaborators under a privacy agreement. The MC2 lab provides this opportunity to
share this data amongst academic participants. These archives can be indexed and
analyzed, and general results acquired from them can be published without
restriction.</p>
        <p><sup>3</sup> https://mc2.talne.eu/lab/</p>
        <p><sup>4</sup> http://www.agence-nationale-recherche.fr/?Projet=ANR-14-CE24-0022</p>
      </sec>
      <sec id="sec-2-2">
        <title>Linked Web Pages</title>
        <p>66% of the collected microblog posts contain Twitter t.co compressed URLs.
Sometimes these URLs refer to other online services like adf.ly, cur.lv, dlvr.it,
ow.ly that hide the real URL. We used the spider mode of the GNU wget tool
to get the real URL; this process required multiple DNS requests.</p>
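        <p>The unshortening step described above can be sketched as follows. This is only an illustrative sketch, not the wget-based procedure used for the lab: the function resolve_url, the callable fetch_location and the redirect table are hypothetical names and made-up data introduced here so that the logic can be run without network access.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, List, Optional


def resolve_url(url: str,
                fetch_location: Callable[[str], Optional[str]],
                max_hops: int = 10) -> List[str]:
    """Follow a chain of redirects and return every URL visited, in order.

    fetch_location is a hypothetical callable: given a URL, it returns the
    redirect target, or None when the URL does not redirect any further.
    """
    chain = [url]
    seen = {url}
    for _ in range(max_hops):
        target = fetch_location(chain[-1])
        if target is None or target in seen:
            # Stop on a final page or on a redirect loop.
            break
        chain.append(target)
        seen.add(target)
    return chain


# Toy redirect table standing in for real HTTP responses (made-up URLs).
REDIRECTS = {
    "https://t.co/abc": "http://ow.ly/xyz",
    "http://ow.ly/xyz": "https://example.org/festival/program",
}

chain = resolve_url("https://t.co/abc", REDIRECTS.get)
final_url = chain[-1]
```
]]></preformat>
        <p>In a real setting, fetch_location would issue one HTTP request per hop, which is consistent with the multiple DNS requests mentioned above.</p>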
        <p>The number of unique uncompressed URLs collected in one year is 11,580,788
from 641,042 distinct domains.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Topics for Microblog Search</title>
        <p>Given a cultural query about festivals in Arabic, English, French, or Spanish,
the task is to search for the 64 most relevant microblogs in a collection covering
18 months of news about festivals in all languages.</p>
        <p>Queries have been extracted from resources suggested by participants.</p>
        <p>
          Arabic and English queries were extracted from the Arab Spring Microblog
corpus [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. We considered the content of all the tweets dealing with festivals
during the Arab Spring period. The task consisted in searching for traces of
these festivals or artists in the lab corpus two years after this period. The usual
case was to follow up artists involved in the Arab Spring festivals two or three
years later. There were 71 topics in Arabic and 81 in English, with an average of 10
tokens per topic and without URLs.
        </p>
        <p>
          French queries were extracted from the Vodkaster Micro Film Reviews [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
Vodkaster is a French social network about films. Users can post and share micro
reviews in French about movies as they watch them. There were 233 topics in
French with an average of 22 words per topic.
        </p>
        <p>Spanish queries are a representative sample of sentences dealing with festivals
from the Mexican newspaper La Jornada<sup>5</sup>. We considered all the sentences from
the newspaper mentioning a festival and extracted a random sample from this
pool. These were well-formed sentences that were easy to analyze but much
harder to contextualize. There were 142 topics in Spanish with an average of 25
words per topic.</p>
        <p><sup>5</sup> http://www.jornada.unam.mx</p>
      </sec>
      <sec id="sec-2-4">
        <title>Topics for Timeline Illustration</title>
        <p>The goal of this task is to retrieve all relevant tweets dedicated to each event of
a festival according to the program provided. In this case, we were looking at a
kind of "total recall" retrieval based on the initial artists' names and the names,
dates, and times of shows.</p>
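        <p>For this task, we focused on four festivals (two French music festivals, one
French theater festival and one British theater festival):
          <list list-type="bullet">
            <list-item><p>Vieilles Charrues (2015),</p></list-item>
            <list-item><p>Transmusicales (2015),</p></list-item>
            <list-item><p>Avignon (2016),</p></list-item>
            <list-item><p>Edinburgh (2016).</p></list-item>
          </list>
        </p>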
        <p>Each topic was related to one cultural event. In our terminology, one event
is one occurrence of a show (theater, music, ...). Several occurrences of the same
show correspond then to several events (e.g. plays can be presented several times
during theater festivals). More precisely, one topic is described by: one ID, one
festival name, one title, one artist (or band) name, one time slot (date/time
begin and end), and one venue location.</p>
        <p>Participants were required to use the full dataset to conduct their
experiments.</p>
        <p>The runs were expected to follow the classical TREC top files format. Only
the top 1000 results for each query must be given. Each retrieved document
is identified by its tweet ID. The evaluation was carried out on a subset of the
full set of topics according to the richness of the results obtained. The official
evaluation measures were interpolated at a precision of 1% and recall values at
5%, 10%, 25%, 50% and 100%.</p>
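        <p>As an illustration of the submission format, the sketch below writes a run in the usual six-column TREC layout (topic ID, the literal Q0, document ID, rank, score, run tag), keeping at most 1000 tweets per topic. The function name format_run, the topic ID, the tweet IDs, the scores and the run tag are made up for the example and are not taken from the lab data.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Tuple

MAX_RESULTS = 1000  # only the top 1000 results per topic are kept


def format_run(results: Dict[str, List[Tuple[str, float]]],
               run_tag: str) -> List[str]:
    """Return the lines of a TREC-style run: topic Q0 tweet_id rank score tag."""
    lines = []
    for topic_id in sorted(results):
        # Highest score first; rank numbering starts at 1.
        ranked = sorted(results[topic_id], key=lambda pair: pair[1], reverse=True)
        for rank, (tweet_id, score) in enumerate(ranked[:MAX_RESULTS], start=1):
            lines.append(f"{topic_id} Q0 {tweet_id} {rank} {score:.4f} {run_tag}")
    return lines


# Hypothetical topic with two scored tweets.
example = {"T1": [("620001", 7.25), ("620002", 9.5)]}
run_lines = format_run(example, "demo_run")
```
]]></preformat>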
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Baselines</title>
      <sec id="sec-3-1">
        <title>Microblog Search Task</title>
        <p>A language model index powered by Indri and accessible through a web API has
been provided. To deal with reposts, each document groups all the posts of one
user, including his/her reposts. Each document has an XML structure (cf.
Fig. 1). Fig. 2 gives an example of such an XML document.</p>
        <p>This XML structure makes it possible to work with complex queries such as:</p>
        <preformat>#combine[m](Instagram.c es.l #1(2016 05).d conducción
  #syn(pregoneros pregonero) #syn(festivales festival))</preformat>
        <p>This query looks for microblogs ([m]) posted from Instagram (.c) using the
Spanish locale (.l) in May 2016 (.d) dealing with pregonero(s) and festival(es).</p>
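        <p>A query of this shape can be assembled programmatically. The helper names syn and build_query below are hypothetical and only concatenate strings; the sketch assumes nothing about the lab's web API beyond the query syntax of the example above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Iterable, List


def syn(*variants: str) -> str:
    """Group spelling or number variants under Indri's #syn operator."""
    return "#syn(" + " ".join(variants) + ")"


def build_query(client: str, locale: str, year: str, month: str,
                terms: Iterable[str]) -> str:
    """Build a #combine query restricted to the microblog field [m]."""
    parts: List[str] = [
        f"{client}.c",             # posting client, matched in field c
        f"{locale}.l",             # language code, matched in field l
        f"#1({year} {month}).d",   # exact phrase "year month" in field d
    ]
    parts.extend(terms)
    return "#combine[m](" + " ".join(parts) + ")"


query = build_query(
    "Instagram", "es", "2016", "05",
    ["conducción", syn("pregoneros", "pregonero"), syn("festivales", "festival")],
)
```
]]></preformat>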
        <preformat>&lt;!ELEMENT xml (f, m)+&gt;
&lt;!ELEMENT f (#user_id)&gt;
&lt;!ELEMENT m (i, u, l, c, d, t)&gt;
&lt;!ELEMENT i (#microblog_id)&gt;
&lt;!ELEMENT u (#user)&gt;
&lt;!ELEMENT l (#ISO_language_code)&gt;
&lt;!ELEMENT c (#client)&gt;
&lt;!ELEMENT d (#date)&gt;
&lt;!ELEMENT t (#PCDATA)&gt;</preformat>
        <p>Fig. 1. XML structure of the indexed documents.</p>
        <p>For each set of topics, two sets of queries were generated: one retrieved
authors with all their posts, the other focused on the posts themselves. For
English, Spanish and French, no preprocessing or stop word list was applied.
This resulted in long queries that were slow to process, especially in the case
of Focus retrieval. For Arabic, a stop word list was applied, which improved
efficiency. Table 1 provides the statistics about authors and microblogs retrieved
using this baseline index powered by Indri over plain bag-of-words queries with
a language model and the default Dirichlet smoothing. Average numbers per query are
given in parentheses.</p>
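        <p>To make the document structure concrete, the following sketch parses a small document that follows the element names of the structure given for the index (f, m, i, u, l, c, d, t). The sample content, including the user IDs, the date layout and the texts, is invented for illustration and does not come from the collection; the function name parse_microblogs is likewise hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET

# Readable names for the one-letter child elements of <m>.
FIELD_NAMES = {
    "i": "microblog_id",
    "u": "user",
    "l": "language",
    "c": "client",
    "d": "date",
    "t": "text",
}

# Invented sample: one <f> (user id) followed by one <m> (microblog), twice.
SAMPLE = """
<xml>
  <f>1001</f>
  <m><i>501</i><u>alice</u><l>es</l><c>Instagram</c><d>2016 05 14</d><t>Gran festival en la plaza</t></m>
  <f>1002</f>
  <m><i>502</i><u>bob</u><l>fr</l><c>Twitter</c><d>2016 07 10</d><t>Superbe soir au festival</t></m>
</xml>
"""


def parse_microblogs(xml_text):
    """Return one dict per <m> element, tagged with the preceding <f> user id."""
    root = ET.fromstring(xml_text.strip())
    records = []
    current_user_id = None
    for child in root:
        if child.tag == "f":
            current_user_id = (child.text or "").strip()
        elif child.tag == "m":
            record = {"user_id": current_user_id}
            for tag, name in FIELD_NAMES.items():
                record[name] = (child.findtext(tag) or "").strip()
            records.append(record)
    return records


records = parse_microblogs(SAMPLE)
```
]]></preformat>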
        <p>Arabic and English topics are tweets about festivals during the Arab Spring
period. Although they are comparable, English topics cover a larger number of
reposts and a wider range of languages. Arabic tweets are also posted by a smaller
number of authors. French topics led to the extraction of an even greater number
of noisy reposts than English ones. This is because the majority are micro
reviews of films, and some of them refer to the Cannes Festival, which generates
a massive number of tweets in French. Finally, the Spanish corpus, which is
composed of sentences from news about festivals, appears to be the most
specific since it encompasses a smaller number of different languages.</p>
        <p>
          The timeline illustration task provided a baseline based on the Terrier system [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
The indexed microblogs were filtered based on the tweets'
timestamp (which must correspond to the dates of the festivals) and on text matching
patterns (location or festival name, for instance). The resulting subset consists of
243,643 tweets. We chose to keep the entire text of the initial tweets: we removed
the '@' and '#' characters, and used a classical stoplisting process and the Porter
stemmer.
        </p>
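        <p>The filtering and text normalization steps can be sketched as below. This is a simplified illustration under stated assumptions: the stoplist is a tiny made-up one, stemming is left out (the baseline itself uses the Porter stemmer), and the tweets, the date window and the pattern are invented examples. The names in_scope and normalize are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from datetime import datetime
from typing import Dict, List

# Tiny illustrative stoplist; a real run would use a full stoplist.
STOPWORDS = {"the", "a", "at", "of", "in", "is"}


def in_scope(tweet: Dict, start: datetime, end: datetime, pattern) -> bool:
    """Keep a tweet posted inside the window whose text matches the pattern."""
    return start <= tweet["time"] <= end and pattern.search(tweet["text"]) is not None


def normalize(text: str) -> List[str]:
    """Drop '@' and '#', lowercase, tokenize, and remove stopwords (no stemming)."""
    text = text.replace("@", "").replace("#", "").lower()
    tokens = re.findall(r"\w+", text)
    return [tok for tok in tokens if tok not in STOPWORDS]


# Invented tweets and an illustrative festival window.
tweets = [
    {"time": datetime(2016, 7, 10, 21, 0),
     "text": "Amazing night at the #Avignon festival with @troupe"},
    {"time": datetime(2016, 7, 10, 21, 5), "text": "Traffic is terrible today"},
    {"time": datetime(2015, 1, 1, 12, 0), "text": "Avignon in winter"},
]
start = datetime(2016, 7, 6)
end = datetime(2016, 7, 24, 23, 59)
pattern = re.compile(r"avignon", re.IGNORECASE)

kept = [t for t in tweets if in_scope(t, start, end, pattern)]
tokens = normalize(kept[0]["text"])
```
]]></preformat>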
        <p>The content-based retrieval uses the BM25 model with the default
parameters (stoplist, Porter stemming, b = 0.75).</p>
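        <p>For readers unfamiliar with BM25, a minimal textbook version is sketched below. It is not Terrier's implementation: the idf variant and the value k1 = 1.2 are assumptions made here, only b = 0.75 comes from the text, and the three token lists are toy documents.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: b controls how strongly long documents are penalized.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Toy documents, already tokenized.
docs = [
    ["avignon", "festival", "theatre"],
    ["festival", "music", "night"],
    ["rain", "today"],
]
scores = bm25_scores(["avignon", "festival"], docs)
```
]]></preformat>
        <p>With these toy documents, the first document matches both query terms and therefore ranks above the second, while the third receives a score of zero.</p>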
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Participant approaches and evaluation</title>
      <p>
        For Multilingual Microblog Search, we applied the same methodology based on
textual references instead of document qrels. Seven trilingual annotators (who
together were fluent in 13 different languages: Arabic, Hebrew, Basque,
Catalan, Mandarin Chinese, English, French, German, Italian, Portuguese, Russian,
Spanish and Turkish) produced an initial textual reference. This reference was
extended to Korean, Japanese and Persian using Google Translate. However,
this automatic extension proved to be noisy and had to be dropped from
the reference. Only results in one of the assessors' languages could therefore be
evaluated. Final textual references to evaluate microblog search run informativity
will be presented at the CLEF 2017 lab sessions in Dublin. Informativity is
defined and computed based on [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>For Timeline Illustration, it was anticipated that re-tweets would be excluded
from the pools. However, because the task was recall-oriented, participants
returned all re-tweets. Excluding re-tweets would have disqualified recall-oriented
runs that missed one original tweet. Moreover, it emerged during the evaluation
that re-tweets are often more interesting than original ones. Indeed, original
tweets are often posted by festival organizers, whereas reposts by individuals
are more informative about attendees' participation in the festival.</p>
      <p>Therefore, building a set of document qrels for timeline illustration was a
two-step process.</p>
      <p>Firstly, tweet relevance on original tweets from baselines (each participant
was asked to provide a baseline) was assessed on a 3-level scale:
        <list list-type="bullet">
          <list-item><p>Not relevant: the tweet is not related to the topic;</p></list-item>
          <list-item><p>Partially relevant: the tweet is somehow related to the topic (e.g. the tweet
is related to the artist, song, or play but not to the event, or is related to a
similar event with no possible way to check if they are the same);</p></list-item>
          <list-item><p>Relevant: the tweet is related to the event.</p></list-item>
        </list>
      </p>
      <p>Secondly, the qrels were expanded to any microblog containing the text of a
tweet previously assessed as relevant. In this manner, the qrels were expanded
to all reposts. Participant runs were then ranked using the trec_eval
program provided by NIST TREC<sup>6</sup>. All measures were provided since they lead to
different rankings.</p>
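        <p>The expansion step can be illustrated with the sketch below, which assumes plain substring matching between the text of an assessed tweet and the text of every other microblog. The exact matching rule used by the organizers is not specified here, and the function name expand_qrels, the tweet IDs and the texts are invented.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Set


def expand_qrels(relevant_ids: Set[str], collection: Dict[str, str]) -> Set[str]:
    """Add every microblog whose text contains the text of a relevant tweet."""
    relevant_texts = [collection[i] for i in relevant_ids if i in collection]
    expanded = set(relevant_ids)
    for tweet_id, text in collection.items():
        # A repost typically embeds the original text after an "RT @user:" prefix.
        if any(rel_text in text for rel_text in relevant_texts):
            expanded.add(tweet_id)
    return expanded


# Invented collection: tweet 2 is a repost of tweet 1.
collection = {
    "1": "Great show by the troupe tonight",
    "2": "RT @fan: Great show by the troupe tonight",
    "3": "Stuck in traffic",
}
expanded = expand_qrels({"1"}, collection)
```
]]></preformat>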
      <p>Five teams participated in the multilingual microblog search task, but none managed
to process the four sets of queries. All teams were able to process the English
set. Three of them managed to process French queries, one also processed Arabic
queries and another Spanish queries. Building reliable multilingual stop word
lists was a major issue and required linguistic expertise. Four teams participated
in the timeline illustration task, but only one outperformed the BM25 baseline.
The main issue was to identify microblogs related to one of the four festivals
chosen by the organizers. This selection could not be based solely on festival names
since some relevant microblogs did not include the festival hashtag. Nor could it
be based on the dates of the festival, since microblogs about videos posted by
festivals after the event were considered relevant.</p>
      <p>The most effective approaches have been:
        <list list-type="bullet">
          <list-item><p>Microblog Search: LIPAH, based on LDA query reformulation for the Language
Model;</p></list-item>
          <list-item><p>Timeline Illustration: IITH, using BM25 and DRF based on the artist's name,
the festival name and the top hashtags of each of the events' features.</p></list-item>
        </list>
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Dealing with a massive multilingual multicultural corpus of microblogs reveals
the limits of both statistical and linguistic approaches. Raw UTF-8 text needs to
be indexed without chunking. Synonyms and ambiguous terms over multiple
languages have to be managed at query level. This requires positional indexes;
however, the use of UTF-8 encoding makes them slow. It also requires linguistic
resources for each language or for specific cultural events. Therefore, language
and festival recognition appeared to be the key points of MC2 CLEF 2017's
official tasks.</p>
      <p>The CLEF 2017 MC2 lab also evolved from a regular IR evaluation task into
a search for new tasks. Almost all participants used the data and infrastructure to deal
with topics that were beyond the initial scope of the lab. For example:
        <list list-type="bullet">
          <list-item><p>the LSIS-EJCAM team used this data to analyze the role of social media in
propagating controversies,</p></list-item>
          <list-item><p>the My Local Influence and U3ICM team experimented with using sociological
needs to characterize profiles and contents for Microblog search.</p></list-item>
        </list>
      </p>
      <p>Researchers interested in using MC2 Lab data and infrastructure, but who
did not participate in the 2017 edition, can apply until March 2019 to get access
to the data and baseline system for their academic institution by contacting
eric.sanjuan@talne.eu. Once the application is accepted, they will get a
personal private login to access lab resources for research purposes.</p>
      <p><sup>6</sup> http://trec.nist.gov/trec_eval/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murtagh</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SanJuan</surname>
          </string-name>
          , E.:
          <article-title>Overview of the CLEF 2016 Cultural Micro-blog Contextualization Workshop</article-title>
          . In: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 7th
          <source>International Conference of the CLEF Association, CLEF</source>
          <year>2016</year>
          , Evora, Portugal, September 5-8
          ,
          <year>2016</year>
          , Proceedings. (
          <year>2016</year>
          )
          <fpage>371</fpage>
          -
          <lpage>378</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Murtagh</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Semantic mapping: Towards contextual and trend analysis of behaviours and practices</article-title>
          . In: Working Notes of CLEF 2016 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Evora, Portugal,
          5-8 September,
          <year>2016</year>
          . (
          <year>2016</year>
          )
          <fpage>1207</fpage>
          -
          <lpage>1225</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macdonald</surname>
          </string-name>
          , C., eds.: Working Notes of CLEF 2016 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Evora, Portugal,
          5-8 September,
          <year>2016</year>
          . Volume 1609 of CEUR Workshop Proceedings., CEUR-WS.org (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Features Extraction To Improve Comparable Tweet corpora Building,
          <source>JADT</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cossu</surname>
            ,
            <given-names>J.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaillard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juan-Manuel</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>El Beze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Contextualisation de messages courts :l'importance des metadonnees</article-title>
          .
          <source>In: EGC'2013 13e Conference Francophone sur l'Extraction et la Gestion des connaissances</source>
          , Toulouse, France (
          <year>January 2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ounis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amati</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plachouras</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lioma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Terrier: A High Performance and Scalable Information Retrieval Platform</article-title>
          . In: SIGIR'06 Workshop on Open Source Information Retrieval,
          <source>(OSIR'06)</source>
          . (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Bellot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moriceau</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SanJuan</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>Tannier</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>INEX Tweet Contextualization task: Evaluation, results and lesson learned</article-title>
          .
          <source>Information Processing &amp; Management</source>
          <volume>52</volume>
          (
          <issue>5</issue>
          ) (
          <year>2016</year>
          )
          <fpage>801</fpage>
          -
          <lpage>819</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>