<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LIG at CLEF 2016 Cultural Microblog Contextualization: TimeLine Illustration based on Microblogs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nayanika DOGRA</string-name>
          <email>nayanika.dogra@e.ujfgrenoble.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philippe MULHEM</string-name>
          <email>philippe.mulhem@imag.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nawal OULD AMER</string-name>
          <email>nawal.ouldamer@imag.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorraine GOEURIOT</string-name>
          <email>lorraine.goeuriot@imag.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNRS LIG laboratory, MRIM group Grenoble</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UGA LIG laboratory, MRIM group Grenoble</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the approach used by the LIG-MRIM research group for its participation in task 3 (TimeLine illustration based on Microblogs) of the CLEF 2016 Cultural Microblog Contextualization track. This task deals with the retrieval of tweets related to cultural events (music festivals). For the content-based elements, we use the classical BM25 model [4]. Then, we diversify the results through duplicate removal, using tf-based representations of tweets. In a third step, we apply an optional re-ranking based on the timeline, or on the activity or popularity of the tweet authors.</p>
        <p>CCS Concepts: Information systems → Information retrieval</p>
      </abstract>
      <kwd-group>
        <kwd>tweet retrieval</kwd>
        <kwd>diversification</kwd>
        <kwd>reranking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The goal of the Timeline illustration based on Microblogs subtask<sup>1</sup> is to provide, for
each event of a cultural festival, the most interesting tweets. The subtask focuses on
two French music festivals (the “festival des vieilles charrues” and the
“Transmusicales”). The topics, selected by the task organizers, are all the live events
of one full day of each festival. Overall, 53 topics are evaluated for this subtask, and
the goal is to retrieve relevant and diverse tweets related to each event.
One example topic describes the show of Khun Narin's Electric that took place at the
Transmusicales on 03/12/15:</p>
      <p>&lt;topic&gt;
&lt;id&gt;1&lt;/id&gt;
&lt;title&gt;Khun Narin's Electric&lt;/title&gt;
&lt;festival&gt;Transmusicales&lt;/festival&gt;
&lt;begindate&gt;04/12/15-14:00&lt;/begindate&gt;
&lt;enddate&gt;04/12/15-16:30&lt;/enddate&gt;
&lt;/topic&gt;</p>
      <p>One of our goals for our participation in this retrieval task was to study the use of an
information retrieval document index as a basis for quasi-duplicate removal. Using
such an index avoids complex partial string inclusion processes and allows the use of
simpler overlap measures. Our overall approach is described in Figure 1, and
corresponds to the following organization of the paper. From the initial tweet set
provided for the task, we filter (pre-process) the potentially relevant tweets
as described in Section 2. Section 3 then presents the content-based retrieval.
In a second step, a diversification process is achieved through a simple
instance-based duplicate removal, as presented in Section 4. The re-ranking of the
diversified tweets, in Section 5, is then performed in three different ways:
timeline, tweet author activity, and tweet author popularity. We conclude this work in
Section 7.</p>
      <p>1 https://mc2.talne.eu/~cmc/spip/Tasks/task-3-timelineillustration-based-on-microblogs.html</p>
    </sec>
    <sec id="sec-2">
      <title>Pre-Processing of the official Tweet Corpus</title>
      <p>The official corpus contains the tweets crawled during the months of July and
December 2015.</p>
      <p>Before indexing the tweets and processing the queries, we filtered the dataset to work
on a subset of the official set of tweets provided. The filtering is based on timestamps,
corresponding to the dates of the festivals, and on text matching patterns (location or
festival name for instance). The resulting subset consists of 243,643 tweets.</p>
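      <p>A minimal sketch of such a filter is given below. It assumes that a tweet is kept only when both its timestamp and its text match; the date windows, the pattern, and the field names <monospace>created_at</monospace> and <monospace>text</monospace> are hypothetical placeholders, since the exact values used for the filtering are not given here.</p>
      <preformat>
```python
import re
from datetime import datetime

# Hypothetical date windows and name pattern (placeholders, not the values
# actually used to build the filtered corpus).
FESTIVAL_WINDOWS = [
    (datetime(2015, 7, 1), datetime(2015, 7, 31, 23, 59, 59)),
    (datetime(2015, 12, 1), datetime(2015, 12, 31, 23, 59, 59)),
]
FESTIVAL_PATTERN = re.compile(r"vieilles\s*charrues|transmusicales", re.IGNORECASE)


def in_festival_window(created_at):
    """Return True when the timestamp falls inside one of the date windows."""
    return any(created_at >= start and end >= created_at
               for start, end in FESTIVAL_WINDOWS)


def keep_tweet(tweet):
    """Assumption: a tweet must match both the date and the text pattern."""
    return (in_festival_window(tweet["created_at"])
            and bool(FESTIVAL_PATTERN.search(tweet["text"])))


def filter_corpus(tweets):
    """Return the subset of tweets that pass the filter, in input order."""
    return [t for t in tweets if keep_tweet(t)]
```
</preformat>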
    </sec>
    <sec id="sec-3">
      <title>Content-based Matching</title>
      <p>The content-based retrieval is a simple process that uses the topic as the query, each
query being matched against the documents of the filtered corpus described in Section 2. The
content-based retrieval uses the BM25 model [4].</p>
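      <p>For illustration, a textbook form of BM25 scoring can be sketched as follows. This is not the exact formula implemented by the retrieval system used for the runs: the idf variant, the value k1 = 1.2 and the function name <monospace>bm25_scores</monospace> are assumptions made for this sketch, and documents are assumed to be already tokenized.</p>
      <preformat>
```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query (textbook BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative idf variant (an assumption of this sketch).
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization controlled by b, saturation by k1.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores
```
</preformat>
      <p>With equal term frequencies, a shorter tweet receives a higher score than a longer one, which is the effect of the length normalization parameter b.</p>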
    </sec>
    <sec id="sec-4">
      <title>Diversification</title>
      <p>The second step of the query processing is dedicated to diversifying the results.
Several ways to diversify results have been proposed in the state of the art [1]. The authors of
[2] mention that most state-of-the-art diversification processes are
applied after a first retrieval step, which is also our approach here. In the
case of tweets, i.e., very short documents with very little content, we
chose to tackle this problem by removing duplicate tweets that correspond to
retweets<sup>2</sup>. In fact, our proposal does not limit the process to retweets but extends it to very
similar tweets (that contain retweeted tweets). Here we propose:
• to keep the original tweet t when t and its retweets are in the result list;
• to keep the most relevant retweet of a tweet t when several retweets are in
the result list but t is not retrieved.</p>
      <p>Contrary to what one might think, this approach is not equivalent to a flat clustering of the
tweets, as we define an iterative process that goes from the top results to the last ones.
To avoid storing the original tweets in addition to their index, this filtering is not
performed on the initial text of the tweets, but directly on their index, which contains the
(possibly stemmed) terms with their tf values. We then apply an overlap function to
the index entries of the compared tweets, together with a threshold above which the tweets are considered
similar. If two tweets are considered similar, we keep one of these duplicates as
described above. The expected result is then a short list of diverse tweets that describe
the event.</p>
      <p>2 One feature of Twitter is to allow users to “forward” (with or without alteration), or retweet,
received tweets.</p>
      <p>Re-ranking</p>
      <p>Once the diversified content-based results of step 4 are obtained, we explore several
ways of re-ranking them:
• No re-ranking (NO): the result of step 4 is directly given as the answer;
• Timeline re-ranking (TIM): the result is re-ranked according to the creation
date of the tweets. This kind of presentation allows the organizers of an
event to pinpoint when something happened.</p>
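      <p>The timeline re-ranking, together with the social-based ACT and POP re-rankings used by RUN6 and RUN7, could be sketched as follows. The field names (<monospace>created_at</monospace>, <monospace>author</monospace>, <monospace>text</monospace>) are hypothetical, and measuring activity as the number of corpus tweets written by an author is an assumption of this sketch; popularity is counted from @-mentions in the corpus.</p>
      <preformat>
```python
from collections import Counter
from datetime import datetime


def rerank_timeline(results):
    """TIM: order tweets by creation date, oldest first."""
    return sorted(results, key=lambda t: t["created_at"])


def rerank_activity(results, corpus):
    """ACT: order tweets by the number of corpus tweets written by their author
    (assumed activity measure)."""
    activity = Counter(t["author"] for t in corpus)
    return sorted(results, key=lambda t: activity[t["author"]], reverse=True)


def rerank_popularity(results, corpus):
    """POP: order tweets by how often their author is mentioned in the corpus."""
    mentions = Counter()
    for t in corpus:
        for token in t["text"].split():
            if token.startswith("@"):
                # Strip the leading @ and trailing punctuation of the mention.
                name = token.lstrip("@").rstrip(":,.!?").lower()
                mentions[name] += 1
    return sorted(results, key=lambda t: mentions[t["author"].lower()], reverse=True)
```
</preformat>
      <p>Because Python's <monospace>sorted</monospace> is stable, tweets with equal activity or popularity keep their previous (content-based) relative order.</p>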
    </sec>
    <sec id="sec-5">
      <title>Experimental results</title>
      <sec id="sec-5-1">
        <title>6.1 Parameter Settings</title>
        <p>All our submitted runs are computed on the filtered corpus. The content-based retrieval
uses the Terrier system [3], which implements BM25, with the default parameters
(stoplist, Porter stemming, b = 0.75).</p>
        <p>We tested three overlap coefficients:
(a) the Jaccard overlap coefficient,
(b) the Szymkiewicz-Simpson coefficient,
(c) the Sorensen-Dice coefficient.</p>
        <p>After some preliminary tests, the overlap threshold is fixed to 0.75 for coefficients
a and b, and to 0.8 for coefficient c. Because we do not
have evaluation results for our runs, we only discuss the number of results obtained
by these runs.</p>
      </sec>
      <sec id="sec-5-2">
        <title>6.2 Runs submitted</title>
        <p>In addition to the NO and TIM re-ranking options, we defined two social-based
re-ranking functions:
• ACT: this re-ranking function is related to the activity of a tweet author.
We assume that the more active an author is, the more interesting
their tweets are;
• POP: this re-ranking function is based on the popularity of a tweet author.
The underlying assumption is that the more an author is mentioned
in the tweets of the corpus, the more interesting their tweets are.</p>
        <p>We submitted the following 7 runs:</p>
        <p>• RUN1: the content-only run, obtained after step 1 of the query processing
described in Section 3. On average, each topic obtains a result list of 67
tweets;
• RUN2: the Jaccard coefficient diversified-only run, obtained as the result of
step 2 of the query processing described in Section 4. On average, each topic
obtains a result list of 36 tweets, so the diversification removes 45% of the results
of RUN1. Because the runs RUN5, RUN6 and RUN7 only reorder the
results, they have the same result sizes;
• RUN3: the Szymkiewicz-Simpson coefficient diversified-only run, obtained as
the result of step 2 of the query processing described in Section 4. On
average, each topic obtains a result list of 28 tweets, so the diversification
removes 59% of the results of RUN1;
• RUN4: the Sorensen-Dice coefficient diversified-only run, obtained
as the result of step 2 of the query processing described in Section 4. On
average, each topic obtains a result list of 42 tweets, so the diversification
removes 38% of the results of RUN1;
• RUN5: the results corresponding to the timeline re-ranking, TIM, as
described in Section 5;
• RUN6: the results corresponding to the social activity-based re-ranking,
ACT, as described in Section 5;
• RUN7: the results corresponding to the social popularity-based re-ranking,
POP, as described in Section 5.</p>
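        <p>The duplicate removal of Section 4, parameterized by the three overlap coefficients that distinguish RUN2, RUN3 and RUN4, can be sketched as follows. This sketch assumes set-based coefficients computed on the indexed terms of each tweet (the tf values are ignored), and the field names <monospace>terms</monospace> and <monospace>is_retweet</monospace> are hypothetical.</p>
        <preformat>
```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a or not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))


def szymkiewicz_simpson(a, b):
    """Size of the intersection divided by the size of the smaller set."""
    if not a or not b:
        return 0.0
    return len(a.intersection(b)) / min(len(a), len(b))


def sorensen_dice(a, b):
    """Twice the size of the intersection divided by the sum of the sizes."""
    if not a or not b:
        return 0.0
    return 2 * len(a.intersection(b)) / (len(a) + len(b))


def remove_duplicates(ranked, overlap, threshold):
    """Walk the ranked list from the top and keep one tweet per duplicate group."""
    kept = []
    for tweet in ranked:
        terms = set(tweet["terms"])
        # Index of the first kept tweet considered similar, if any.
        match = next((i for i, k in enumerate(kept)
                      if overlap(terms, set(k["terms"])) > threshold), None)
        if match is None:
            kept.append(tweet)
        elif kept[match]["is_retweet"] and not tweet["is_retweet"]:
            # Prefer the original tweet over a retweet of it.
            kept[match] = tweet
    return kept
```
</preformat>
        <p>Since the list is processed in rank order, the retweet kept for a group is the most relevant one, unless the original tweet is also retrieved, in which case the original replaces it.</p>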
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>Our participation in the subtask TimeLine illustration based on Microblogs of the
Cultural Microblog Contextualization Workshop allowed us to define a
comprehensive process for the retrieval of tweets. The pre-processing allows us to
focus on a subset of the whole official set of tweets provided for the task. The
content-based retrieval is a classical one. We used three variations of duplicate
removal (diversification) methods that take into account the specificity of tweets.
We applied three ways to re-rank the results in a third step of the query processing.</p>
      <p>The impact of the pre-processing of the original corpus should be measured in the
future, because it affects the content-based matching, but also the activity and
popularity values of tweet authors. Other variations of diversity algorithms also have
to be studied, taking into account the specificity of tweets (especially their length and
their metadata), as well as the choice of the tweet to keep among duplicates.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>3. In SIGIR'06 Workshop on Open Source Information Retrieval (OSIR'06),
2006.</p>
      <p>4. S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M.
Gatford. Okapi at TREC-3. In Overview of the Third Text Retrieval
Conference (TREC-3), pages 109-126. Gaithersburg, MD: NIST, January
1995.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gollapudi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halverson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ieong</surname>
          </string-name>
          .
          <article-title>Diversifying search results</article-title>
          .
          <source>In Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM '09</source>
          , pages
          <fpage>5</fpage>
          -
          <lpage>14</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Kuoman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tollari</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Detyniecki</surname>
          </string-name>
          .
          <article-title>Using tree of concepts and hierarchical reordering for diversity in image retrieval</article-title>
          .
          <source>In Content-Based Multimedia Indexing (CBMI)</source>
          ,
          <year>2013</year>
          11th International Workshop on, pages
          <fpage>251</fpage>
          -
          <lpage>256</lpage>
          ,
          <year>June 2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>