<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IITH at CLEF 2017: Finding Relevant Tweets for Cultural Events</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sreekanth Madisetty</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maunendra Sankar Desarkar</string-name>
          <email>maunendra@iith.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of CSE, IIT Hyderabad</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Retrieving relevant tweets corresponding to cultural events can be used in various applications such as event reporting and event recommendation. This type of retrieval is challenging due to the short length of tweets, noise, out-of-vocabulary words, and abbreviations. In this paper, we focus on the problem of retrieving relevant tweets related to a given cultural event of a festival. We consider several factors, such as BM25, DFR, and the presence of the artist name, a relevant hashtag, or the festival name, for finding the relevance of tweets to the event. We apply the BM25 + DFR model to retrieve a candidate set of tweets related to each event of a festival. We find the top hashtags for each event by exploring the meta-attributes of the event. We re-rank the initial rank list from BM25 + DFR based on two strategies: the presence of the event meta-attributes (artist name, festival name, title, etc.) and the identified top hashtags in the tweet, and the timestamp of the event. We experimented on a subset of the CLEF 2017 cultural microblog contextualization dataset. The experimental results show that the proposed method is able to put relevant tweets at the top of the retrieval list.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        There are three tasks in the CLEF 2017 microblog cultural contextualization track,
namely, content analysis, microblog search, and timeline illustration [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A festivals
dataset is provided for all the tasks. The dataset contains 70,000,000 microblogs
collected between May 2015 and October 2016. We focus
on task 3, timeline illustration, which aims to retrieve all relevant tweets related
to each event of a festival provided in the topic queries [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Each topic is related
to one cultural event. Here, one event means one occurrence of a show
(theater, music, etc.). The same show can be performed on different days of the festival;
each such performance is treated as a separate event. Each topic is described by
meta-attributes such as id, festival name, title, artist (or band), start date, end date,
and venue. We apply BM25, BM25 + DFR, and several re-ranking methods for
retrieving the tweets of cultural events. Re-ranking by event timestamp gives
better performance, as most of the tweets are posted a few days before, during,
or after the event's scheduled time.
      </p>
      <p>The rest of the paper is organized as follows. Section 2 describes the problem
definition. Section 3 presents the details of the proposed method.
The runs submitted for evaluation are described in Section 4. We conclude the work
and provide directions for future research in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Problem Definition</title>
      <p>Here we briefly define the problem addressed in this paper: given an event
E of a festival with its meta-attributes (title, artist name, festival name, start
date, end date, and venue), retrieve all relevant tweets related to the event. Such
information is useful for attendees of festivals, for people who are interested in
knowing what happens in a festival, and for organizers to get feedback [4]. Each
event is represented as a topic. The following is an example of a topic for a
festival.</p>
      <preformat>&lt;topics&gt;
  &lt;topic&gt;
    &lt;id&gt; 1 &lt;/id&gt;
    &lt;title/&gt;
    &lt;artist&gt; Anna calvi &lt;/artist&gt;
    &lt;festival&gt; charrues &lt;/festival&gt;
    &lt;startdate&gt; 16/07/15 - 18:45 &lt;/startdate&gt;
    &lt;enddate&gt; 16/07/15 - 19:45 &lt;/enddate&gt;
    &lt;venue&gt; Kerouac &lt;/venue&gt;
  &lt;/topic&gt;
  ...
  ...
&lt;/topics&gt;</preformat>
      <p>In the topics dataset provided, the id attribute ranges from 1 to 664,
i.e., there are 664 topics in total for this task. The topic in the above
example specifies the live music show given by Anna Calvi at the Vieilles Charrues
2015 festival. The show has no specific title, so the title field is empty. The artist can
be a single artist, a list of artist names, or an orchestra name, as they appear in the
official programs of the festivals. The festival labels in the dataset are: charrues
for Vieilles Charrues 2015, transmusicales for Transmusicales 2015, avignon for
Avignon 2016, and edinburgh for Edinburgh 2016. Startdate is the start date and time
of the event, whereas enddate is its end date and time. The venue is
a string that corresponds to the name of the event location, as given in the official
program description.</p>
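      <p>As a minimal sketch of how such a topic could be read, the following Python snippet parses the example topic shown earlier using only the standard library. The Topic class and the parse_topics helper are hypothetical names introduced here for illustration, and the date format string is inferred from the single example, so it may not cover every topic in the dataset.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime

# Date layout inferred from the example topic ("16/07/15 - 18:45"); an assumption.
DATE_FORMAT = "%d/%m/%y - %H:%M"


@dataclass
class Topic:
    """Hypothetical container for the meta-attributes of one event."""
    topic_id: int
    title: str
    artist: str
    festival: str
    start: datetime
    end: datetime
    venue: str


def _text(node, tag):
    """Return the stripped text of a child element, or '' if empty or missing."""
    child = node.find(tag)
    if child is None or child.text is None:
        return ""
    return child.text.strip()


def parse_topics(xml_string):
    """Parse a <topics> document into a list of Topic objects."""
    root = ET.fromstring(xml_string)
    topics = []
    for node in root.iter("topic"):
        topics.append(Topic(
            topic_id=int(_text(node, "id")),
            title=_text(node, "title"),
            artist=_text(node, "artist"),
            festival=_text(node, "festival"),
            start=datetime.strptime(_text(node, "startdate"), DATE_FORMAT),
            end=datetime.strptime(_text(node, "enddate"), DATE_FORMAT),
            venue=_text(node, "venue"),
        ))
    return topics


EXAMPLE = """
<topics>
  <topic>
    <id> 1 </id>
    <title/>
    <artist> Anna calvi </artist>
    <festival> charrues </festival>
    <startdate> 16/07/15 - 18:45 </startdate>
    <enddate> 16/07/15 - 19:45 </enddate>
    <venue> Kerouac </venue>
  </topic>
</topics>
"""

topics = parse_topics(EXAMPLE)
```
]]></preformat>
      <p>An empty title element is mapped to an empty string here, so downstream matching code can treat a missing title and an empty title in the same way.</p>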
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>In this section, we describe our methodology to retrieve relevant tweets for given
cultural events. Our method has three phases: pre-processing
the tweets, identifying relevant tweets, and re-ranking the tweets. Each of these
phases is explained in detail in the following subsections.</p>
      <sec id="sec-3-1">
        <title>Pre-processing the tweets</title>
        <p>
          This is the first phase in our method. The dataset contains 70 million tweets [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
We have observed that discussions about an event often happen a few days before or
after the event's scheduled time. So, we filter the dataset based on the timestamps of
the festival before the indexing step. For an event, while retrieving the relevant
tweets, we consider only those tweets that were posted within 15 days of the
event. For example, if a certain event is happening on 15th June, then we consider
the tweets posted in the month of June. We assume that most of the tweets about
this event are posted in this period. Although there are tweets related to
this event that are posted outside this period, they are usually very few. This
filtered dataset is used for faster indexing and retrieval of the tweets.
        </p>
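        <p>The filtering step can be sketched as follows. This is an illustrative Python snippet, not the code used for the submitted runs: the helper names, the tweet identifiers, and the timestamps are hypothetical, and the sketch assumes that "within 15 days" means at most 15 days before or after the event start.</p>
        <preformat><![CDATA[
```python
from datetime import datetime, timedelta

# Window size taken from the text: tweets within 15 days of the event.
WINDOW = timedelta(days=15)


def within_window(tweet_time, event_time, window=WINDOW):
    """True if the tweet was posted at most `window` before or after the event."""
    return abs(tweet_time - event_time) <= window


def filter_tweets(tweets, event_time, window=WINDOW):
    """Keep only (tweet_id, created_at) pairs that fall inside the window."""
    return [(tid, ts) for tid, ts in tweets if within_window(ts, event_time, window)]


# Hypothetical event on 15th June and four made-up tweets.
event_start = datetime(2015, 6, 15, 20, 0)
tweets = [
    ("a", datetime(2015, 6, 1, 9, 30)),    # about 14.4 days before: kept
    ("b", datetime(2015, 6, 29, 23, 0)),   # about 14.1 days after: kept
    ("c", datetime(2015, 7, 20, 12, 0)),   # about 35 days after: dropped
    ("d", datetime(2015, 5, 20, 8, 0)),    # about 26 days before: dropped
]
kept = filter_tweets(tweets, event_start)
```
]]></preformat>
        <p>Applying such a filter before indexing keeps the index small, at the cost of missing the few relevant tweets posted outside the window.</p>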
      </sec>
      <sec id="sec-3-2">
        <title>Identifying relevant tweets</title>
        <p>This is the second phase in our method. We use two independent strategies to
retrieve relevant tweets for a festival.</p>
        <p>BM25: First, we apply the BM25 scoring mechanism (BM stands for Best Matching), also called
Okapi BM25 [7], to retrieve the tweets related to an event. It is
a probabilistic retrieval model. BM25 is not a single ranking
function but a family of scoring functions that differ slightly in their parameters
and components. The BM25 algorithm matches the tweet content against the given
query. The ranking function of BM25 is as follows.</p>
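        <p>
          <disp-formula><label>(1)</label><tex-math>\mathrm{Score} = \sum_{t \in q} \log\left(\frac{N}{df_t}\right) \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1\left((1 - b) + b \cdot \frac{dl}{avg\_dl}\right) + tf_{td}}</tex-math></disp-formula>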
where N is the total number of documents, df_t is the document frequency of term t,
tf_td is the frequency of term t in document d, dl is the document length, avg_dl is the average
document length in the whole collection, k1 is a tuning parameter that controls
the scaling of term frequency, and b is a tuning parameter that controls the scaling
of document length. We have used k1 = 1.2 and b = 0.75 in our experiments.
        </p>
        <p>
BM25+DFR: DFR (Divergence From Randomness) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is a
method in Information Retrieval to retrieve documents related to a given query.
There are three building blocks in this model: basic randomness model selection,
first normalization, and term frequency normalization.
        </p>
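        <p>Returning to the BM25 ranking function of Equation 1, the following Python sketch scores tokenized documents against a tokenized query. It assumes the standard Okapi form in which the length normalization uses dl divided by avg_dl, and it uses the parameter values reported above. The function name and the three toy documents are hypothetical and serve only as an illustration; the runs themselves relied on an IR platform rather than on code like this.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter

K1 = 1.2   # term-frequency scaling, value reported in the text
B = 0.75   # document-length scaling, value reported in the text


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_dl, k1=K1, b=B):
    """Score one tokenized document against a tokenized query with Equation 1."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for term in query_terms:
        df = doc_freq.get(term, 0)
        if df == 0 or tf[term] == 0:
            continue  # a term absent from the collection or document adds nothing
        idf = math.log(n_docs / df)
        length_norm = k1 * ((1 - b) + b * dl / avg_dl)
        score += idf * (k1 + 1) * tf[term] / (length_norm + tf[term])
    return score


# Toy collection of three tokenized "tweets" (illustrative data only).
docs = [
    ["anna", "calvi", "live", "at", "charrues"],
    ["great", "weather", "today"],
    ["charrues", "festival", "starts", "today"],
]
n_docs = len(docs)
avg_dl = sum(len(d) for d in docs) / n_docs
doc_freq = Counter(term for d in docs for term in set(d))

query = ["anna", "calvi", "charrues"]
scores = [bm25_score(query, d, doc_freq, n_docs, avg_dl) for d in docs]
```
]]></preformat>
        <p>In this toy collection the first document matches all three query terms and receives the highest score, while the second document shares no term with the query and scores zero.</p>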
        <p>
          The first step in this model is to select a basic randomness model to find the
weight of a term in the document. Possible randomness models include the
Poisson approximation of the binomial, the Divergence approximation of the binomial,
the Bose-Einstein distribution, the Geometric approximation of the Bose-Einstein,
the Inverse term frequency model, the Inverse document frequency model, and the Inverse expected
document frequency model [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The next step is to apply the first normalization. If
a rare term does not occur in the document, then the probability of that term
in the document is zero and it is less informative. On the other hand, if a rare
term occurs many times in the document, then the probability of that term is
high and it is more informative. The risk component for the term is used in this
step, similar to [6]. This risk component is multiplied by the weight of the term
described in the first step. Some of the first normalization techniques are the ratio
of two Bernoulli processes [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and Laplace's law of succession [3]. The last step is to
normalize the term frequencies. Document length is considered for this
normalization. The following equation describes the term frequency normalization.
          <disp-formula><label>(2)</label><tex-math>tf_{normalized} = tf \cdot \log\left(1 + c \cdot \frac{avg\_dl}{dl}\right)</tex-math></disp-formula>
where tf_normalized is the normalized term frequency, tf is the term frequency,
c is a parameter, avg_dl is the average document length, and dl is the
document length. Equation 2 is referred to as Normalization 2. BM25 can be computed
within the DFR model, so we use the DFR version of BM25 in this method and call
it BM25 + DFR. The components used in the DFR model in our method are the
Inverse document frequency model for randomness, Laplace succession for the
first normalization, and Normalization 2 for term frequency normalization. The
tweet list obtained from the BM25 + DFR algorithm is referred to as the initial
rank list. The score obtained after applying the BM25 + DFR model is denoted by
Score_BD. Next, we apply different re-ranking methods to re-rank the tweets
from the initial rank list.
        </p>
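        <p>The term frequency normalization of Equation 2 can be illustrated with a short Python sketch. The function name is hypothetical, the default c = 1.0 is an assumption because the text does not report the value used, and the natural logarithm is used here although DFR implementations often use base 2.</p>
        <preformat><![CDATA[
```python
import math


def normalization_2(tf, dl, avg_dl, c=1.0):
    """Length-normalized term frequency: tf * log(1 + c * avg_dl / dl)."""
    if dl <= 0:
        raise ValueError("document length must be positive")
    return tf * math.log(1.0 + c * avg_dl / dl)


# The same raw frequency counts for more in a short document than in a long one.
short_doc = normalization_2(tf=2, dl=5, avg_dl=10)
average_doc = normalization_2(tf=2, dl=10, avg_dl=10)
long_doc = normalization_2(tf=2, dl=20, avg_dl=10)
```
]]></preformat>
        <p>Because the ratio of average length to document length sits inside the logarithm, a term occurrence in a tweet shorter than average is weighted up, and one in a longer tweet is weighted down.</p>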
      </sec>
      <sec id="sec-3-3">
        <title>Re-ranking</title>
        <p>After pre-processing and identifying relevant tweets, the next phase is to re-rank
the tweets identified in Section 3.2. The following are the different
re-ranking mechanisms that we applied to the initial rank list.</p>
        <p>- Meta-Attributes: The meta-attributes of an event are artist name, festival
name, etc. We also compile a list of top hashtags for each event. Top
hashtags are obtained by computing the similarity between meta-attributes and
hashtags. This re-ranking strategy checks for the presence of the artist name,
the festival name, and any of the top hashtags in the tweet. Frequency
is also used along with the similarity features. Tweets that do not have any
of these features are ignored. For each tweet, we compute a meta score as
follows:</p>
        <p>
          <disp-formula><label>(3)</label><tex-math>Score_{m}(tweet) = \delta(\text{tweet contains artist name}) + \delta(\text{tweet contains festival name}) + \delta(\text{tweet contains any top hashtag})</tex-math></disp-formula>
Here δ is a boolean function that returns 1 if and only if the predicate is
true. We re-rank the tweets based on the combined score as:</p>
        <p>
          <disp-formula><label>(4)</label><tex-math>Score_{BDM}(tweet) = Score_{BD}(tweet) + Score_{m}(tweet)</tex-math></disp-formula>
where Score_BD(tweet) is the score obtained by the BM25 + DFR model.</p>
        <p>- Time: This method re-ranks the tweets based on the performance time
of the event in the festival. We observe that some events are repeated on
different days of the festival. The timestamp of the event helps to differentiate the tweets for repeated
events scheduled on different days.
Preference is given to the tweets whose creation time is close to the starting
time of the show. Each tweet is assigned a time-based score with respect to an event. If
t is the absolute time difference between the event start time and the tweet
creation time, then Score_time is computed as:</p>
        <p>
          <disp-formula><label>(5)</label><tex-math>Score_{time}(tweet) = \frac{\alpha^{t} + \beta}{1 + \beta}</tex-math></disp-formula>
In our setting, we set α = 0.9 and β = 0.1. The value of Score_time(tweet)
ranges between 0 and 1. The graph with different time values is shown
in Figure 1. If the absolute time difference between the event start time and the
tweet post time is small, then the time-based score of the tweet will be close
to 1. Here, we measure the time difference in hours. We use this function
to suppress the tweets whose creation time is far from the
event time and to give importance to the tweets whose creation
time is close to the event time. We wanted this function to be non-linear in
nature. While the time difference is small, the importance dampens quickly as the difference grows,
and after some time it stabilizes at a small but non-zero value. In
this way, tweets that are posted far from the event time are still considered
if their content appears to be relevant to the event, and they get a chance to
show up in the final ranking. We re-rank the tweets based on the combined
score of BM25 + DFR, meta-attributes, and timestamp as:</p>
        <p>
          <disp-formula><label>(6)</label><tex-math>Score_{final}(tweet) = Score_{BDM}(tweet) + Score_{time}(tweet)</tex-math></disp-formula>
        </p>
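        <p>The two re-ranking signals and their combination can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation behind the submitted runs. The time score assumes that Equation 5 reads (α^t + β) / (1 + β) with t in hours, which equals 1 at the event start and decays towards a small non-zero floor. The substring matching, the function names, the hashtag #VC2015, the candidate tweets, and their BM25 + DFR scores are all hypothetical.</p>
        <preformat><![CDATA[
```python
from datetime import datetime

ALPHA = 0.9  # decay base, value reported in the text
BETA = 0.1   # floor term, value reported in the text


def meta_score(text, artist, festival, top_hashtags):
    """Equation 3: count how many of the three meta-attribute checks hold."""
    lowered = text.lower()
    has_artist = artist.lower() in lowered
    has_festival = festival.lower() in lowered
    has_hashtag = any(tag.lower() in lowered for tag in top_hashtags)
    return int(has_artist) + int(has_festival) + int(has_hashtag)


def time_score(tweet_time, event_start, alpha=ALPHA, beta=BETA):
    """Assumed reading of Equation 5: (alpha ** t + beta) / (1 + beta), t in hours."""
    hours = abs((tweet_time - event_start).total_seconds()) / 3600.0
    return (alpha ** hours + beta) / (1.0 + beta)


def rerank(candidates, artist, festival, top_hashtags, event_start):
    """Sort (text, created_at, score_bd) tuples by the final score of Equation 6."""
    scored = []
    for text, created_at, score_bd in candidates:
        m = meta_score(text, artist, festival, top_hashtags)
        if m == 0:
            continue  # tweets without any meta-attribute feature are ignored
        final = score_bd + m + time_score(created_at, event_start)
        scored.append((final, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Made-up candidates: (tweet text, creation time, hypothetical BM25 + DFR score).
event_start = datetime(2015, 7, 16, 18, 45)
candidates = [
    ("Anna Calvi on stage now #VC2015", datetime(2015, 7, 16, 19, 0), 2.0),
    ("Loved Anna Calvi at charrues last week", datetime(2015, 7, 23, 10, 0), 2.0),
    ("Traffic is terrible today", datetime(2015, 7, 16, 18, 0), 2.5),
]
ranking = rerank(candidates, "Anna calvi", "charrues", ["#vc2015"], event_start)
```
]]></preformat>
        <p>In this toy run the first two tweets have the same content score and the same meta score, so the tweet posted during the show ranks above the one posted a week later, and the third tweet is dropped because it matches no meta-attribute.</p>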
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>
        The dataset consists of 70 million microblogs [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Each microblog has the
following attributes.
      </p>
      <p>- id: unique id of the microblog
- from_user: author of the tweet (screen name)
- from_userid: unique id of the author
- iso_language_code: language of the tweet (en, es, fr, pt)
- source: interface used for posting the tweet (frequent tags: Twitter Web
Client, iPhone, and Twitterfeed clients)
- profile_image_url: URL of the profile image
- wday: day of the week
- created_at: tweet creation date
- time_s: quantitative variable (integer)
- time_ord: quantitative variable (integer)
- content: tweet content</p>
      <p>The following is an example of a microblog.</p>
      <preformat>658495097328312321 loveethewayy 3318630764 en IFTTT
http://pbs.twimg.com/profile_images/633606614843392000/ZdnClHU8_normal.jpg
Mon 2015-10-26 14840 1445832440 https://t.co/zf4cIQhXru
Le Festival d'Avignon ou La passion théâtre https://t.co/Aqjglu4QhP
via JulienGue</preformat>
      <sec id="sec-4-1">
        <title>Runs Submitted</title>
        <p>We have used the Terrier IR platform [5] for indexing and retrieval of the tweets.
We have submitted the following runs for task 3, timeline illustration, in the CLEF
2017 microblog cultural contextualization track.</p>
        <p>- Baseline1: Only BM25 is used in this method.
- Baseline2: BM25 + DFR combined run to retrieve the top-ranked documents
for each event of the festival.
- Meta Attributes: This run uses the meta-attributes of the event,
such as artist name and festival name. Top hashtags are also used, as described in
Section 3.3.
- MetaAttributes+Time: This run is based on Baseline2 +
MetaAttributes + timestamp of an event, as described in Section 3.3.</p>
        <p>The precision results of BM25, BM25 + DFR, Meta Attributes, and
MetaAttributes+Time are shown in Figure 2. The number of queries used for
evaluation is 35. We observe that the precision values of Meta Attributes are higher than
those of the BM25 and BM25 + DFR models. The precision values of MetaAttributes+Time are
higher than those of all other methods. This shows the importance of the selected features,
namely meta-attributes and timestamp.</p>
        <p>We assume each event to be a pre-planned event that has context features
such as title, artist name, festival name, event start date, event end date, and venue.
We have employed a method to retrieve relevant tweets for different events as
part of the timeline illustration task of the CLEF 2017 microblog cultural
contextualization track. We have used four different methods in this task. The first method uses
content matching for retrieving the tweets. The second method is the DFR version
of BM25. The remaining methods are based on re-ranking the initial rank list. We
observe that the re-ranking method that combines meta-attributes and timestamp
performs best. For future work, we would like to explore other methods that
can further improve the recall.</p>
        <p>3. Feller, W.: An introduction to probability theory and its applications, vol. 2. John
Wiley &amp; Sons (2008)</p>
        <p>4. Leskovec, J., Backstrom, L., Kleinberg, J.: Meme-tracking and the dynamics of the
news cycle. In: Proceedings of the 15th ACM SIGKDD international conference on
Knowledge discovery and data mining. pp. 497–506. ACM (2009)</p>
        <p>5. Macdonald, C., McCreadie, R., Santos, R.L., Ounis, I.: From puppy to maturity:
Experiences in developing Terrier. Proc. of OSIR at SIGIR pp. 60–63 (2012)</p>
        <p>6. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval.
In: Proceedings of the 21st annual international ACM SIGIR conference on Research
and development in information retrieval. pp. 275–281. ACM (1998)</p>
        <p>7. Robertson, S.E., Walker, S., Jones, S., Hancock-Beaulieu, M.M., Gatford, M., et al.:
Okapi at TREC-3. NIST Special Publication SP 109, 109 (1995)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amati</surname>
          </string-name>
          , G.,
          <string-name>
            <surname>Van Rijsbergen</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>Probabilistic models of information retrieval based on measuring the divergence from randomness</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS)</source>
          <volume>20</volume>
          (
          <issue>4</issue>
          ),
          <fpage>357</fpage>
          –
          <lpage>389</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ermakova</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SanJuan</surname>
          </string-name>
          , E.:
          <article-title>CLEF 2017 microblog cultural contextualization lab overview</article-title>
          .
          <source>In: International Conference of the Cross-Language Evaluation Forum for European Languages Proceedings</source>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>