=Paper=
{{Paper
|id=Vol-1737/T2-6
|storemode=property
|title=Using Relevancer to Detect Relevant Tweets: The Nepal Earthquake Case
|pdfUrl=https://ceur-ws.org/Vol-1737/T2-6.pdf
|volume=Vol-1737
|authors=Ali Hürriyetoğlu,Antal van den Bosch,Nelleke Oostdijk
|dblpUrl=https://dblp.org/rec/conf/fire/HurriyetogluBO16
}}
==Using Relevancer to Detect Relevant Tweets: The Nepal Earthquake Case==
Ali Hürriyetoğlu, Antal van den Bosch, Nelleke Oostdijk
Centre for Language Studies, Radboud University
P.O. Box 9103, NL-6500 HD, Nijmegen, the Netherlands
a.hurriyetoglu@let.ru.nl, a.vandenbosch@let.ru.nl, n.oostdijk@let.ru.nl

1. INTRODUCTION

In this working note we describe our submission to the FIRE 2016 Microblog track Information Extraction from Microblogs Posted during Disasters [1]. The task in this track was to extract all relevant tweets pertaining to seven given topics from a set of tweets. The tweet set was collected using key terms related to the Nepal earthquake (https://en.wikipedia.org/wiki/April_2015_Nepal_earthquake).

Our submission is based on a semi-automatic approach in which we used Relevancer, a complete analysis pipeline designed for analyzing a tweet collection. The main analysis steps supported by Relevancer are (1) preprocessing the tweets, (2) clustering them, (3) manually labeling the coherent clusters, and (4) creating a classifier that can be used for classifying tweets that were not placed in any coherent cluster, and for classifying new (i.e. previously unseen) tweets using the labels defined in step (3).

The data and the system are described in more detail in Sections 2 and 3, respectively.

2. DATA

At the time of download (August 3, 2016), 49,660 tweet IDs were available out of the 50,068 tweet IDs provided for this task. The missing tweets had been deleted by the people who originally posted them. We used only the English tweets, 48,679 tweets in all, based on the language tag provided by the Twitter API. Tweets in this data set had already been deduplicated by the task organisation team as much as possible.

The final tweet collection contains tweets that were posted between April 25, 2015 and May 10, 2015. The daily distribution of the tweets is visualized in Figure 1.

[Figure 1: Temporal distribution of the tweets (tweet count per day, late April to May 10, 2015)]

3. SYSTEM OVERVIEW

The typical analysis steps of Relevancer were applied to the data provided for this task. The current focus of the Relevancer tool is on the text and the posting date of a tweet. Relevancer aims at discovering and distinguishing between the different topically coherent information threads in a tweet collection [3, 2]. Tweets are clustered such that each cluster represents an information thread, and the clusters can be used to train a classifier. Each step of the analysis process is described in some detail in the following subsections (see http://relevancer.science.ru.nl and https://bitbucket.org/hurrial/relevancer for further details).

3.1 Normalisation

Normalisation starts with converting user names and URLs that occur in the tweet text to the dummy values 'usrusrusr' and 'urlurlurl' respectively.

After inspection of the data, we decided to normalise a number of additional phenomena. First, we removed certain automatically generated parts at the beginning and at the end of a tweet text. We determined these manually, e.g. 'live updates:', 'I posted 10 photos on Facebook in the album' and 'via usrusrusr'. After that, words that end in '...' were removed as well. These words are mostly incomplete due to the length restriction of a tweet text, and usually occur at the end of tweets generated from within another application. Finally, we eliminated any consecutive duplication of a token. Duplication of tokens mostly occurs with the dummy forms for user names and URLs, and with event-related key words and entities. For instance, in the tweet "#nepal: nepal: nepal earthquake: main language groups (10 may 2015) urlurlurl #crisismanagement", two of the three consecutive tokens at the beginning were removed in this last step of normalization. This last step facilitates the process of identifying the actual content of the tweet text.
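The normalisation step can be summarised in a few lines of Python. This is a minimal sketch rather than Relevancer's actual code: the function name, the (incomplete) list of automatically generated fragments, and the way duplicate tokens are matched are assumptions made for illustration.

<pre>
import re

# Manually determined, automatically generated fragments (example subset only).
AUTO_PARTS = [
    "live updates:",
    "i posted 10 photos on facebook in the album",
    "via usrusrusr",
]

def normalise_tweet(text: str) -> str:
    # Replace URLs and user names with dummy values.
    text = re.sub(r"https?://\S+", "urlurlurl", text)
    text = re.sub(r"@\w+", "usrusrusr", text)
    # Strip automatically generated parts at the beginning or end of the tweet.
    for part in AUTO_PARTS:
        if text.lower().startswith(part):
            text = text[len(part):].strip()
        if text.lower().endswith(part):
            text = text[:-len(part)].strip()
    # Drop words ending in '...': these are mostly cut off by the length limit.
    tokens = [t for t in text.split() if not t.endswith("...")]
    # Collapse consecutive duplicates; '#nepal:', 'nepal:' and 'nepal' count as the
    # same token here (an assumption about how Relevancer matches duplicates).
    canon = lambda t: t.lower().lstrip("#").rstrip(":")
    deduped = []
    for token in tokens:
        if not deduped or canon(deduped[-1]) != canon(token):
            deduped.append(token)
    return " ".join(deduped)

# Example: two of the three leading 'nepal' tokens are removed.
print(normalise_tweet("#nepal: nepal: nepal earthquake: help neede... http://t.co/x"))
</pre>

In the example call, the repeated 'nepal' tokens are collapsed into one, the truncated word is dropped, and the URL is replaced by its dummy form, mirroring the example given above.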
3.2 Clustering and labeling

The clustering step aims at finding topically coherent groups of tweets that we call information threads. These groups are labeled as relevant, irrelevant, or incoherent. Coherent clusters were selected from the output of the K-Means clustering algorithm with k = 200, i.e. a preset number of 200 clusters (we used scikit-learn v0.17.1, http://scikit-learn.org, for all machine learning tasks in this study). The coherency of a cluster is calculated based on the distance between the tweets in a particular cluster and the cluster center. Tweets that are in incoherent clusters (as determined by the algorithm) were clustered again, relaxing the coherency restrictions until the algorithm reaches the requested number of coherent clusters. The second stop criterion for the algorithm is the limit on the relaxation of the coherency parameter.

The coherent clusters were then extended with tweets that were not in any coherent cluster. This step was performed by iterating over all coherent clusters in descending order of the total length of the tweets in a cluster, and adding to a cluster those tweets that have a cosine similarity higher than 0.85 with respect to its center. The total number of tweets transferred to the clusters in this way was 847.

As Relevancer treats the posting date as relevant information, the tool first searches for coherent clusters of tweets in each day separately. Then, in a second step, it clusters all tweets from all days that were not previously placed in any coherent cluster. Applying the two steps sequentially enables Relevancer to detect local and global information threads, respectively, as coherent clusters.

For each cluster thus identified, a list of tweets is presented to an expert, who then determines which are the relevant and irrelevant clusters (the first author of this working note acted as the expert for this task; a real scenario would require a domain expert). Clusters that contain both relevant and irrelevant tweets are labeled as incoherent by the expert (although the algorithmic approach determines which clusters are returned as coherent, the expert may not agree with it). Relevant clusters are those which an expert considers to be relevant for the aim she wants to achieve. In the present context, more specifically, clusters that are about a topic specified as relevant by the task organisation team should be labeled as relevant; any other coherent cluster should be labeled as irrelevant.
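The clustering and cluster-extension procedure can be sketched with scikit-learn as follows. This is a rough approximation under stated assumptions: TF-IDF features, a coherence test based on the mean distance of a cluster's tweets to its center (the threshold max_mean_dist is illustrative), and a single clustering pass. The iterative relaxation of the coherency parameter and the per-day clustering described above are omitted, and Relevancer's actual coherence criteria may differ.

<pre>
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def cluster_tweets(tweets, k=200, max_mean_dist=0.8, sim_threshold=0.85):
    """Cluster tweets, keep the coherent clusters, and extend them."""
    X = TfidfVectorizer().fit_transform(tweets)
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    coherent, leftover = {}, []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        if len(idx) == 0:
            continue
        # Coherence proxy: mean distance of the cluster's tweets to its center.
        dists = np.linalg.norm(X[idx].toarray() - km.cluster_centers_[c], axis=1)
        if dists.mean() <= max_mean_dist:
            coherent[c] = list(idx)
        else:
            # In Relevancer these tweets are re-clustered with relaxed limits.
            leftover.extend(idx)
    # Extend coherent clusters, largest total tweet length first, with left-over
    # tweets whose cosine similarity to the cluster center exceeds the threshold.
    remaining = set(leftover)
    order = sorted(coherent,
                   key=lambda c: sum(len(tweets[i]) for i in coherent[c]),
                   reverse=True)
    for c in order:
        center = km.cluster_centers_[c].reshape(1, -1)
        for i in sorted(remaining):
            if cosine_similarity(X[i], center)[0, 0] > sim_threshold:
                coherent[c].append(i)
                remaining.discard(i)
    return coherent, sorted(remaining)
</pre>

In Relevancer itself this pass is first applied to each day's tweets separately and then to the residue of all days combined, and the tweets left over here are re-clustered with relaxed coherency limits rather than discarded.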
3.3 Creating the classifier

The classifier was trained on the tweets labeled as relevant or irrelevant in the previous step; tweets in the incoherent clusters were not used. The Naive Bayes method was used to train the classifier, with word unigrams and bigrams as features.

We used a small set of stop words. These are a small set of key words (nouns), viz. nepal, earthquake, quake, kathmandu and their hashtag versions (this set was based on our own observation, as we did not have access to the key words that were used to collect this data set); the determiners the, a, an; the conjunctions and, or; the prepositions to, of, from, with, in, on, for, at, by, about, under, above, after, before; and the news-related words breaking and news and their hashtag versions. The normalized forms of the user names and URLs, usrusrusr and urlurlurl, are included in the stop word list as well.

We optimized the smoothing prior parameter α to 0.31 by cross-validation, comparing the classifier performance for 20 equally spaced values of α between 0 and 2. The performance of the classifier on 15% held-out data is provided in Tables 1 and 2 (since we optimize the classifier for this collection, the performance of the classifier on unseen data is not relevant here).

               Irrelevant   Relevant
  Irrelevant       720          34
  Relevant          33         257

Table 1: Confusion matrix of the Naive Bayes classifier on the test data. The rows and the columns represent the actual and the predicted labels of the test tweets, respectively. The diagonal provides the number of correct predictions.

              precision   recall    F1   support
  Irrelevant     .96        .95    .96       754
  Relevant       .88        .89    .88       290
  Avg/Total      .94        .94    .94     1,044

Table 2: Precision, recall, and F1-score of the classifier on the test collection. The recall is based on the test set.

The whole collection was classified with the trained Naive Bayes classifier; 11,300 tweets were predicted as relevant. We continued the analysis with these relevant tweets.
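The training and tuning of the classifier described above can be sketched as follows, using current scikit-learn interfaces rather than the v0.17.1 API used for the paper. The train/test split, the exact α grid, the cross-validation setup, and the tokenization that keeps hashtags as tokens are assumptions for illustration; the stop-word list follows the description in the text.

<pre>
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split

# Stop words as listed in the text: event key words, determiners, conjunctions,
# prepositions, news-related words, and the user-name/URL dummies.
STOP_WORDS = [
    "nepal", "#nepal", "earthquake", "#earthquake", "quake", "#quake",
    "kathmandu", "#kathmandu", "the", "a", "an", "and", "or", "to", "of",
    "from", "with", "in", "on", "for", "at", "by", "about", "under", "above",
    "after", "before", "breaking", "#breaking", "news", "#news",
    "usrusrusr", "urlurlurl",
]

def train_relevance_classifier(texts, labels):
    # Hold out 15% of the labeled tweets, as for the evaluation in Tables 1 and 2.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.15, random_state=0)
    pipeline = Pipeline([
        # Word unigrams and bigrams; the token pattern keeps hashtags as tokens
        # so that the hashtag stop words apply (assumption about tokenization).
        ("vec", CountVectorizer(ngram_range=(1, 2), stop_words=STOP_WORDS,
                                token_pattern=r"(?u)\b\w\w+\b|#\w+")),
        ("nb", MultinomialNB()),
    ])
    # 20 roughly equally spaced alpha values between 0 and 2 (alpha = 0 itself is
    # skipped for numerical reasons); the paper reports an optimum of alpha = 0.31.
    grid = GridSearchCV(pipeline, {"nb__alpha": np.linspace(0.1, 2.0, 20)}, cv=5)
    grid.fit(X_train, y_train)
    return grid.best_estimator_, grid.best_params_, (X_test, y_test)
</pre>

The returned estimator can then be applied to the whole collection (model.predict(all_tweets)); in the paper's run this yielded the 11,300 tweets predicted as relevant.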
3.4 Clustering and labeling relevant tweets

The relevant tweets, as predicted by the automatic classifier, were clustered without filtering them on the coherency criteria. In contrast to the first clustering step, the output of K-means was used as is, again with k = 200. These clusters were annotated using the seven topics predetermined by the task. To the extent possible, incoherent clusters were labeled with the closest provided topic; otherwise, the cluster was labeled as irrelevant.

The clusters that have a topic label contain 8,654 tweets. Since the remaining clusters, containing 2,646 tweets, were evaluated as irrelevant, they were not included in the submitted set.

4. RESULTS

The result of our submission was recorded under the ID relevancer ru nl. The performance of our results was evaluated by the organisation committee at ranks 20, 1,000, and all, considering the tweets retrieved up to the respective ranks. As announced by the organisation committee, our results are as follows: 0.3143 precision at rank 20, 0.1329 recall and 0.0319 Mean Average Precision (MAP) at rank 1,000, and 0.0406 MAP considering all tweets in our submitted results.

We generated an additional evaluation of our results based on the annotated tweets provided by the task organizers. The overall precision and recall are 0.081 and 0.34 respectively. The performance for the topics FMT1 (available resources), FMT2 (required resources), FMT3 (available medical resources), FMT4 (required medical resources), FMT5 (resource availability at certain locations), FMT6 (NGO and governmental organization activities), and FMT7 (infrastructure damage and restoration reports) is provided in Table 3.

          precision   recall    F1   percentage
  FMT1       0.17      0.50    0.26      0.27
  FMT2       0.35      0.09    0.15      0.14
  FMT3       0.19      0.28    0.23      0.16
  FMT4       0.06      0.06    0.06      0.05
  FMT5       0.05      0.06    0.06      0.09
  FMT6       0.05      0.74    0.09      0.18
  FMT7       0.25      0.08    0.12      0.12

Table 3: Precision, recall, and F1-score of our submission, and the percentage of the annotated tweets per topic.

On the basis of these results, we conclude that the success of our method differs drastically across topics. In Table 3, we observe a clear relation between the F1-score and the percentage of the tweets per topic in the manually annotated data. Consequently, we conclude that our method performs better when a topic is well represented in the collection.

5. CONCLUSION

In this study we applied the methodology supported by the Relevancer system in order to identify relevant information by enabling human input in terms of cluster labels. This method yielded an average performance in comparison to the other participating systems.

We observed that clustering the tweets of each day separately enabled the unsupervised clustering algorithm to identify specific coherent clusters in a shorter time than clustering the whole set at once would take. Moreover, this setting provided an overview that realistically changes from day to day in the period following the earthquake.

Our approach is optimized to incorporate human input. In principle, an expert should be able to refine a tweet collection until she reaches a point where the time spent on the task is optimal and the performance is sufficient. However, for this particular task an annotation manual was not available, and the expert had to stop after one iteration without being sure to what extent certain information threads were actually relevant to the task at hand; for example, are (clusters of) tweets pertaining to providing or collecting funds for the disaster victims considered relevant or not?

It is important to note that the Relevancer system yields its results in random order, as it has no mechanism that ranks posts for relative importance. We speculate that rank-based performance metrics are therefore not optimally suited for evaluating it.

In our future work we will aim to increase the precision and diminish the performance differences across topics, possibly by using downsampling or upsampling methods to tackle class imbalance.

6. ACKNOWLEDGEMENTS

This research was funded by the Dutch national research programme COMMIT.

7. REFERENCES

[1] S. Ghosh and K. Ghosh. Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.

[2] A. Hürriyetoğlu, C. Gudehus, N. Oostdijk, and A. van den Bosch. Relevancer: Finding and labeling relevant information in tweet collections. In E. Spiro and Y.-Y. Ahn, editors, Social Informatics, volume 10046. Springer International Publishing, November 2016.

[3] A. Hürriyetoğlu, A. van den Bosch, and N. Oostdijk. Analysing role of key term inflections in knowledge discovery on Twitter. In International Workshop on Knowledge Discovery on the Web, Cagliari, Italy, September 2016.