UAMCLyR at RepLab 2013: Monitoring Task*
Notebook for RepLab at CLEF 2013

Christian Sánchez-Sánchez, Héctor Jiménez-Salazar, Wulfrano Arturo Luna-Ramírez
Departamento de Tecnologías de la Información
Universidad Autónoma Metropolitana, Unidad Cuajimalpa
Vasco de Quiroga 4871, Col. Santa Fe, México D.F.
{csanchez,wluna,hjimenez}@correo.cua.uam.mx

* This work was partially supported by CONACyT México Project Grant CB-2010/153315, and SEP-PROMEP Project Grant UAM-C-CA-31/10847.

Abstract. In this article we address the Topic Detection and Priority Detection subtasks of RepLab 2013, applying clustering and classification methods as well as term selection techniques in order to assess their performance on two subcollections of tweets: single and extended (a single tweet plus its derived tweets). Our tests show good performance despite using very few resources.

Keywords: Tweet Clustering, Tweet Classification, Term Selection Techniques

1 Introduction

Twitter has become a very popular social interaction place where users give their opinions about companies and their products via tweets. Twitter has thus become a significant repository of opinions on companies, brands, and persons, and such entities have an interest in protecting their reputation and in dealing with unfounded gossip that can affect their image and income. RepLab 2013 [12] addresses several research challenges on Twitter; one of them is the Monitoring task, which consists of:

– clustering tweets based on their attributes
– ordering tweets by priority for each entity

Monitoring tweets in this way is a significantly challenging task, given that tweet messages are very short and noisy. Some authors address these problems through the idea of concept term expansion in tweets, performing one or more clustering phases, as well as through unsupervised clustering techniques; additionally, some authors use supervision for priority level assessment [1].

The main motivations of our experiments were:

– To evaluate how much previously seen topics and terminology (a supervised approach) can help to identify new topics.
– To estimate how much the conversations derived from a tweet can help to detect topics.
– To determine which kinds of terms can improve Topic Detection.
– To evaluate how useful features extracted from tweets' metadata are for determining the priority of tweets.

In this paper we explore clustering and classification methods as well as term selection techniques in order to assess their performance over two subcollections of tweets: single and extended (single tweet plus derived tweets). Our tests show good performance despite using very few resources. In the following section we describe the data and its preprocessing. Section 3 outlines the methods applied to the data collection for the Topic Detection subtask. In Section 4 we detail the attributes and the approach followed for the tweet Priority Detection subtask. The results are presented in Section 5 and, finally, our conclusions are drawn in Section 6.

2 Data description and preprocessing

The corpus of tweets used in the experiments was formed considering two kinds of texts: the main text of the tweet (Main) and the conversation derived from the main text (All). Thus, from each of the training and testing collections, we built two collections, namely:

– Main-Training,
– All-Training,
– Main-Testing, and
– All-Testing.
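For illustration only, the following is a minimal sketch, in Python, of how the Main and All variants of a collection could be assembled; the record layout assumed here (a "text" field plus a "replies" list holding the derived conversation) is hypothetical and is not the RepLab data format.

```python
def build_collections(records):
    """Build the Main (tweet only) and All (tweet plus derived
    conversation) variants of a collection of tweet records."""
    main, extended = [], []
    for rec in records:
        main.append(rec["text"])
        # All = the main text followed by the texts of its replies
        extended.append(" ".join([rec["text"]] + rec.get("replies", [])))
    return main, extended

# Toy example
records = [{"text": "Great service by @acme!", "replies": ["I agree"]}]
main_coll, all_coll = build_collections(records)
```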
These collections allow us to organize the experiments in pairs: given a particular method for some subtask, it was applied to each collection as two independent runs.

Regarding the Topic Detection subtask, it is important to mention that in all experiments only the information contained in the text of the tweets was taken into account. It is also worth mentioning that we classify and cluster the tweets of each entity separately, in order to ease the subtask. Additionally, for each entity we work on two subsets, English and Spanish tweets, filtered through the Language attribute.

All experiments for the Topic Detection subtask shared the same preprocessing: stop words, morphological and inflectional endings (Porter stemming) [2], internal links, and user names were removed from the tweets. Furthermore, the text representation of tweets was supported by term selection techniques and, for clustering and classifying the information with WEKA [3], we modeled tweets through a Bag of Words (BOW) representation with a boolean weighting scheme.

The Priority Detection subtask was also applied per entity, considering attributes extracted from the Training and Testing collections; the Main/All distinction of the Topic Detection subtask was not used here. For this subtask, the tweet text and some of its related attributes were extracted from the tweet HTML file and stored as a plain text file. When there were responses to the tweet, they were appended to the plain text file. During this process we observed that, in the gold standard, some tweets were not related to any entity; these were discarded.

3 Topic detection subtask

In the first four experiments of the Topic Detection subtask, we aimed to assess how useful the training set is for identifying topics on Twitter. Experiments seven and eight are completely unsupervised, and their performance is compared with that of the first six tests.

3.1 Supervised detection

We applied two classification algorithms, Naive Bayes and Sequential Minimal Optimization Support Vector Machines (SMO SVM) [4][5]. After testing several combinations and configurations of these algorithms on the Main-Training and All-Training collections, the best configuration was SMO SVM (with a polynomial kernel and standardized data); this configuration was then applied to the Testing collection to classify its elements.

3.2 Unsupervised detection

We performed three pairs of experiments in an unsupervised manner. Two of them select terms according to the percentage of terms that obtains the best performance on the training dataset. A final pair of experiments applies a method that automatically selects the set of terms used in the representation of tweets.

3.3 DF term selection

In order to improve unsupervised Topic Detection, two term selection methods were tested: the well-known document frequency index, DF, the number of documents that contain a term; and the transition point, TP, the frequency that divides the term frequencies into high and low [6]. After evaluating several combinations and configurations of these methods over both collections (Main and All), the best results were obtained using DF with the 43% of terms having the highest DF value (the Spanish subset used only 34%).
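To make the preprocessing of Sec. 2 and the DF selection above concrete, here is a minimal sketch assuming scikit-learn and NLTK (with its stopword corpus installed) are available; the regular expressions for links and user names and the 43% cutoff follow the description above, but this is an illustration rather than the implementation actually used.

```python
import re
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

def preprocess(tweet):
    # Remove internal links and @user names, then stem the remaining words
    text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    tokens = [stemmer.stem(w) for w in re.findall(r"[a-z]+", text)
              if w not in stop]
    return " ".join(tokens)

def df_select(docs, keep=0.43):
    # Boolean BOW; the document frequency of a term is the column sum
    # of the binary term-document matrix
    vec = CountVectorizer(binary=True)
    X = vec.fit_transform([preprocess(d) for d in docs])
    df = np.asarray(X.sum(axis=0)).ravel()
    k = max(1, int(keep * len(df)))
    top = np.argsort(df)[::-1][:k]  # indices of the highest-DF terms
    return X[:, top], list(vec.get_feature_names_out()[top])
```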
3.4 Unifier term selection

This test builds on the diversification and unification concepts proposed by Zipf [7], which have been used for the clustering of web services [8]. Two measures were used to select terms: the unifier degree of a term, U, and the saturation of a set of terms, Ŝ. Given a collection of documents C = {d_1, ..., d_n}, they are defined as

    U(t_i) = \frac{1}{r} \sum_{j \neq i} \mathrm{sim}(\bar{t}_i, \bar{t}_j),

and

    \hat{S}(C) = \frac{2}{n(n-1)} \sum_{i \neq j} \mathrm{sim}(d_i, d_j),

where t̄ is the representation of the term t given by the classes in which t occurs, r = #{t_j | sim(t̄_i, t̄_j) ≠ 0}, and sim is a similarity measure. In our experiments we used the Jaccard coefficient as the similarity measure, and the classes used to represent the terms (t̄) were provided by a clustering of the tweets of the same working collection (without term selection). Here we used the K-Star clustering algorithm [9]. In these experiments we discarded those tweets containing no words from the term selection. Summarizing, the method follows two steps:

1. Select terms based on U and Ŝ:
   (a) Calculate U(t) for all terms of C and sort them in increasing order, namely T_U = [U(t_1), ..., U(t_k)].
   (b) Divide T_U into m parts, providing m sets of terms V_i (1 ≤ i ≤ m), where V_i contains the first i parts of terms (our experiments used m = 10).
   (c) Compute the array [Ŝ(C_i)], whose elements correspond to each selection set V_i, and determine the index j of the maximum descending value of Ŝ(C_i).
2. Apply the K-Star clustering algorithm to C_j.

(An illustrative sketch of the computation of U and Ŝ is given at the end of Sec. 4.)

4 Priority detection subtask

4.1 Attributes used for the priority subtask

From the plain text files, a set of seven attributes was calculated, described as follows:

1. Referenced users: the number of user tags or e-mail-like tokens found within the tweet text; i.e., tokens of the form @string are counted as referenced users.
2. Hashtags: the number of hashtag symbols (#) is counted.
3. Web addresses: the number of http tokens is counted.
4. Tweet length in characters.
5. Frequency of retweets: this measure is contained in the tweet information and is considered as an attribute.
6. Frequency of favorites: this measure of the popularity of the tweet is also contained in the tweet information and is likewise considered as an attribute.
7. Conversation generated: a boolean attribute calculated from the presence or absence of responses to each tweet.

Every tweet text file belonging to each entity was processed to calculate these seven attributes, in order to classify the tweets as MILDLY IMPORTANT, ALERT, or UNIMPORTANT according to the training set (an illustrative sketch of this extraction is given at the end of this section).

4.2 Supervised detection

The WEKA application was used to perform three runs classifying the test tweets, as required by the Priority Detection subtask. Three classifiers were applied to the files of attributes calculated for each tweet belonging to the entities: the tree inducer algorithm J48, Naive Bayes, and the SMO function (Support Vector Machine with a polynomial kernel) [10][4][5][11]. As stated above, training was executed over the training collection. In some cases there were entities with no tweets belonging to the ALERT class; those files were filled with 30 tweets taken from all the entities, in order to preserve the three classes throughout the training set and thus to obtain appropriate classifier models. The three classifiers were then applied to the test set as mentioned.
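As referenced in Sec. 4.1, the following is a minimal sketch of how the seven priority attributes could be computed for one tweet; the metadata inputs (retweet and favorite counts, presence of replies) are assumed to be read beforehand from the stored tweet information, and all names are illustrative.

```python
import re

def priority_attributes(text, retweets, favorites, has_replies):
    """Compute the seven attributes of Sec. 4.1 for one tweet."""
    return {
        "referenced_users": len(re.findall(r"@\w+", text)),  # @string tokens
        "hashtags": text.count("#"),                         # hashtag symbols
        "web_addresses": len(re.findall(r"http", text)),     # http tokens
        "length": len(text),                                 # characters
        "retweets": retweets,                                # from metadata
        "favorites": favorites,                              # from metadata
        "conversation_generated": bool(has_replies),         # any responses?
    }

# Toy example
feats = priority_attributes("Check http://t.co/x #sale @acme", 3, 1, True)
```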
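Returning to the unifier term selection of Sec. 3.4, the sketch below illustrates how U and Ŝ could be computed once cluster assignments for the tweets are available; the set-based term representation and all names here are illustrative assumptions, not the authors' code.

```python
from itertools import combinations

def jaccard(a, b):
    # Jaccard coefficient between two sets
    return len(a & b) / len(a | b) if a | b else 0.0

def unifier_degrees(term_classes):
    # term_classes: {term: set of cluster ids in which the term occurs},
    # i.e. the representation of each term by its classes
    U = {}
    for ti, rep_i in term_classes.items():
        sims = [jaccard(rep_i, rep_j)
                for tj, rep_j in term_classes.items() if tj != ti]
        nonzero = [s for s in sims if s > 0]
        # r = number of terms with nonzero similarity to ti
        U[ti] = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return U

def saturation(docs):
    # docs: documents as sets of (selected) terms;
    # Ŝ averages the similarities over unordered document pairs
    n = len(docs)
    if n < 2:
        return 0.0
    total = sum(jaccard(a, b) for a, b in combinations(docs, 2))
    return 2 * total / (n * (n - 1))
```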
5 Experimental results

5.1 Topic Detection experiments

Four pairs of experiments were carried out. Each pair deals with a pair of collections, Main-Testing (Main-Training) and All-Testing (All-Training), as described in Sec. 2. For instance, Pair One consists of the runs UAMCLyR topic detection 1 and UAMCLyR topic detection 2, which use the Main-Testing and All-Testing collections, respectively.

Table 1 shows how each experiment pair was performed: method, approach (classification/clustering), and term selection criterion. Table 2 shows, for each run, the collection used and, in descending order, the F values based on Reliability and Sensitivity, together with the Baseline defined in [12] for the Topic Detection subtask. As can be seen, all our experiments are above the Baseline. We observed better performance with clustering than with classification. Term selection based on U and Ŝ provided the best result; however, this method was not able to determine the best terms when the collection was extended from Main to All.

Table 1. Summary of Topic Detection experiments.

  Experiment   Method    Approach        Selection
  Pair One     K-Means   Classification  DF
  Pair Two     SMO SVM   Classification  DF
  Pair Three   K-Means   Clustering      DF
  Pair Four    K-Star    Clustering      U, Ŝ

Table 2. F(R, S) values of the UAMCLyR Topic Detection subtask.

  Run                          Dataset   F
  UAMCLYR topic detection 07   Main      0.238
  UAMCLYR topic detection 05   Main      0.224
  UAMCLYR topic detection 06   All       0.224
  UAMCLYR topic detection 08   All       0.212
  UAMCLYR topic detection 03   Main      0.198
  UAMCLYR topic detection 04   All       0.192
  UAMCLYR topic detection 01   Main      0.184
  UAMCLYR topic detection 02   All       0.177
  BASELINE                     -         0.173

5.2 Priority Detection experiments

The Priority Detection experiments were carried out over the All-Testing collection using three classification methods, as shown in Table 3.

Table 3. Summary of Priority Detection experiments.

  Experiment                      Method
  UAMCLYR priority detection 01   J48
  UAMCLYR priority detection 02   Naive-Bayes
  UAMCLYR priority detection 03   SMO SVM

The results of the three Priority Detection experiments can be seen in Table 4, in descending order of the F value based on Reliability and Sensitivity, together with the Baseline defined in [12] for this subtask, and including the Accuracy measure for all the runs performed.

Table 4. F(R, S) and Accuracy values of the UAMCLyR team on the Priority Detection subtask.

  Run                             F       Accuracy
  BASELINE                        0.296   0.600
  UAMCLYR priority detection 02   0.201   0.459
  UAMCLYR priority detection 01   0.172   0.559
  UAMCLYR priority detection 03   0.088   0.573

6 Conclusions and future work

In all Topic Detection cases, the clustering approach outperformed the classification approach. Additionally, we found that including the conversation derived from a tweet made detection worse. In particular, it can be observed that in all supervised Topic Detection experiments, the term selection method was unable to correctly discriminate the relevant terms when the corpus was extended, i.e., from the Main to the All collection. The term selection based on unification provided the best results, perhaps because it was calculated directly from the test collection. However, unifier term selection is not sensitive to the growth of the vocabulary. We plan to carry out additional tests, mainly of the term selection techniques used in the clustering of tweets.
It can also be claimed that it is possible to detect priority in tweets based on classification models that rely only on a few attributes calculated from the metadata features of tweets. From these models, acceptable results can be obtained when classifying new instances. As future work, the Priority Detection method could be tested in two ways in order to improve it:

– separating the tweets by language, and
– extracting models based on other attributes that take into account linguistic features of the texts.

References

1. T. Martín, D. Spina, E. Amigó & J. Gonzalo (2012) UNED at RepLab 2012: Monitoring Task. CLEF 2012 Working Notes.
2. Porter, M. F. (1997) An algorithm for suffix stripping. Morgan Kaufmann Publishers Inc., pp. 313-316.
3. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann & I. H. Witten (2009) The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.
4. Steve R. Gunn (1998) Support Vector Machines for Classification and Regression. University of Southampton, Technical Report.
5. Platt, John C. (1998) Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Technical Report MSR-TR-98-14.
6. H. Jiménez, D. Pinto & P. Rosso (2005) Uso del punto de transición en la selección de términos índice para agrupamiento de textos cortos. Revista Procesamiento del Lenguaje Natural, No. 35, pp. 416-421, España.
7. G. K. Zipf (1949) Human Behaviour and the Principle of Least-Effort. Addison-Wesley, Cambridge, MA.
8. H. Jiménez, Ch. Sánchez, C. Rodríguez & W. Luna (2011) Modelación léxico semántica de descripciones de servicios web. 8o. Taller de Tecnologías del Lenguaje Humano, Complejo Cultural Universitario, BUAP, Puebla.
9. K. Shin & S. Y. Han (2003) Fast clustering algorithm for information organization. Lecture Notes in Computer Science, Vol. 2588, pp. 619-622, Springer.
10. J. Ross Quinlan (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
11. Mitchell, T. (2006) The discipline of machine learning. Technical Report CMU-ML-06-108, Carnegie Mellon University.
12. E. Amigó, J. Carrillo de Albornoz, I. Chugur, A. Corujo, J. Gonzalo, T. Martín, E. Meij, M. de Rijke & D. Spina (2013) Overview of RepLab 2013: Evaluating Online Reputation Monitoring Systems. In Proceedings of the Fourth International Conference of the CLEF Initiative, CLEF 2013. Springer LNCS, Valencia, Spain.