         UAMCLyR at RepLab 2013: Profiling Task⋆
                      Notebook for RepLab at CLEF 2013

                  Esaú Villatoro-Tello1, Carlos Rodríguez-Lucatero1,
              Christian Sánchez-Sánchez1, and A. Pastor López-Monroy2

                       1 Departamento de Tecnologías de la Información,
                  Universidad Autónoma Metropolitana, Unidad Cuajimalpa,
                 Ave. Vasco de Quiroga Num. 4871, Col. Santa Fe, México D.F.
               {evillatoro,crodriguez,csanchez}@correo.cua.uam.mx

                              2 Department of Computer Science,
                Instituto Nacional de Astrofísica, Óptica y Electrónica, México.
                                   pastor@ccc.inaoep.mx



       Abstract. This paper describes the participation of the Language and Reasoning
       Group of UAM at the RepLab 2013 Profiling evaluation lab. We adopted Distribu-
       tional Term Representations (DTRs) for facing the following problems: i) filtering
       tweets that are related to an entity, and ii) identifying positive or negative implica-
       tions for the entity’s reputation, i.e., polarity for reputation. DTRs help to overcome,
       to some extent, the small-length/high-sparsity issues of tweets: they represent terms
       by means of contextual information given by term co-occurrence statistics. To
       evaluate our approach, we compared it against the traditional Bag-of-Words
       representation. The obtained results indicate that DTRs can increase the reliability
       score of a profiling system.

       Keywords: Bag of words, Distributional term representations, Term co-occurrence
       representation, Term selection, Supervised text classification


1 Introduction
Since its inception in 2006, Twitter has become one of the most important platforms
for microblog posts. Recent statistics reveal that there are more than 200 million users
who write more than 400 million posts every day3, covering a great diversity of topics.
As a consequence, several entities such as companies, celebrities and politicians are
very interested in using this type of platform to increase or even improve their presence
among Twitter users, aiming at obtaining good reputation values. As an important effort
towards providing effective solutions to this problem, RepLab4 proposes a competitive
evaluation exercise for Online Reputation Management (ORM) systems. One of the
main tasks evaluated in RepLab is the Profiling task.
⋆ This work was partially supported by CONACyT México Project Grant CB-2010/153315, and
  SEP-PROMEP Project Grant UAM-C-CA-31/10847.
3 http://blog.twitter.com/2013/03/celebrating-twitter7.html
4 http://www.limosine-project.eu/events/replab2013
This task consists of mining the reputation of a company from online media. Adequate
profiling systems must be able to retrieve posts from several online sources and annotate
them according to their relevance, i.e., to preserve online documents related to the
company and to identify all positive or negative implications for the company contained
in such documents [1].
     As mentioned in [1], systems that face the profiling task must annotate two different
types of information: i) Filtering: an automatic system must be able to decide whether
a given tweet is related to a particular company or not. Basically, this is a two-class
problem, since systems must tag a tweet as “related” or “not related”; and ii) Polarity
for Reputation: the idea of this subtask is to identify whether a given tweet contains
positive or negative implications for the company’s reputation. This is a three-class
problem, since an automatic system has to assign a “positive”, “negative” or “neutral”
tag to each tweet related to a particular company.
     Our proposed approach for facing both the filtering and polarity problems is based on
distributional term representations (DTRs) [3], which are a way to represent terms by
means of contextual information given by term co-occurrence statistics. Accordingly,
this paper presents the details of the participation of the Language and Reasoning group
from UAM-C in the CLEF 2013 RepLab profiling task (i.e., filtering and polarity for
reputation). The main objectives of our experiments were:

 1. To test whether a richer document representation based on term co-occurrences can
    be successfully applied to the filtering and polarity subtasks.
 2. To estimate how well our previously developed methods for sentiment analysis
    on Twitter can be adapted for detecting positive and negative implications of tweets
    in the context of the RepLab exercise.
 3. To evaluate to what extent supervised techniques are able to solve both the filtering
    and polarity problems.

    The rest of this paper is organized as follows. The next section describes the
steps of the pre-processing stage. Section 3 describes the proposed representation
strategy. Section 4 describes the experimental setup we followed, as well as the
results obtained for both the filtering and polarity subtasks. Finally, Section 5 presents
the conclusions derived from this work and outlines future work directions.


2 Tweets pre-processing

For all our experiments we collected two different versions of the tweet collection,
which are described below:

Main: For this configuration we crawled only the main tweet from each given tweet
    id. In other words, all other tweets contained in the original tweet id (e.g., answers
    or comments generated by the original tweet) are ignored.
All: For this configuration, we crawled both the main tweet and all answers or com-
    ments generated by the original tweet from each given tweet id.
    When retrieving the All version of the tweet collection, our idea was to evaluate the
impact of the conversational elements of a tweet when deciding its polarity as well as
its relevance. Notice that this crawling procedure was replicated when retrieving the
test tweets.
    As pre-processing we applied the following procedures to each tweet in the two
versions of the tweet collection (i.e., Main and All); a sketch of the whole pipeline is
shown after the list:
 1. All tweets are transformed to lowercase.
 2. All user mentions (i.e., @user) are replaced by the tag: AT-USER.
 3. Every outgoing link is replaced by the tag: OUTGOING-LINK; hence, our experi-
    ments did not use the information contained in these links, although we believe it
    could be useful when trying to detect whether a tweet is related to a company.
 4. All hashtags (i.e., #hashtagX) are replaced by the tag: HASHTAG.
 5. All punctuation marks as well as emoticons are deleted.
 6. We apply Porter stemming [2].
 7. All stopwords are deleted.
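
The following is a minimal sketch of this preprocessing pipeline in Python. It assumes
NLTK's Porter stemmer and English stopword list and simple regular expressions for
mentions, links and hashtags; the actual toolchain and expressions used in our system
are not detailed here, so treat the function below as illustrative only.

```python
import re

from nltk.corpus import stopwords   # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words("english"))


def preprocess(tweet):
    """Turn a raw tweet into a list of stemmed, stopword-free tokens."""
    text = tweet.lower()                                    # 1. lowercase
    text = re.sub(r"@\w+", "AT-USER", text)                 # 2. user mentions
    text = re.sub(r"https?://\S+", "OUTGOING-LINK", text)   # 3. outgoing links
    text = re.sub(r"#\w+", "HASHTAG", text)                 # 4. hashtags
    text = re.sub(r"[^\w\s-]", " ", text)                   # 5. punctuation / emoticons
    tokens = [STEMMER.stem(t) for t in text.split()]        # 6. Porter stemming
    return [t for t in tokens if t not in STOPWORDS]        # 7. stopword removal


# Example call on a hypothetical tweet:
# preprocess("@BMW I love my new car! http://t.co/xyz #happy")
```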


3 Tweets representation
Distributional term representations (DTRs) are tools for term representation that rely
on term occurrence and co-occurrence statistics [3]. Intuitively, the meaning of a term
is determined by the context in which it occurs, where the context is given in terms of
other terms in the vocabulary. In this paper we consider one popular DTR, namely the
term co-occurrence representation. This DTR has mainly been used in term classification
and term clustering tasks, and very recently for short-text categorization [4], where its
potential benefits for term expansion are shown.
    The term co-occurrence representation (TCOR) is based on co-occurrence statistics.
The underlying idea is that the semantics of a term $t_j$ can be revealed by the other
terms it co-occurs with across the document collection. Here, each term $t_j \in T$ is
represented by a vector of weights $\mathbf{w}_j = \langle w_{1,j}, \ldots, w_{|T|,j} \rangle$,
where $0 \le w_{k,j} \le 1$ represents the contribution of term $t_k$ to the semantic
description of $t_j$:

$$ w_{k,j} = tff(t_k, t_j) \cdot \log \frac{|T|}{T_k} \qquad (1) $$

where $T_k$ is the number of different terms in the dictionary $T$ that co-occur with
$t_k$ in at least one document, and

$$ tff(t_k, t_j) = \begin{cases} 1 + \log(\#(t_k, t_j)) & \text{if } \#(t_k, t_j) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2) $$

where $\#(t_k, t_j)$ denotes the number of documents in which term $t_j$ co-occurs with
term $t_k$. The intuition behind this weighting scheme is that the more $t_k$ and $t_j$
co-occur, the more important $t_k$ is for describing term $t_j$; and the more terms
co-occur with $t_k$, the less important $t_k$ becomes for defining the semantics of $t_j$.
At the end, the vector of weights is normalized to have unit 2-norm: $\|\mathbf{w}_j\|_2 = 1$.
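
A compact sketch of how these TCOR term vectors could be computed is shown below,
assuming NumPy and document-level co-occurrence counts as defined above; the function
and variable names are ours and are only illustrative.

```python
import numpy as np


def tcor_vectors(docs_tokens, vocab):
    """Build one |T|-dimensional TCOR vector per vocabulary term (Eqs. 1-2)."""
    index = {t: i for i, t in enumerate(vocab)}
    size = len(vocab)
    # #(t_k, t_j): number of documents in which t_k and t_j co-occur
    cooc = np.zeros((size, size))
    for tokens in docs_tokens:
        present = list({index[t] for t in tokens if t in index})
        for a in present:
            for b in present:
                if a != b:
                    cooc[a, b] += 1
    # T_k: number of distinct terms co-occurring with each term (clipped to 1
    # so isolated terms do not cause a division by zero)
    t_k = np.maximum((cooc > 0).sum(axis=0), 1)
    # tff (Eq. 2): 1 + log(#(t_k, t_j)) when the terms co-occur, 0 otherwise
    tff = np.where(cooc > 0, 1.0 + np.log(np.maximum(cooc, 1.0)), 0.0)
    # w_{k,j} = tff(t_k, t_j) * log(|T| / T_k)  (Eq. 1); columns are indexed by k
    weights = tff * np.log(size / t_k)
    # normalize every term vector w_j (one row per term) to unit 2-norm
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    return np.divide(weights, norms, out=np.zeros_like(weights), where=norms > 0)
```

Row $j$ of the returned matrix plays the role of $\mathbf{w}_{t_j}$ in the document
representation described next.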
    Finally, let $\mathbf{w}_{t_j}$ denote the DTR of term $t_j$ in the vocabulary, where
$\mathbf{w}_{t_j}$ is the TCOR representation. The representation of a document $d_i$
based on this DTR is obtained as follows:

$$ d_i^{dtr} = \sum_{t_j \in d_i} \alpha_{t_j} \cdot \mathbf{w}_{t_j} \qquad (3) $$

where $\alpha_{t_j}$ is a scalar that weights the contribution of term $t_j \in d_i$ to the
document representation. Thus, the representation of a document is given by the (weighted)
aggregation of the contextual representations of the terms appearing in it. That is, the
document representation is a summary of the contextual information present in the terms
that appear in the document.
    Under TCOR, a document $d_i$ is represented by $d_i^{dtr} \in \mathbb{R}^{|T|}$, a
vector of the same dimensionality as the vocabulary. The values of $d_i^{dtr}$ indicate
the association between terms in the vocabulary and the terms that occur in $d_i$. Notice
that the scalar $\alpha_{t_j}$ weights the importance of term $t_j$ for describing document
$d_i$. Many options are available for defining $\alpha_{t_j}$; in this work we considered
the following weights: Boolean (BOOL), Term Frequency (TF), and Relative Frequency (TF-IDF).
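
Continuing the previous sketch, Eq. (3) then reduces to a weighted sum of rows of the
TCOR matrix. The snippet below illustrates the BOOL and TF choices for $\alpha_{t_j}$;
the TF-IDF variant additionally requires corpus-level document frequencies, which we
omit here for brevity.

```python
from collections import Counter

import numpy as np


def tcor_document(tokens, vocab, tcor_matrix, weighting="TF"):
    """Represent a document as the weighted sum of the TCOR vectors of its terms (Eq. 3)."""
    index = {t: i for i, t in enumerate(vocab)}
    counts = Counter(t for t in tokens if t in index)
    doc = np.zeros(tcor_matrix.shape[1])
    for term, freq in counts.items():
        alpha = 1.0 if weighting == "BOOL" else float(freq)  # alpha_{t_j}
        doc += alpha * tcor_matrix[index[term]]
    return doc
```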
    Notice that using this type of representation can lead to high-dimensionality problems,
since the number of terms (features) is usually very large. This, in turn, may lead to
over-fitting when training a classifier. A technique that has been used as a feature
selection strategy is to preserve the terms closest to the transition point $tp_T$ [5,6].
The $tp_T$ is a frequency value that divides the vocabulary terms $T$ into two sets:
low-frequency terms and high-frequency terms.
    In previous work [6] we showed that preserving high-frequency terms together with a
subset of low-frequency terms makes it possible to solve, to some extent, the problem of
assigning polarity values to Twitter posts, especially in a three-class setting (i.e.,
positive, negative and neutral). Accordingly, we defined a subset of experiments for the
polarity subtask that employ this strategy as a feature selection technique; a sketch of
this selection step is shown below.
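
The sketch below assumes the usual formulation of the transition point,
$tp_T = (\sqrt{8 \cdot I_1 + 1} - 1)/2$, where $I_1$ is the number of terms that occur
exactly once in the corpus, and then keeps the terms whose frequency falls in a
neighborhood of that value. The neighborhood width used here is an arbitrary placeholder,
not the value used in our experiments.

```python
import math
from collections import Counter


def transition_point_vocabulary(docs_tokens, width=0.4):
    """Keep the terms whose corpus frequency lies near the transition point tp_T.

    tp_T = (sqrt(8 * I1 + 1) - 1) / 2, with I1 the number of terms occurring
    exactly once; `width` controls the size of the neighborhood around tp_T.
    """
    freq = Counter(t for tokens in docs_tokens for t in tokens)
    i1 = sum(1 for f in freq.values() if f == 1)
    tp = (math.sqrt(8 * i1 + 1) - 1) / 2
    low, high = (1 - width) * tp, (1 + width) * tp
    return sorted(t for t, f in freq.items() if low <= f <= high)
```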


4 Experimental Results

For the RepLab 2013 edition, participant teams were given a large dataset (61 entities)
from four domains: automotive, banking, universities and music/artists. For the trial
dataset, approximately 700 tweets were provided for each entity. Contrary to the RepLab
2012 edition, the RepLab 2013 organizers provided as test dataset tweets from the same
61 entities that were used in the trial dataset; for these, approximately 1700 tweets
were crawled.
    Given this situation, i.e., the same entities for training and testing, we decided to
adopt a supervised strategy for solving the filtering and polarity problems. We report
our results on the test dataset in terms of Reliability, Sensitivity and their harmonic
mean [7].
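For reference, the harmonic mean of Reliability and Sensitivity is
$F(R,S) = \frac{2 \cdot R \cdot S}{R + S}$; since the official scores are averaged over the
test entities, the F(R,S) column in the tables below need not equal the harmonic mean of
the averaged R and S values shown.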
    As we mentioned in Section 1, our goal was to test whether employing a richer
document representation (see Section 3) makes it possible to solve both subtasks
involved in the profiling problem. Consequently, we defined as our baseline
method the traditional Bag-of-Words (BOW) representation. Finally, it is worth men-
tioning that for all our experiments we used Weka's5 Support Vector Machine
implementation with a linear kernel as our main classifier.
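
Although all our runs were trained with Weka's SVM, an equivalent pipeline for one
baseline configuration (BOW with boolean weighting, as in run 01 below) can be sketched
in scikit-learn as follows; the training and test lists are hypothetical placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Illustrative equivalent of a BOW + boolean-weighting run with a linear SVM
# (the actual experiments used Weka's SVM implementation, not scikit-learn).
bow_bool_svm = make_pipeline(
    CountVectorizer(binary=True),  # boolean bag-of-words over preprocessed tweets
    LinearSVC(),
)

# train_texts / train_labels and test_texts are hypothetical placeholders:
# bow_bool_svm.fit(train_texts, train_labels)
# predicted = bow_bool_svm.predict(test_texts)
```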

4.1 Filtering results
Table 1 describes the configuration of each performed experiment in terms of the type
of representation (BOW or TCOR), the weighting scheme (BOOL, TF or TF-IDF) and the
version of the tweet collection used (Main or All). Notice that each column, from the
2nd to the 7th, represents one experiment definition, i.e., one run (6 runs were
submitted in total).


            Table 1. Configuration for submitted experiments: Filtering subtask.

          Configuration/Run ID  Run 01  Run 02  Run 03  Run 04  Run 05  Run 06
          Representation        BOW     BOW     TCOR    BOW     BOW     TCOR
          Weighting             BOOL    TF      BOOL    BOOL    TF      BOOL
          Tweets                Main    Main    Main    All     All     All




   Table 2 shows the results obtained for the filtering subtask. The last two rows
indicate: i) the baseline performance as defined in [8], and ii) the average performance
of all participating teams in the RepLab 2013 edition.


                             Table 2. Filtering subtask results

         Run ID               Reliability (R) Sensitivity (S) F (R, S) Accuracy
         UAMCLyR filtering 01     0.6311         0.3960       0.3759 0.9132
         UAMCLyR filtering 02     0.5731         0.3132       0.2918 0.9007
         UAMCLyR filtering 03     0.6964         0.3038       0.3220 0.9041
         UAMCLyR filtering 04     0.5554         0.4015       0.3787 0.9110
         UAMCLyR filtering 05     0.5688         0.3075       0.2858 0.8996
         UAMCLyR filtering 06     0.6292         0.2828       0.2637 0.8906
         BASELINE                 0.4902         0.3199       0.3255 0.8714
         Average                  0.4663         0.2951       0.2596 0.7628




    Notice that using a BOW representation with a boolean weighting scheme (runs 01
and 04) yields the highest accuracy values. This might be an indicator that the mere
presence of certain words is enough to decide whether a tweet is related to a company
or not.
    Additionally, it is important to note that our DTR representations (runs 03 and 06)
achieved a better performance than the traditional BOW in terms of the reliability
measure without considerably decreasing the accuracy.
5 http://www.cs.waikato.ac.nz/ml/weka/index.html
These results are an indicator of better precision, which under a real scenario might be
more important than sensitivity.


4.2 Polarity for reputation results

Table 3 describes the configuration of each performed experiment for the polarity
subtask, and Table 4 shows the results obtained in these experiments.


      Table 3. Configuration for submitted experiments: Polarity for reputation subtask.

          Configuration/Run ID  Run 01  Run 02  Run 03  Run 04  Run 05      Run 06
          Representation        BOW     TCOR    BOW     TCOR    BOW         TCOR
          Weighting             TF-IDF  TF-IDF  TF      TF      BOOL        BOOL
          Tweets                Main    Main    All     All     All (tp_T)  All (tp_T)




    Notice that our best results in terms of reliability and accuracy were obtained using
a TCOR representation with a TF-IDF weighting scheme on the Main version of the tweets
only (i.e., run 02). This is an interesting result, since it indicates that the polarity
of a tweet can be determined by considering the context in which the tweet's terms occur.
In general, the DTR experiments (runs 02, 04 and 06) obtain better reliability
performance.


                              Table 4. Polarity subtask results

          Run ID              Reliability (R) Sensitivity (S) F (R, S) Accuracy
          UAMCLyR polarity 01     0.3461         0.2695       0.2922 0.5827
          UAMCLyR polarity 02     0.3802         0.2651       0.2946 0.6177
          UAMCLyR polarity 03     0.3480         0.2660       0.2891 0.5846
          UAMCLyR polarity 04     0.3696         0.1933       0.2251 0.5836
          UAMCLyR polarity 05     0.3291         0.2864       0.3008 0.5778
          UAMCLyR polarity 06     0.3440         0.1855       0.2157 0.5370
          BASELINE                0.3151         0.2899       0.2973 0.5840
          Average                 0.4833         0.2087       0.2267 0.5007




    It is also important to remark that the experiments applying feature selection by
means of the transition point $tp_T$ (runs 05 and 06) obtain acceptable results in terms
of sensitivity and F(R,S). We think that performing additional experiments under similar
circumstances, but using the “Main” version of the tweet collection, would allow
obtaining better results.
5 Conclusions and Future work
In this paper, we have described the experiments performed by the Language and
Reasoning group from UAM-C in the context of the RepLab 2013 evaluation exercise. Our
proposed system was designed to address the problem of filtering tweets (i.e.,
determining whether a tweet is related to a given entity name or not) as well as
classifying polarity for reputation, i.e., identifying positive or negative implications
contained in the tweet.
    Our proposed system is based on the use of DTRs as a form of representation for tweet
texts. This type of representation assumes that the meaning of a term is determined by
the context in which it occurs, where the context is given in terms of other terms in
the vocabulary. The obtained results showed that the DTR representation allows obtaining
a better performance in terms of the reliability measure, indicating to some extent that
this type of representation yields better precision values in both the filtering and
polarity subtasks.
    Additionally, we also observed that applying the transition point ($tp_T$) as a
feature selection strategy allowed our system to obtain good results in terms of the
sensitivity measure. We believe that this strategy might also be useful when employing
the “Main” version of the tweet collection.
    As future work we plan to develop a system that considers information contained on
the entity's web page, as well as the emoticons and hashtags contained in the tweet
texts. Additionally, we plan to evaluate other DTR representations, since the obtained
results motivate us to keep working in this direction.

References
1. Amigó, E., Corujo, A., Gonzalo, J., Meij, E., and Rijke, M. (2012) Overview of RepLab
   2012: Evaluating Online Reputation Management Systems. In Working Notes for the CLEF
   2012 Evaluation Labs and Workshop. Rome, Italy.
2. Porter, M. F. (1997) An algorithm for suffix stripping. Morgan Kaufmann Publishers Inc. pp.
   313-316.
3. Lavelli, A. and Sebastiani, F. and Zanoli, R. (2004) Distributional Term Representations: An
   Experimental Comparison. In Italian Workshop on Advanced Database Systems.
4. Cabrera, J. M., Escalante, H. J., Montes-y-Gómez, M. (2013) Distributional term representa-
   tions for short text categorization. In 14th International Conference on Intelligent Text
   Processing and Computational Linguistics, CICLing 2013. Samos, Greece.
5. Reyes-Aguirre, B., Moyotl-Hernández, E., and Jiménez-Salazar, H. (2003) Reducción de
   términos índice usando el punto de transición. In Avances en Ciencias de la Computación,
   pp. 127-130.
6. Leon Martagón, G., Villatoro-Tello, E., Jiménez-Salazar, H., and Sánchez-Sánchez, C. (2013)
   Análisis de Polaridad en Twitter. In Journal of Research in Computer Science. Vol. 62, pp.
   69-78.
7. Amigó, E. and Gonzalo, J. and Verdejo, F. (2013) A General Evaluation Measure for Docu-
   ment Organization Tasks. In Proceedings SIGIR 2013. Dublin, Ireland.
8. Amigó, E. and Carrillo de Albornoz, J. and Chugur, I. and Corujo, A. and Gonzalo, J. and
   Martín, T. and Meij, E. and de Rijke, M. and Spina, D. (2013) Overview of RepLab 2013:
   Evaluating Online Reputation Monitoring Systems. In Proceedings of the Fourth International
   Conference of the CLEF initiative, CLEF 2013. Springer LNCS, Valencia, Spain.