<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UAMCLyR at RepLab 2013: Profiling Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Esaú Villatoro-Tello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Rodríguez-Lucatero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Sánchez-Sánchez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A. Pastor López-Monroy</string-name>
          <email>pastor@ccc.inaoep.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Tecnologías de la Información, Universidad Autónoma Metropolitana, Unidad Cuajimalpa, Ave. Vasco de Quiroga Num. 4871 Col Santa Fe</institution>
          ,
          <addr-line>México D.F</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Instituto Nacional de Astrofísica</institution>
          ,
          <addr-line>O</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>This paper describes the participation of the Language and Reasoning Group of UAM in the RepLab 2013 Profiling evaluation lab. We adopted Distributional Term Representations (DTRs) to address the following problems: i) filtering tweets that are related to an entity, and ii) identifying positive or negative implications for the entity's reputation, i.e., polarity for reputation. DTRs represent terms by means of contextual information given by term co-occurrence statistics, which helps to overcome, to some extent, the short-length and high-sparsity issues of tweets. To evaluate our approach, we compared it against the traditional Bag-of-Words representation. The results obtained indicate that DTRs make it possible to increase the reliability score of a profiling system.</p>
      </abstract>
      <kwd-group>
        <kwd>Bag of words</kwd>
        <kwd>Distributional term representations</kwd>
        <kwd>Term co-occurrence representation</kwd>
        <kwd>Term selection</kwd>
        <kwd>Supervised text classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Since its launch in 2006, Twitter has become one of the most important platforms
for microblog posts. Recent statistics reveal that there are more than 200 million users
who write more than 400 million posts every day, covering a great diversity of
topics. As a consequence, entities such as companies, celebrities, and politicians
are very interested in using this type of platform to increase or improve their
presence among Twitter users, aiming at obtaining good reputation values. As an
important effort to provide effective solutions to this problem, RepLab proposes
a competitive evaluation exercise for Online Reputation Management (ORM) systems.
One of the main tasks evaluated in RepLab is the Profiling task, which
consists of mining the reputation of a company from online media. Adequate profiling
systems must be able to retrieve posts from several online sources and
annotate them according to their relevance, i.e., to preserve online documents related to the
company and to identify all positive or negative implications for the company contained
in such documents [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        As mentioned in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], systems that face the profiling task must annotate two different
types of information: i) Filtering: an automatic system must be able
to decide whether a given tweet is related to a particular company or not. This
is a two-class problem, since systems must tag a tweet as “related” or “not
related”; and ii) Polarity for Reputation: the idea of this subtask is to
identify whether a given tweet contains positive or negative implications for the company’s
reputation. This is a three-class problem, since an automatic system
has to assign a “positive”, “negative” or “neutral” tag to each tweet related to a
particular company.
      </p>
      <p>
        Our proposed approach for facing both filtering and polarity problems is based on
distributional term representations (DTRs) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which represent terms by
means of contextual information given by term co-occurrence statistics. Accordingly,
this paper presents the details of the participation of the Language and Reasoning group
from UAM-C in the CLEF 2013 RepLab profiling task (i.e., filtering and polarity for
reputation). The main objectives of our experiments were:
1. To test whether a richer document representation based on term co-occurrences can be
successfully applied to the filtering and polarity subtasks.
2. To estimate how well our previously developed methods for sentiment analysis
on Twitter can be adapted to detect positive and negative implications of tweets
in the context of the RepLab exercise.
3. To evaluate to what extent supervised techniques are able to solve both the filtering and
polarity problems.
      </p>
      <p>The rest of this paper is organized as follows. The next section describes the
steps considered in the pre-processing stage. Section 3 describes the proposed
representation strategy. Section 4 describes the experimental setup we followed, as well as the
results obtained for both the filtering and polarity subtasks. Finally, Section 5 presents the
conclusions derived from this work and outlines future work directions.</p>
    </sec>
    <sec id="sec-2">
      <title>Tweets pre-processing</title>
      <p>For all our experiments we collected two different
versions of the collection of tweets, described below:
Main: For this configuration we crawled only the main tweet for each given tweet
id. In other words, all other tweets associated with the original tweet id (e.g., answers
or comments generated by the original tweet) are ignored.</p>
      <p>All: For this configuration, we crawled both the main tweet and all answers or
comments generated by the original tweet for each given tweet id.</p>
      <p>When retrieving the All version of the tweets collection, our intuition was to
evaluate the impact of all conversational elements of a tweet when deciding its
polarity as well as its relevance. Notice that this crawling procedure was replicated when
retrieving the test tweets.</p>
      <p>
        As pre-processing steps we applied the following procedures to each tweet in the
two versions of the tweets collection (i.e., Main and All):
1. All tweets are transformed to lowercase.
2. All user mentions (i.e., @user) are replaced by the tag AT-USER.
3. Every outgoing link is replaced by the tag OUTGOING-LINK. Hence, in the
reported experiments we did not use the information contained in these links,
although we believe it can be useful when trying to detect whether a tweet is related
to a company.
4. All hashtags (i.e., #hashtagX) are replaced by the tag HASHTAG.
5. All punctuation marks as well as emoticons are deleted.
6. We apply Porter stemming [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
7. All stopwords are deleted.
      </p>
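      <p>The pre-processing steps listed above can be sketched as follows. This is only an illustrative sketch, not the code used in the experiments: the regular expressions, the tiny stopword list and the function name preprocess are assumptions, Porter stemming (step 6) is omitted to keep the example free of external libraries, and only ASCII punctuation and ASCII emoticons are removed. The replacement tags are protected from the punctuation-removal step so that their hyphens survive.</p>
      <preformat>
```python
import re
import string

# Hypothetical, tiny stopword list; the paper does not say which list was used.
STOPWORDS = {"a", "an", "the", "is", "of", "to", "and", "in", "for"}
TAGS = {"AT-USER", "OUTGOING-LINK", "HASHTAG"}
# Translation table that deletes every ASCII punctuation character.
DROP_PUNCT = str.maketrans("", "", string.punctuation)


def preprocess(tweet):
    """Return the list of tokens that remain after steps 1-5 and 7."""
    text = tweet.lower()                                      # step 1
    text = re.sub(r"https?://\S+", " OUTGOING-LINK ", text)   # step 3 (before 2 and 4)
    text = re.sub(r"@\w+", " AT-USER ", text)                 # step 2
    text = re.sub(r"#\w+", " HASHTAG ", text)                 # step 4
    tokens = []
    for raw in text.split():
        if raw in TAGS:                                       # keep tags untouched
            tokens.append(raw)
            continue
        word = raw.translate(DROP_PUNCT)                      # step 5
        if word and word not in STOPWORDS:                    # step 7
            tokens.append(word)
    return tokens
```
      </preformat>
      <p>For instance, under these assumptions preprocess("@BMW The new i3 is great :) http://t.co/xyz #cars") yields the tokens AT-USER, new, i3, great, OUTGOING-LINK and HASHTAG. Links are replaced first so that characters such as # inside a URL are not mistaken for hashtags.</p>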
    </sec>
    <sec id="sec-3">
      <title>Tweets representation</title>
      <p>
        Distributional term representations (DTRs) are tools for term representation that rely
on term occurrence and co-occurrence statistics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Intuitively, the meaning of a term
is determined by the context in which it occurs, where the context is given by the
other terms in the vocabulary. In this paper we consider one popular DTR, namely the term
co-occurrence representation. This DTR has been used mainly in term classification and
term clustering tasks, and very recently for short-text categorization [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where its
potential benefits for term expansion are shown.
      </p>
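      <p>The term co-occurrence representation (TCOR) is based on co-occurrence statistics.
The underlying idea is that the semantics of a term <inline-formula><tex-math>t_j</tex-math></inline-formula> can be revealed by the other terms it
co-occurs with across the document collection. Here, each term <inline-formula><tex-math>t_j \in T</tex-math></inline-formula> is represented by a
vector of weights <inline-formula><tex-math>\mathbf{w}_j = \langle w_{1,j}, \ldots, w_{|T|,j} \rangle</tex-math></inline-formula>, where <inline-formula><tex-math>0 \leq w_{k,j} \leq 1</tex-math></inline-formula> represents the contribution
of term <inline-formula><tex-math>t_k</tex-math></inline-formula> to the semantic description of <inline-formula><tex-math>t_j</tex-math></inline-formula>:
<disp-formula><label>(1)</label><tex-math>w_{k,j} = \mathit{tff}(t_k, t_j) \cdot \log \frac{|T|}{T_k}</tex-math></disp-formula>
where <inline-formula><tex-math>T_k</tex-math></inline-formula> is the number of different terms in the dictionary <inline-formula><tex-math>T</tex-math></inline-formula> that co-occur with <inline-formula><tex-math>t_k</tex-math></inline-formula> in at
least one document and
<disp-formula><label>(2)</label><tex-math>\mathit{tff}(t_k, t_j) = \begin{cases} 1 + \log(\#(t_k, t_j)) &amp; \text{if } \#(t_k, t_j) &gt; 0 \\ 0 &amp; \text{otherwise} \end{cases}</tex-math></disp-formula>
where <inline-formula><tex-math>\#(t_k, t_j)</tex-math></inline-formula> denotes the number of documents in which term <inline-formula><tex-math>t_j</tex-math></inline-formula> co-occurs with the
term <inline-formula><tex-math>t_k</tex-math></inline-formula>. The intuition behind this weighting scheme is that the more <inline-formula><tex-math>t_k</tex-math></inline-formula> and <inline-formula><tex-math>t_j</tex-math></inline-formula> co-occur,
the more important <inline-formula><tex-math>t_k</tex-math></inline-formula> is for describing term <inline-formula><tex-math>t_j</tex-math></inline-formula>; the more terms co-occur with <inline-formula><tex-math>t_k</tex-math></inline-formula>, the less
important <inline-formula><tex-math>t_k</tex-math></inline-formula> is for defining the semantics of <inline-formula><tex-math>t_j</tex-math></inline-formula>. At the end, the vector of weights is normalized
to have unit 2-norm: <inline-formula><tex-math>\|\mathbf{w}_j\|_2 = 1</tex-math></inline-formula>.</p>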
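      <p>A minimal sketch of how the TCOR weights of Equations (1) and (2) could be computed is given next. It is an illustration under stated assumptions rather than the implementation used in the experiments: the function name tcor and the sparse dictionary layout are hypothetical, natural logarithms are used, a term is not counted as co-occurring with itself, and T_k is taken to be the number of distinct terms that co-occur with t_k.</p>
      <preformat>
```python
import math
from collections import defaultdict
from itertools import permutations


def tcor(documents):
    """Return (vocab, vectors); vectors[t_j][t_k] holds the normalized weight w_{k,j}."""
    vocab = sorted({term for doc in documents for term in doc})
    # cooc[t_j][t_k] = number of documents in which t_j and t_k co-occur
    cooc = defaultdict(lambda: defaultdict(int))
    for doc in documents:
        for t_j, t_k in permutations(set(doc), 2):
            cooc[t_j][t_k] += 1
    vectors = {}
    for t_j in vocab:
        weights = {}
        for t_k, n_docs in cooc[t_j].items():
            tff = 1.0 + math.log(n_docs)        # Eq. (2); n_docs is at least 1 here
            n_neighbours = len(cooc[t_k])       # T_k: distinct terms co-occurring with t_k
            weights[t_k] = tff * math.log(len(vocab) / n_neighbours)  # Eq. (1)
        # Normalize the vector to unit 2-norm (terms without neighbours get an empty vector).
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors[t_j] = {t: w / norm for t, w in weights.items()} if norm else {}
    return vocab, vectors
```
      </preformat>
      <p>Only non-zero weights are stored, so each term vector stays sparse even though its nominal dimensionality equals the vocabulary size.</p>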
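      <p>Finally, let <inline-formula><tex-math>\mathbf{w}_{t_j}</tex-math></inline-formula> denote the DTR of term <inline-formula><tex-math>t_j</tex-math></inline-formula> in the vocabulary, where <inline-formula><tex-math>\mathbf{w}_{t_j}</tex-math></inline-formula> is the TCOR
representation. The representation of a document <inline-formula><tex-math>d_i</tex-math></inline-formula> based on this DTR is obtained as
follows:
<disp-formula><label>(3)</label><tex-math>\mathbf{d}_i^{dtr} = \sum_{t_j \in d_i} \alpha_{t_j} \cdot \mathbf{w}_{t_j}</tex-math></disp-formula>
where <inline-formula><tex-math>\alpha_{t_j}</tex-math></inline-formula> is a scalar that weights the contribution of term <inline-formula><tex-math>t_j \in d_i</tex-math></inline-formula> to the document
representation. Thus, the representation of a document is given by the (weighted)
aggregation of the contextual representations of the terms appearing in the document. That is,
the document representation is a summary of the contextual information present in the
terms that appear in the document.</p>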
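      <p>The aggregation in Equation (3) can be illustrated with the short sketch below. It assumes that term vectors are stored as sparse dictionaries (the function name document_dtr and this layout are hypothetical) and covers only the Boolean and term-frequency choices for the scalar weight; a TF-IDF weight would additionally require collection statistics that are not modelled here.</p>
      <preformat>
```python
from collections import Counter, defaultdict


def document_dtr(tokens, term_vectors, weighting="tf"):
    """Weighted sum of the term vectors of the tokens that occur in a document."""
    # Terms without a known vector (out of vocabulary) are ignored.
    counts = Counter(t for t in tokens if t in term_vectors)
    doc_vec = defaultdict(float)
    for t_j, freq in counts.items():
        alpha = 1.0 if weighting == "bool" else float(freq)   # scalar weight of t_j
        for t_k, w in term_vectors[t_j].items():
            doc_vec[t_k] += alpha * w
    return dict(doc_vec)
```
      </preformat>
      <p>Note that a document can receive non-zero values for terms it does not contain, which is how this representation mitigates the sparsity of short texts.</p>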
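      <p>Under TCOR, a document <inline-formula><tex-math>d_i</tex-math></inline-formula> is represented by <inline-formula><tex-math>\mathbf{d}_i^{dtr} \in \mathbb{R}^{|T|}</tex-math></inline-formula>, a vector of the same
dimensionality as the vocabulary. The values of <inline-formula><tex-math>\mathbf{d}_i^{dtr}</tex-math></inline-formula> indicate the association between
terms in the vocabulary and those terms that occur in <inline-formula><tex-math>d_i</tex-math></inline-formula>. Notice that the scalar <inline-formula><tex-math>\alpha_{t_j}</tex-math></inline-formula> aims to
weight the importance that term <inline-formula><tex-math>t_j</tex-math></inline-formula> has for describing document <inline-formula><tex-math>d_i</tex-math></inline-formula>. Many options are
available for defining <inline-formula><tex-math>\alpha_{t_j}</tex-math></inline-formula>; in this work we considered the following weights: Boolean
(BOOL), Term-Frequency (TF), and Relative Frequency (TF-IDF).</p>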
      <p>
        Notice that this type of representation can lead to problems of high
dimensionality, since the number of terms (features) tends to be very large (<inline-formula><tex-math>|T| \to \infty</tex-math></inline-formula>). This
may lead to over-fitting when training a classifier. One technique that has
been used as a feature selection strategy is to preserve the terms near the
transition point <inline-formula><tex-math>pt_T</tex-math></inline-formula> [
        <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
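        ]. The <inline-formula><tex-math>pt_T</tex-math></inline-formula> is a frequency value that divides the vocabulary
terms <inline-formula><tex-math>T</tex-math></inline-formula> into two sets, those of low frequency and those of high frequency.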
      </p>
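      <p>The text does not state how the transition point is computed. The sketch below assumes a formula commonly cited in the transition-point literature, pt = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms that occur exactly once; this formula, the function names and the neighbourhood parameter width are assumptions made for illustration and may differ from what was used in the experiments.</p>
      <preformat>
```python
import math


def transition_point(term_freqs):
    """Assumed formula: pt = (sqrt(8 * I1 + 1) - 1) / 2, with I1 = number of terms of frequency 1."""
    n_hapax = sum(1 for freq in term_freqs.values() if freq == 1)
    return (math.sqrt(8 * n_hapax + 1) - 1) / 2


def select_near_transition_point(term_freqs, width=0.4):
    """Keep the terms whose frequency lies within +/- width * pt of the transition point."""
    pt = transition_point(term_freqs)
    low, high = pt * (1 - width), pt * (1 + width)
    return {term for term, freq in term_freqs.items() if high >= freq >= low}
```
      </preformat>
      <p>The width parameter (hypothetical) controls how many mid-frequency terms are preserved; a larger value keeps more features at the cost of higher dimensionality.</p>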
      <p>
        In a previous work [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we showed that by preserving high-frequency
terms in conjunction with a subset of low-frequency terms, it is possible to solve (to
some extent) the problem of assigning polarity values to Twitter posts, especially for
a three-class problem (i.e., positive, negative and neutral). Accordingly, we defined
a subset of experiments for the polarity subtask employing this strategy as the feature
selection technique.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>For the RepLab 2013 edition, participating teams were given a large dataset (61
entities) from four domains: automotive, banking, universities and music/artists. For the trial
dataset, approximately 700 tweets were provided for each entity. Unlike the
RepLab 2012 edition, the RepLab 2013 organizers provided as the test dataset tweets from the
same 61 entities that were used in the trial dataset. For these, approximately 1700 tweets
were crawled.</p>
      <p>
        Given this situation, i.e., the same entities for training and for testing, we decided to
adopt a supervised strategy for solving the filtering and polarity problems. We report
our results for the test dataset in terms of Reliability, Sensitivity and their harmonic
mean [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
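      <p>For reference, the harmonic mean of the two measures, written F(R,S) later in the text, can be computed as sketched below. The function name and the convention of returning zero when both scores are zero are assumptions made for this illustration; the official evaluation scripts may handle that case differently.</p>
      <preformat>
```python
def f_measure(reliability, sensitivity):
    """Harmonic mean of Reliability (R) and Sensitivity (S): 2 * R * S / (R + S)."""
    total = reliability + sensitivity
    if total == 0:
        return 0.0  # assumed convention for the degenerate case
    return 2 * reliability * sensitivity / total
```
      </preformat>
      <p>Because the harmonic mean is dominated by the smaller of the two values, a system cannot obtain a high F(R,S) by maximizing only one of the measures.</p>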
      <p>As mentioned in Section 1, our goal was to test whether employing
a richer document representation (see Section 3) makes it possible to solve both
subtasks involved in the profiling problem. Consequently, we defined as our baseline
method the traditional Bag-of-Words (BOW) representation. Finally, it is worth
mentioning that for all our experiments we used as our main classifier the Support
Vector Machine implementation of Weka (http://www.cs.waikato.ac.nz/ml/weka/index.html) with a linear kernel.</p>
      <sec id="sec-4-1">
        <title>Filtering results</title>
        <p>Notice that using a BOW representation with a Boolean weighting
scheme (run 01 and run 04) yields the highest accuracy values. This might
indicate that the mere presence of some words is enough to decide whether a
tweet is related to a company or not.</p>
        <p>Additionally, it is important to note that our DTR representation (run 03 and run
06) achieved better performance than the traditional BOW in terms of the
reliability measure without considerably decreasing accuracy. To some extent, these results
indicate better precision, which in a real scenario might be more
important than sensitivity.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Polarity for reputation results</title>
        <p>Notice that our best results in terms of reliability and accuracy were obtained by
using a TCOR representation with a TF-IDF weighting scheme on only the
Main version of the tweets (i.e., run 02). This is an interesting result, since it indicates
that the polarity of a tweet can be determined by considering the context in which the
tweet’s terms occur. In general, the DTR experiments (run 02, 04 and 06) obtain better
reliability performance.</p>
        <p>It is also important to remark that the experiments applying a feature
selection strategy based on the transition point <inline-formula><tex-math>pt_T</tex-math></inline-formula> (run 05 and 06) obtain acceptable results in
terms of sensitivity and F(R,S). We think that performing additional experiments under
similar circumstances but using the “Main” version of the tweets collection would
yield better results.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future work</title>
      <p>In this paper, we have described the experiments performed by the Language and
Reasoning group from UAM-C in the context of the RepLab 2013 evaluation exercise. Our
proposed system was designed for addressing the problem of filtering tweets (i.e.,
determining whether a tweet is related or not to a given entity name) as well as for
classifying polarity for reputation, i.e., identifying positive or negative implications contained
in the tweet.</p>
      <p>Our proposed system is based on the use of DTRs as the representation for
tweet texts. This type of representation assumes that the meaning of a term is
determined by the context in which it occurs, where the context is given by the other
terms in the vocabulary. The results obtained show that the DTR representation yields
better performance in terms of the reliability measure, indicating to some
extent that this type of representation allows better precision values in both the filtering and
polarity subtasks.</p>
      <p>Additionally, we observed that applying the transition point (<inline-formula><tex-math>pt_T</tex-math></inline-formula>) as a feature
selection strategy allowed our system to obtain good results in terms of the sensitivity
measure. We believe that this strategy might be useful when employing the “Main”
version of the tweets collection.</p>
      <p>As future work we plan to develop a system that considers the information contained in
the entity’s web page, as well as all the emoticons and hashtags contained
in tweet texts. Additionally, we plan to evaluate other DTR representations, since
the results obtained motivate us to keep working in this direction.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Amigó</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Corujo</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Gonzalo</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Meij</surname>, <given-names>E.</given-names></string-name>, and
          <string-name><surname>Rijke</surname>, <given-names>M.</given-names></string-name>
          (<year>2012</year>)
          <article-title>Overview of RepLab 2012: Evaluating Online Reputation Management Systems</article-title>.
          <source>In Working Notes for the CLEF 2012 Evaluation Labs and Workshop</source>. Rome, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>M. F.</given-names>
          </string-name>
          (
          <year>1997</year>
          )
          <article-title>An algorithm for suffix stripping</article-title>
          . Morgan Kaufmann Publishers Inc. pp.
          <fpage>313</fpage>
          -
          <lpage>316</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lavelli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zanoli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (<year>2004</year>)
          <article-title>Distributional Term Representations: An Experimental Comparison</article-title>
          .
          <source>In Italian Workshop on Advanced Database Systems.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cabrera</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>Distributional term representations for short text categorization</article-title>
          .
          <source>In 14th International Conference on Intelligent Text Processing and Computational Linguistics</source>
          , CICLing 2013. Samos, Greece.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Reyes-Aguirre</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Moyotl-Hernández</surname>, <given-names>E.</given-names></string-name>, and
          <string-name><surname>Jiménez-Salazar</surname>, <given-names>H.</given-names></string-name>
          (<year>2003</year>)
          <article-title>Reducción de términos índice usando el punto de transición</article-title>.
          <source>In Avances en Ciencias de la Computación</source>. pp.
          <fpage>127</fpage>-<lpage>130</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Leon Martagón</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Villatoro-Tello</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Jiménez-Salazar</surname>, <given-names>H.</given-names></string-name>, and
          <string-name><surname>Sánchez-Sánchez</surname>, <given-names>C.</given-names></string-name>
          (<year>2013</year>)
          <article-title>Análisis de Polaridad en Twitter</article-title>.
          <source>In Journal of Research in Computer Science</source>. Vol.
          <volume>62</volume>, pp.
          <fpage>69</fpage>-<lpage>78</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Amigó</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Gonzalo</surname>, <given-names>J.</given-names></string-name>, and
          <string-name><surname>Verdejo</surname>, <given-names>F.</given-names></string-name>
          (<year>2013</year>)
          <article-title>A General Evaluation Measure for Document Organization Tasks</article-title>.
          <source>In Proceedings SIGIR 2013</source>. Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Amigó</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Carrillo de Albornoz</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Chugur</surname>, <given-names>I.</given-names></string-name>,
          <string-name><surname>Corujo</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Gonzalo</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Martín</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Meij</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>de Rijke</surname>, <given-names>M.</given-names></string-name>, and
          <string-name><surname>Spina</surname>, <given-names>D.</given-names></string-name>
          (<year>2013</year>)
          <article-title>Overview of RepLab 2013: Evaluating Online Reputation Monitoring Systems</article-title>.
          <source>In Proceedings of the Fourth International Conference of the CLEF initiative, CLEF 2013</source>. Springer LNCS, Valencia, Spain.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>