In search of reputation assessment: experiences with polarity classification in RepLab 2013

José Saias
Departamento de Informática, ECT
Universidade de Évora, Portugal
jsaias@uevora.pt

Abstract. The diue system uses a supervised Machine Learning approach for the polarity classification subtask of RepLab. We used the Python NLTK for preprocessing, including file parsing, text analysis and feature extraction. Our best solution is a mixed strategy, combining bag-of-words with a limited set of features based on sentiment lexicons and superficial text analysis. The system begins by applying tokenization and lemmatization. Each tweet's content is then analyzed and 18 features are obtained, related to the presence of polarized terms, negation before polarized expressions, and entity references. For the first run, learning and classification were performed with the Decision Tree algorithm from the NLTK framework. In the second run, we used a pipeline of classifiers. The first classifier applies Naive Bayes to a bag-of-words feature model built from the 1500 most frequent words in the training set. The second classifier uses the features from the first run plus one additional feature holding the result of the previous classifier. Our system's best run achieved 0.54694 Accuracy and 0.31506 F-measure.

1 Introduction

This article describes the participation of a group from the Department of Computer Science at the University of Évora in the RepLab track of the 2013 edition of the Cross Language Evaluation Forum (CLEF). RepLab1 is a competitive evaluation exercise for online reputation management systems, organized as a CLEF lab activity2. In this challenge, the reputation processing subtasks are:

1. tweet filtering: distinguish the tweets that are related to the entity from those that are not;
2. reputation polarity classification: detect whether a tweet has a positive, negative or neutral impact on the entity's reputation;
3. tweet clustering per entity-related topic;
4. priority detection.

1 http://www.limosine-project.eu/events/replab2013
2 http://clef2013.org/index.php?page=Pages/labs.html

Systems can participate in the full monitoring task, with the combined results of the four subtasks, or present partial solutions to the global task, providing results for one or more subtasks. In this first participation, we focused our attention on the polarity classification subtask, because it seems to be a key task in reputation analysis. We have recent work in the area of sentiment analysis in social media [1]. Polarity for reputation differs from standard sentiment analysis for two reasons. First, an objective text, without expressed sentiment, may still affect an entity's reputation. Second, the polarity of the expressed sentiment may sometimes be contrary to the resulting polarity for the reputation of the target entity. Given these differences, we designed the diue system with a supervised Machine Learning approach for classifying reputation polarity, as described in section 3. The following section presents some recent related work.

2 Related Work

In the previous edition of RepLab, about 10 systems participated in the polarity classification for reputation subtask. Most systems relied on a sentiment-polarity-based approach, adapted for the reputation task [2]. The DAEDALUS system [3] has a model with rules and annotated resources for sentiment analysis. It applies an aggregation algorithm to calculate the overall polarity value from the polarity values of individual text segments. Morphosyntactic analysis is performed to lemmatize, segment the text, and detect negation.
The approach of the FBM/Yahoo! system [4] relies on lexicon-based techniques and Support Vector Machine classifiers. The UNED system [5] adapts an existing emotional concept-based system for sentiment analysis to determine polarity for entity reputation. Its approach includes the detection of negation and intensifiers, in order to deal with the effect of subordinate sentences. The ILPS system [6] classifies the polarity of a tweet based on the observation of the reactions to that tweet, such as replies and retweets.

3 Our Experiments

The reputation processing is done on data from Twitter, in English or Spanish. Systems received a corpus of tweets in both languages, arranged in sets for each of the 61 entities [7]. Due to Twitter's terms of service, the provided corpus did not include the content of the tweets, only their identifiers, so each system had to retrieve the content itself.

Obtaining the tweets was a setback in our participation: the normal download API imposes a maximum number of requests per hour, which makes the retrieval very time consuming. Because of our naivety, we did not anticipate the difficulty of fetching all the tweets, and by the time we completed the process only 24 hours remained before the end of the official submission period. This left little room for studying the data.

For each entity, we were given its name, the domain to which it belongs, and the URLs of its homepage and Wikipedia entries, in English and in Spanish. Our system used neither the homepage contents nor the Wikipedia entries. Additional background tweets for each entity, and the external links mentioned in the tweets, were also provided to the participating systems, but we lacked the time to prepare that preprocessing step.

The diue system uses a supervised Machine Learning approach for the polarity classification subtask of RepLab. As mentioned in section 1, we recently developed a system [1] for Sentiment Analysis in Twitter. Despite the differences of polarity for reputation, the data structure and some of the initial processing applied to the tweet text are identical. So we decided to reuse part of the previous procedure, adding features related to the entity reference and its reputation implication.

For the initial entity file handling and parsing, for the text analysis and feature extraction, and also to manage the output format, we used Python and the Natural Language Toolkit (NLTK), a framework with resources and programming libraries suitable for linguistic processing [8, 9].

Tweet text processing started with tokenization, with splitting based on white space and punctuation. Lemmatization was then applied through the NLTK WordNet Lemmatizer. Here the treatment of the two languages began to differ: lemmatization helped only the tweets in English, because no similar functionality was applied for Spanish.

To help determine the polarity direction of some terms in the text, our system uses three sentiment lexicons for English terms, and another hand-built resource with 100 words in Spanish. AFINN [10] is a sentiment lexicon containing English words manually labeled by Finn Årup Nielsen between 2009 and 2011. Words are rated between minus five (negative) and plus five (positive). SentiWordNet [11] is a lexical resource for opinion mining that assigns sentiment scores to each synset of WordNet3. We apply a threshold, disregarding terms whose absolute score value is less than 0.3. By doing this, we look for sharper polarities, or greater confidence in the direction of polarity. The third English sentiment lexicon is derived from Bing Liu's work [12] on online customer reviews of products.

3 http://wordnet.princeton.edu/
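A minimal sketch of this preprocessing, assuming NLTK's wordpunct_tokenize, the WordNet lemmatizer and the SentiWordNet corpus reader, is shown below; the helper names and the averaging over synsets are illustrative choices, not necessarily the exact implementation used in our runs.

from nltk.tokenize import wordpunct_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn  # requires the wordnet and sentiwordnet corpora

lemmatizer = WordNetLemmatizer()

def preprocess(tweet_text):
    # split on white space or punctuation, then lemmatize (English only)
    tokens = wordpunct_tokenize(tweet_text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens]

def sentiwordnet_polarity(lemma, threshold=0.3):
    # aggregate positive/negative scores over the lemma's synsets and
    # discard weak polarities (absolute value below the threshold);
    # averaging over synsets is one possible aggregation, used here for illustration
    synsets = list(swn.senti_synsets(lemma))
    if not synsets:
        return 0.0
    score = sum(s.pos_score() - s.neg_score() for s in synsets) / len(synsets)
    return score if abs(score) >= threshold else 0.0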
After tokenization and lemmatization, each tweet's content is analyzed to extract the features to be used in machine learning. In the first run, we decided not to use a bag-of-words model. Instead, we chose a more restricted set of 18 features involving:

– presence of polarized term, using sentiment lexicons;
– negation before polarized expression;
– polarized term before entity reference;
– polarized term after entity reference;
– negation before entity reference;
– entity reference followed by negation and polarized term.

Each of the above represents a group of features. The presence of a polarized term is checked against all sentiment lexicons, generating a pair of boolean features for each lexicon, signaling the presence of an expression with negative polarity and the presence of a positive expression. The system also creates an overall sentiment value feature, determined by consulting all those lexicons and adding 1 or -1 for each polarized term in the tweet, according to the term's polarity. The features involving the entity reference try to capture differences that the learning algorithm can then associate with a positive or negative impact on reputation.

The learning and classification were performed with the Decision Tree algorithm from the NLTK framework. Each tweet in the training set is annotated with RELATED/UNRELATED (the tweet is/is not about the entity, for the filtering subtask) and POSITIVE/NEUTRAL/NEGATIVE to train the polarity classification. When training our model, the system discards tweets without the RELATED annotation, because these are of no interest for this subtask. In preliminary experiments, the accuracy returned by NLTK matched the result obtained with the evaluation script provided by the organization for use in the development phase. This accuracy was around 58%, so we generated the first run over the test data.

In the second run, we used a pipeline of classifiers, sketched below. The first classifier applies Naive Bayes to a bag-of-words feature model built from the 1500 most frequent words in the training set. The second classifier uses the features from the first run, plus one more feature holding the result of the former classifier. In this second run, some errors in the feature extraction were also corrected. This was the case of the overall sentiment value calculation, which sometimes needed to invert the polarity of a value when the source expression was affected by negation. A small lemmatization-related bug was also fixed.
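A minimal sketch of this two-stage pipeline, assuming NLTK's NaiveBayesClassifier and DecisionTreeClassifier (the second stage is assumed to reuse the Decision Tree from the first run) and a hypothetical extract_features function that builds the 18-feature dictionary described above:

from nltk import FreqDist
from nltk.classify import NaiveBayesClassifier, DecisionTreeClassifier

def bow_features(tokens, vocabulary):
    # boolean bag-of-words features over the chosen vocabulary
    present = set(tokens)
    return {'contains(%s)' % w: (w in present) for w in vocabulary}

def train_pipeline(train_tweets, extract_features):
    # train_tweets: list of (tokens, polarity) pairs with polarity in
    # {POSITIVE, NEUTRAL, NEGATIVE}; UNRELATED tweets were discarded earlier
    freq = FreqDist(t for tokens, _ in train_tweets for t in tokens)
    vocabulary = [w for w, _ in freq.most_common(1500)]

    # first stage: Naive Bayes over the 1500 most frequent words
    bow_clf = NaiveBayesClassifier.train(
        [(bow_features(tokens, vocabulary), label) for tokens, label in train_tweets])

    # second stage: the 18 hand-crafted features plus the first-stage prediction
    def stage2_features(tokens):
        feats = extract_features(tokens)
        feats['bow_prediction'] = bow_clf.classify(bow_features(tokens, vocabulary))
        return feats

    final_clf = DecisionTreeClassifier.train(
        [(stage2_features(tokens), label) for tokens, label in train_tweets])
    return lambda tokens: final_clf.classify(stage2_features(tokens))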
For the last run, a few terms were added to the Spanish sentiment lexicon, and the overall sentiment value feature was turned off in the first classifier's feature set.

At the end of the competition, systems were given extra time to finish ongoing experiments and also received an assessment of those later unofficial runs. Our second and third runs were submitted during this extra period. Despite the short time and the delays in downloading the tweets, our system still processed 0.995 of the tweets in the gold standard. The next section describes the evaluation metrics and the results for the submitted runs.

4 Results

The systems involved in the polarity for reputation classification task are evaluated according to Accuracy, Reliability and Sensitivity. The latter two measures were already used in RepLab 2012 and are described in [13]. Table 1 shows the results of evaluating the three runs of the diue system. The second column shows the Accuracy, the proportion of cases where the system guesses the right polarity class. The F column shows the balanced F-measure combining Reliability and Sensitivity. The values in these four columns are averages over all entities. The Pearson correlation, in the last column, is calculated between the average polarity of the entities according to the system and according to the gold standard. The last two runs are marked with * because they were submitted in the extra period, and thus were not considered official runs in the competition, despite being assessed. Accuracy is practically the same in all three cases, although slightly better in the second run, with 0.54694. Reliability is marginally higher in the first run. Sensitivity, F and Pearson correlation make clear the difference between the first run and the two runs using the classifier pipeline, with run 2 obtaining the best value in all three.

Run   Accuracy   Reliability   Sensitivity   F         Pearson correlation
1     0.54688    0.33303       0.21516       0.25467   0.21398
2*    0.54694    0.32923       0.31620       0.31506   0.64769
3*    0.54603    0.32734       0.31470       0.31343   0.64572

Table 1. Evaluation of polarity subtask results for the diue system

5 Discussion

If we looked only at the Accuracy values, we might say that the runs have equal results, with 54% accuracy. But the results of each run are substantially different, in particular between the first run and the other two. In the first run the system assigned the neutral polarity to 9804 tweets, while in the second run that number rose to 18586. Run 1 had about 13000 more positive tweets than runs 2 and 3. The classifier pipeline brought in the bag-of-words model to complement the previous model and to compensate for the scarcity of that feature set. This is noticeable in the evolution of the F and Pearson correlation values.

Let us now compare our modest results with the best systems in the competition [7] on the same subtask. Our system's best Accuracy value is 0.54694, while the best system achieved 0.68596 (and it appears to have had no problems downloading the tweets, having processed 100% of them). Considering the F-measure, the best official result was 0.38166 and the average was 0.22672. Our official run obtained 0.25467, and our second run reached 0.31506.

At first, we thought that the existence of two languages would be a bigger problem. The writing used in tweets is very informal and full of typos. Spanish tweets may also contain emoticons and even commonly used English expressions. Even so, some results could be achieved with the base system.

6 Conclusions

This was our first experience in the RepLab challenge. Our system is not yet ready for the full reputation monitoring task, so we dedicated our efforts to the polarity classification subtask. Our best solution is a mixed strategy, combining bag-of-words with a limited set of features based on sentiment lexicons and superficial text analysis.

If we repeated the process, we would start downloading the tweets earlier, in order to have time for experiments and analysis and to choose the most appropriate feature set for this kind of data and purpose. For future work, we highlight the importance of strengthening the language support resources for Spanish, including lemmatization and the sentiment lexicon.
In the bag-of-words model, we used only the 1500 most frequent words in the training set; perhaps we should increase the number of words/features. We consider NLTK very effective for text processing. For the future, however, we are considering another tool for machine learning, one supporting more classification algorithms in the same friendly way while allowing a greater degree of configuration.

Regardless of the results obtained by our system, we consider that the participation in this challenge was very positive, for its competitive spirit, the large-scale evaluation, and the sharing of new ideas in the treatment of reputation.

References

1. José Saias and Hilário Fernandes. senti.ue-en: an approach for informally written short texts in semeval-2013 sentiment analysis task. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 508–512, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics.
2. Enrique Amigó, Adolfo Corujo, Julio Gonzalo, Edgar Meij, and Maarten de Rijke. Overview of RepLab 2012: Evaluating online reputation management systems. In CLEF (Online Working Notes/Labs/Workshop), 2012.
3. Julio Villena-Román, Sara Lana-Serrano, Cristina Moreno, Janine García-Morera, and José Carlos González Cristóbal. DAEDALUS at RepLab 2012: Polarity classification and filtering on Twitter data. In Forner et al. [14].
4. Jose M. Chenlo, Jordi Atserias, Carlos Rodriguez, and Roi Blanco. FBM-Yahoo! at RepLab 2012. In Forner et al. [14].
5. Jorge Carrillo de Albornoz, Irina Chugur, and Enrique Amigó. Using an emotion-based model and sentiment analysis techniques to classify polarity for reputation. In Forner et al. [14].
6. Maria-Hendrike Peetz, Maarten de Rijke, and Anne Schuth. From sentiment to reputation. In Forner et al. [14].
7. Enrique Amigó, Jorge Carrillo de Albornoz, Irina Chugur, Adolfo Corujo, Julio Gonzalo, Tamara Martín, Edgar Meij, Maarten de Rijke, and Damiano Spina. Overview of RepLab 2013: Evaluating online reputation monitoring systems. In Fourth International Conference of the CLEF Initiative - CLEF 2013 Proceedings, Valencia, Spain. Springer LNCS, September 2013.
8. Edward Loper and Steven Bird. NLTK: the Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP '02, pages 63–70, USA, 2002. Association for Computational Linguistics.
9. Jacob Perkins. Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing, 2010.
10. Finn Årup Nielsen. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In 1st Workshop on Making Sense of Microposts (#MSM2011), pages 93–98, 2011.
11. Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).
12. Bing Liu. Opinion observer: Analyzing and comparing opinions on the web. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pages 342–351. ACM Press, 2005.
13. Enrique Amigó, Julio Gonzalo, and Felisa Verdejo. A general evaluation measure for document organization tasks. In Proceedings of SIGIR 2013, July 2013.
14. Pamela Forner, Jussi Karlgren, and Christa Womser-Hacker, editors. CLEF 2012 Evaluation Labs and Workshop, Online Working Notes, Rome, Italy, September 17-20, 2012.