Using Feature Selection Metrics for Polarity Analysis in RepLab 2012

Hogyeong Jeong and Hyunjong Lee
Seoul, Republic of Korea
{hogyeong.jeong,hyunjong.lee.s}@gmail.com

Abstract. In this working notes paper for RepLab 2012, we describe our method of using feature selection metrics for polarity analysis. We use the correlation coefficient, a one-sided metric, to assign polarity scores to the relevant words within a tweet; we then aggregate these scores to determine the polarity of the tweet. Our results show a reasonable level of performance compared to other methods.

Keywords: Correlation coefficient, feature selection, polarity analysis

1 Introduction

This paper describes a correlation-coefficient-based procedure for determining the polarity of a tweet. Correlation coefficients have been used successfully for text categorization, as well as for sentiment analysis [1][2][3]. In this paper, we describe how the correlation coefficient can be used to perform polarity analysis, and we compare our results with those of the other participants in RepLab 2012 [4].

1.1 Tasks Performed

Among the many tasks offered at RepLab 2012, we concentrated on the polarity analysis portion of the profiling task. Also, while there were tweets in both English and Spanish, we chose to focus on only the English tweets.

1.2 Main Objectives of Experiments

The main objective of the experiment was to determine the polarity (positive, neutral, or negative) of a single tweet. Irrelevant tweets were excluded from the polarity analysis. Although this task is closely related to sentiment analysis [1], it is somewhat different in that it focuses on reputation instead of sentiment.

1.3 Related Work

Our method uses some of the feature selection metrics described in [2] to perform polarity analysis. In particular, we use the one-sided correlation coefficient, as there are three polarity classes to consider: positive, neutral, and negative. Although the details of its use vary, the correlation coefficient has been applied to the closely related task of sentiment analysis [3].

2 Approaches Used

We submitted four runs for the task: one using the basic method, two using the basic method with modified thresholds, and a fourth that incorporated human input for borderline cases.

2.1 Preprocessing

First, we preprocessed the tweets to extract the relevant keywords that we would use to determine polarity. To do this, we ran the Stanford part-of-speech tagger and extracted the nouns and adjectives [5].

2.2 Correlation Coefficients

As part of the RepLab 2012 task, we were given a small set of labeled files that we could use for training. Given the terms extracted in the preprocessing phase and the three polarity categories (positive, neutral, and negative), we calculated the correlation coefficient for a term t on class c_i as

    CC(t, c_i) = \frac{\sqrt{N}\,[P(t, c_i)\,P(\bar{t}, \bar{c}_i) - P(t, \bar{c}_i)\,P(\bar{t}, c_i)]}{\sqrt{P(t)\,P(\bar{t})\,P(c_i)\,P(\bar{c}_i)}}    (1)

where \bar{t} denotes the other terms and \bar{c}_i denotes the other classes.

2.3 Basic Method

Using the correlation coefficients calculated above (for each term and each polarity class), we sum, for each class, the coefficients of all the relevant terms within a tweet. After this summation is done for each polarity class (positive, negative, and neutral), we assign the tweet the polarity of the class with the largest sum.
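To make this concrete, the sketch below shows one way the two steps could be implemented: estimating Eq. (1) from per-term document frequencies in the labeled training tweets, and classifying a tweet by the largest per-class sum. This is an illustrative Python reconstruction, not our submitted code; the function names, the representation of tweets as lists of extracted terms, and the use of raw document frequencies as probability estimates are all our own assumptions here.

    import math
    from collections import Counter, defaultdict

    CLASSES = ["positive", "neutral", "negative"]

    def correlation_coefficients(tweets, labels):
        """Estimate CC(t, c_i) of Eq. (1) from term/class co-occurrence counts.

        tweets: list of term lists (nouns and adjectives from preprocessing)
        labels: gold polarity class of each training tweet
        """
        n = len(tweets)
        df = Counter()                   # number of tweets containing term t
        df_class = defaultdict(Counter)  # the same counts, split by class
        class_count = Counter(labels)
        for terms, label in zip(tweets, labels):
            for t in set(terms):
                df[t] += 1
                df_class[label][t] += 1

        cc = defaultdict(dict)
        for t in df:
            p_t = df[t] / n
            for c in CLASSES:
                p_c = class_count[c] / n
                p_tc = df_class[c][t] / n             # P(t, c_i)
                p_t_notc = p_t - p_tc                 # P(t, not c_i)
                p_nott_c = p_c - p_tc                 # P(not t, c_i)
                p_nott_notc = 1.0 - p_t - p_c + p_tc  # P(not t, not c_i)
                denom = math.sqrt(p_t * (1 - p_t) * p_c * (1 - p_c))
                if denom == 0.0:
                    cc[t][c] = 0.0  # degenerate term or empty class
                    continue
                num = math.sqrt(n) * (p_tc * p_nott_notc - p_t_notc * p_nott_c)
                cc[t][c] = num / denom
        return cc

    def classify(terms, cc):
        """Basic method: sum each class's CCs over the tweet's terms, pick the max."""
        scores = {c: sum(cc[t][c] for t in terms if t in cc) for c in CLASSES}
        return max(scores, key=scores.get), scores

For example, classify(["great", "service"], cc) would return the arg-max polarity class together with the three per-class sums; the sums are reused below when we estimate confidence.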
While this represents the most basic usage of the correlation coefficients, we can modify the thresholds somewhat to try to achieve better performance.

2.4 Modified Threshold 1 - Same Class Proportions as the Training Set

One approach is to set the thresholds so that the resulting class proportions on the test set match the class proportions on the training set. This method usually works best when the training set is larger than the test set and the test set is similar to the training set. Unfortunately for RepLab 2012, the test set was much larger than the training set, and it was also quite different from the training set [4].

2.5 Modified Threshold 2 - Best Performance on the Training Set

Another approach is to set the thresholds to those that achieved the best performance on the training set (via 5-fold cross-validation within the training set). Again, we would expect better performance if the training set were large and similar to the test set - neither of which was the case for RepLab 2012.

2.6 Basic Method with Human Input

One useful property of the correlation coefficient approach is that it yields a measure of confidence in our classifications. For example, our confidence that a tweet is positive is much higher when the class sums are (positive=3.8, neutral=0.2, negative=-2.5) than when they are (positive=0.3, neutral=0.2, negative=-0.7). We can take advantage of this additional information by introducing human input for the borderline cases where the difference between the top two classes is small (we used 0.5 as the threshold). These are the cases where the automatically generated classifications would have the highest risk of being incorrect.
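For illustration, the borderline test can be sketched as follows. This is again a hypothetical Python reconstruction: classify refers to the earlier sketch, ask_annotator stands in for whatever manual-review step is available, and only the 0.5 margin comes from our submitted run.

    def needs_human_input(scores, margin=0.5):
        """True when the gap between the two largest class sums is below margin."""
        top, second = sorted(scores.values(), reverse=True)[:2]
        return (top - second) < margin

    # Routing sketch: keep the automatic label only for confident cases.
    label, scores = classify(terms, cc)    # terms and cc as in the earlier sketch
    if needs_human_input(scores):
        label = ask_annotator(tweet)       # hypothetical manual-review hook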
3 Results

The evaluation of the classifications was performed using reliability, sensitivity, and the corresponding F measure, which are modified recall and precision measures [6]. There were a total of 42 runs submitted for the task, of which 3 served as baselines. Evaluations were done separately for the English and the Spanish tweets; the results below are for English [4]:

    Method (rank of 42)                  Reliability  Sensitivity  F(R,S)
    Best (ranked 1st)                    .369         .350         .348
    Human Input (ranked 10th)            .364         .275         .285
    Basic (ranked 15th)                  .265         .280         .260
    Modified Threshold 1 (ranked 26th)   .230         .194         .198
    Modified Threshold 2 (ranked 28th)   .241         .184         .194

As we feared, modifying the thresholds resulted in much worse performance, because the training set was small and unlike the test set. Meanwhile, the basic correlation-coefficient-based method performed reasonably well, ranking 15th of the 42 submitted runs. As expected, our run integrating human input on just the borderline cases led to a marked improvement compared to our other methods.

4 Conclusion and Future Directions

We were able to achieve reasonable results using a relatively simple approach based on correlation coefficients. Further, we showed that we can markedly improve our results by incorporating human input on the cases deemed borderline by the correlation coefficients. As a future direction, we could exploit the massive background data that we did not use for our current results. Because the training set in this case was so small, we can expect better results if we exploit the background data to help expand our training set. Once the training set has been expanded in this way, we may also see better results from the modified threshold approaches, which were unable to perform well with a small training set.

References

1. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2) (2008) 1-135
2. Zheng, Z.: Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter 6(1) (2004)
3. Marchetti-Bowick, M., Chambers, N.: Learning for microblogs with distant supervision: Political forecasting with Twitter. In: EACL. (2012) 603-612
4. Amigó, E., Corujo, A., Gonzalo, J., Meij, E., de Rijke, M.: Overview of RepLab 2012: Evaluating online reputation management systems. In: CLEF 2012 Labs and Workshop Notebook Papers. (2012)
5. Toutanova, K., Manning, C.D.: Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In: EMNLP/VLC 2000. (2000) 63-70
6. Gonzalo, J., Peters, C.: The impact of evaluation on multilingual text retrieval. In: SIGIR. (2005) 603-604