New Features for Sentiment Analysis:
                Do Sentences Matter?

     Gizem Gezici1 , Berrin Yanikoglu1 , Dilek Tapucu1,2 , and Yücel Saygın1
1
    Faculty of Engineering and Natural Sciences, Sabancı University, Istanbul, Turkey
           {gizemgezici,berrin,dilektapucu,ysaygin}@sabanciuniv.edu
    2
      Dept. of Computer Engineering, Izmir Institute of Technology, Izmir, Turkey


        Abstract. In this work, we propose and evaluate new features to be
        used in a word polarity based approach to sentiment classification. In
        particular, we analyze sentences as the first step before estimating the
        overall review polarity. We consider different aspects of sentences, such
        as length, purity, irrealis content, subjectivity, and position within the
        opinionated text. This analysis is then used to find sentences that may
        convey better information about the overall review polarity. The TripAd-
        visor dataset is used to evaluate the effect of sentence level features on
        polarity classification. Our initial results indicate a small improvement
        in classification accuracy when using the newly proposed features. How-
        ever, the benefit of these features is not limited to improving sentiment
        classification accuracy since sentence level features can be used for other
        important tasks such as review summarization.

        Keywords: sentiment analysis; sentiment classification; polarity detec-
        tion; machine learning


1     Introduction

Sentiment analysis aims to extract the opinions indicated in textual data en-
abling us to understand what people think about specific issues by analyzing
large collections of textual data sources such as personal blogs, review sites, and
social media. An important part of sentiment analysis boils down to a classifica-
tion problem, i.e., given an opinionated text, classifying it as positive or negative
polarity and Machine Learning techniques have already been adopted to solve
this problem.
    Two main approaches for sentiment analysis are lexicon-based and super-
vised methods. The lexicon-based approach calculates the semantic orientation
of words in a review by obtaining word polarities from a lexicon such as the Sen-
tiWordNet [5]. While the SentiWordNet [5] is a domain-independent lexicon, one
can use a domain-specific lexicon whenever available since domain-specific lex-
icons better indicate the word polarities in that domain (e.g. the word ”small”
has a positive connotation in cell phone domain; while it is negative in hotel
domain).


                                           5
2       Gizem Gezici, Berrin Yanikoglu, Dilek Tapucu, and Yücel Saygın

    Supervised learning approaches use machine learning techniques to establish
a model from a large corpus of reviews. The set of sample reviews form the
training data from which the model is built. For instance in [16] [21] , researchers
use the Naive Bayes algorithm to separate positive reviews from negative ones
by learning the probability distributions of the considered features in the two
classes. While supervised approaches are typically more successful, collecting a
large training data is often a problem.

   Word-level polarities provide a simple yet effective method for estimating
a review’s polarity, however, the gap from word-level polarities to review-level
polarity is too big. To bridge this gap, we propose to analyze word-polarities
within sentences, as an intermediate step.

    The idea of sentence level analysis is not new. Some researchers approached
the problem by first finding subjective sentences in a review, with the hope of
eliminating irrelevant sentences that would generate noise in terms of polarity
estimation [13], [24]. Yet another approach is to exploit the structure in sen-
tences, rather than seeing a review as a bag of words [8][11][15]. For instance
in [8], conjunctions were analyzed to obtain the polarities of the words that are
connected with the conjunct. In [9],[14] researchers focused on sentence polari-
ties separately, again to obtain sentence polarities more correctly, with the goal
of improving review polarity in turn. The first line polarity has also been used
as a feature by [24].

    Similar to [24], this work is motivated by our observation that the first and
last lines of a review are often very indicative of the review polarity. Starting from
this simple observation, we formulated more sophisticated features for sentence
level sentiment analysis. In order to do that, we performed an in-depth analysis
of different sentence types. For instance, in addition to subjective sentences, we
defined pure, short, and no irrealis sentences.

    We performed a preliminary evaluation using the TripAdvisor dataset to see
the effect of sentence level features on polarity classification. Throughout the
evaluation, we observed a small improvement in classification accuracy due to
the newly proposed features. Our initial results showed that the sentences do
matter and they need to be explored in larger and more diverse datasets such
as blogs. Moreover, the benefit of these features is not limited to improving
sentiment classification accuracy. In fact, sentence level features can be used
to identify the essential sentences in the review which could further be used in
review summarization.

   Our paper is organized as follows: Section 2 presents our taxonomy of sen-
timent analysis features, together with the newly proposed features. Section 3
describes the sentence level analysis for defining the features. Section 4 describes
the tools and methodology for sentiment classification together with the experi-
mental results and error analysis. Finally, in Section 5 we draw some conclusions
and propose future extension of this work.


                                         6
                   Lecture Notes in Computer Science: Authors’ Instructions        3

2   Taxonomy and Formulation of the New Features

We define an extensive set of 19 features that can be grouped in four categories:
(1) basic features, (2) features based on subjective sentence occurrence statistics,
(3) delta-tf-idf weighting of word polarities, and (4) sentence-level features. These
features are listed in Table 1 and using the notations given below and some basic
definitions provided in Table 2, they are defined formally in Tables 3-7.


              Table 1. Summary Feature Descriptions for a Review R

       Group Name    Feature Name
                     F1      Average review polarity
         Basic       F2      Review purity
                     F3      Freq. of subjective words
     Occurrence of F4        Avg. polarity of subj. words
    Subjective Words F5      Std. of polarities of subj. words
      ∆T F ∗ IDF     F6      Weighted avg. polarity of subj. words
                     F7      Scores of subj. words
                     F8      # of Exclamation marks
      Punctuation F9         # of Question marks
                     F10     Avg. First Line Polarity
                     F11     Avg. Last Line Polarity
                     F12     First Line Purity
                     F13     Last Line Purity
     Sentence Level F14      Avg. pol. of subj. sentences
                     F15     Avg. pol. of pure sentences
                     F16     Avg. pol. of non-irrealis sentences
                     F17     ∆T F ∗ IDF weighted polarity of first line
                     F18     ∆T F ∗ IDF scores of subj. words in the first line
                     F19     Number of sentences in review


    A review R is a sequence of sentences R = S1 S2 S3 ...SM where M is the
number of sentences in R. Each sentence Si in turn is a sequence of words, such
that Si = wi1 wi2 ...wiN (i) where N (i) is the number of words in Si . The review R
can also be viewed as a sequence of words w1 ..wT , where T is the total number
of words in the review.
    In Table 2, subjective words (SBJ) are defined as all the words in SentiWord-
Net that has a dominant negative or positive polarity. A word has dominant pos-
itive and negative polarity if the sum of its positive and negative polarity values
is greater than 0.5 [23]. SubjW (R) is defined as the most frequent subjective
words in SBJ (at most 20 of them) that appear in review R. For a sentence
Si ∈ R, the average sentence polarity is used to determine subjectivity of that
sentence. If it is above a threshold, we consider the sentence as subjective, form-
ing subjS(R). Similarly, a sentence Si is pure if its purity is greater than a fixed
threshold τ . We experimented with different values of τ and for evaluation we
used τ = 0.8. These two sets form the subS(R) and pure(R) sets respectively.


                                        7
4       Gizem Gezici, Berrin Yanikoglu, Dilek Tapucu, and Yücel Saygın

We also looked at the effect of first and last sentences in the review, as well as
sentences containing irrealis words. In order to determine irrealis sentences, the
existence of the modal verbs ’would’, ’could’, or ’should’ is checked. If one of
these modal verbs appear in the sentence then these sentences are labeled as
irrealis similar to [17].

                       Table 2. Basic definitions for a review R

      M         the total number of sentences in R
      T         the total number of words in R
      SBJ       set of known subjective words
      subjW (R) set of most frequent subjective words from SBJ, in R (max 20)
      subjS(R) set of subjective sentences in R
      pure(R) set of pure sentences in R
      nonIr(R) set of non-irrealis sentences in R


2.1    Basic Features
For our baseline system, we use the average word polarity and purity defined in
Table 3. As mentioned before, these features are commonly used in word polar-
ity based sentiment analysis. In our formulation pol(wj ) denotes the dominant
polarity of wj of R, as obtained from SentiWordNet, and |pol(wj )| denotes the
absolute polarity of wj .


                        Table 3. Basic Features for a review R

                    F1 Average review polarity T1
                                                    P
                                                     j=1..T pol(wj )
                                                 P
                                                          pol(wj )
                    F2 Review purity             P j=1..T
                                                  j=1..T |pol(wj )|


2.2    Frequent Subjective Words
The features in this group are derived through the analysis of subjective words
that frequently occur in the review. For instance, the average polarity of the most
frequent subjective words (feature F4 ) aims to capture the frequent sentiment
in the review, without the noise coming from all subjective words.
    The features were defined before in some previous work [4]; however, to the
best of our knowledge, they considered all words, not specifically subjective
words.

2.3    ∆tf*idf Features
We compute the ∆tf ∗idf scores of the words in SentiWordNet [5] from a training
corpus in the given domain, in order to capture domain specificity [12]. For a
word wi , ∆tf ∗idf (wi ) is defined as ∆tf ∗idf (wi ) = tf ∗idf (wi , +)−tf ∗idf (wi , −).


                                           8
                   Lecture Notes in Computer Science: Authors’ Instructions                            5

             Table 4. Features Related to Frequency and Subjectivity


  F3 Freq. of subjective words             |subjW (R)|/|R|
                                                1
                                                    P
  F4 Avg. polarity of subj. words          |subjW (R)|        wj ∈subjW (R) pol(wj )
                                           q
                                                     1
                                                              P                                    2
  F5 Stdev. of polarities of subj. words        |subjW (R)|       wj ∈subjW (R) (pol(wj ) − F4 )


    If it is positive, it indicates that a word is more associated with the positive
class and vice versa, if negative. We computed these scores on the training set
which is balanced in the number of positive and negative reviews.
    Then, we sum up the ∆tf ∗ idf scores of these words (feature F6 ). By do-
ing this, our goal is to capture the difference in distribution of these words,
among positive and negative reviews. The aim is to obtain context-dependent
scores that may replace the polarities coming from SentiWordNet which is a
context-independent lexicon [5]. With the help of context-dependent informa-
tion provided by ∆tf ∗ idf related features, we expect to better differentiate the
positive reviews from negative ones.
    We also tried another feature by combining the two information, where we
weighted the polarities of all words in the review by their ∆tf ∗idf scores (feature
F7 ).


                              Table 5. ∆tf*idf Features

                                             1
                                                 P
        F6 ∆tf ∗ idf scores of all words     T       j=1..T ∆tf ∗ idf (wj )

                                               1
                                                 P
        F7 Weight. avg. pol. of all words T          j=1..T ∆tf ∗ idf (wj ) × pol(wj )


2.4   Punctuation Features

We have two features related to punctuation. These two features were suggested
in [4] and since we have seen that they could be useful for some cases we included
them in our sentiment classification system.


                           Table 6. Punctuation Features

                  F8 Number of exclamation marks in the review
                  F9 Number of question marks in the review


                                            9
6        Gizem Gezici, Berrin Yanikoglu, Dilek Tapucu, and Yücel Saygın

2.5    Sentence Level Features

Sentence level features are extracted from some specific types of sentences that
are identified through a sentence level analysis of the corpus. For instance the
first and last lines polarity/purity are features that depend on sentence position;
while average polarity of words in subjective/pure etc. sentences are new features
that consider only subjective or pure sentences respectively.


                   Table 7. Sentence-Level Features for a review R
                                              1
                                                   P
      F10 Avg. First Line Polarity           N (1) P j=1..N (1) pol(w1j )
                                                1
      F11 Avg. Last Line Polarity            N (M )   j=1..N (M ) pol(wM j )
                                                 P
                                                              pol(w1j )
      F12 First Line Purity                     P j=1..N (1)
                                                             |pol(w1j )|
                                                 Pj=1..N (1)
                                                               pol(wM j )
      F13 Last Line Purity                      P j=1..N (M )
                                                  j=1..N (M ) |pol(wM j )|
                                                    1
                                                          P
      F14 Avg. pol. of subj. sentences                                     pol(wj )
                                                |subj(R)|
                                                    1
                                                           Pwj ∈subjW (R)
      F15 Avg. pol. of pure sentences           |pure(R)|     wj ∈pure(R) pol(wj )
                                                     1
                                                             P
      F16 Avg. pol. of non-irrealis sentences   |nonIr(R)|      wj ∈nonIr(R) pol(wj )
                                               P
      F17 ∆tf*idf weighted polarity of 1st line j=1..T ∆tf ∗ idf (w1j ) × pol(w1j )
                                               P
      F18 ∆tf*idf Scores of 1st line             j=1..T ∆tf ∗ idf (wj )
      F19 Number of sentences in review        M


3     Sentence Level Analysis for Review Polarity Detection

We tried three different approaches in obtaining the review polarity. In the first
approach, each review is pruned to keep only the sentences that are possibly
more useful for sentiment analysis. For pruning, thresholds were set separately
for each sentence level feature. Sentences with length of at most 12 words are
accepted as short and sentences with absolute purity of at least 0.8 are defined
as pure sentences. For subjectivity of the sentences, we adopted the same idea
that was mentioned in [23] and applied it on not words, but sentences in this
case.
    Pruning sentences in this way resulted in lower accuracy in general, due
to loss of information. Thus, in the second approach, the polarities in special
sentences (pure, subjective, short or no irrealis) were given higher weights while
computing the average word polarity. In effect, other sentences were given lower
weight, rather than the more severe pruning.
    In the final approach that gave the best results, we used the information
extracted from sentence level analysis as features used for training our system.
    We believe that our main contribution is the introduction and evaluation of
sentence-level features; yet other than these, some well-known and commonly
used features are integrated to our system, as explained in the next section.


                                          10
                  Lecture Notes in Computer Science: Authors’ Instructions       7

    Our approach depends on the existence of a sentiment lexicon that provide
information about the semantic orientation of single or multiple terms. Specif-
ically, we use the SentiWordNet [5] where for each term at a specific function,
its positive, negative or neutral appraisal strength is indicated (e.g. ”good,ADJ,
0.5)


4     Implementation and Experimental Evaluation
In this section, we provide an evaluation of the sentiment analysis features based
on word polarities. We use the dominant polarity for each word (the largest po-
larity among negative, objective or positive categories) obtained from sentiWord-
Net. We evaluate the newly proposed features and compare their performance to
a baseline system. Our baseline system uses two basic features which are the av-
erage polarity and purity of the review. These features are previously suggested
in [1] and [22] widely used in word polarity-based sentiment analysis. They are
defined in Table 3 for completeness. The evaluation procedure we used in our
experiments is described in the following subsections.

4.1   Dataset
We evaluated the performance of our system on a sentimental dataset, TripAd-
visor that was introduced by [18] and, [19] respectively. The TripAdvisor corpus
consists of around 250.000 customer-supplied reviews of 1850 hotels. Each re-
view is associated with a hotel and a star-rating, 1-star (most negative) to 5-star
(most positive), chosen by the customer to indicate his evaluation.
    We evaluated the performance of our approach on a randomly chosen dataset
from TripAdvisor corpus. Our dataset consists of 3000 positive and 3000 negative
reviews. After we have chosen 6000 reviews randomly, these reviews were shuffled
and split into three groups as train, validation and test sets. Each of these
datasets have 1000 positive and 1000 negative reviews.
    We computed our features and gave labels to our instances (reviews) accord-
ing to the customer-given ratings of reviews. If the rating of a review is bigger
than 2 then it is labeled as positive, and otherwise as negative. These interme-
diate files were generated with a Java code on Eclipse and given to WEKA [20]
for binary classification.

4.2   Sentiment Classification
Initially, we tried several classifiers that are known to work well for classifica-
tion purposes. Then, according to their performances we decided to use Support
Vector Machines (SVM) and Logistic regression. SVMs are known for being able
to handle large feature spaces while simultaneously limiting overfitting, while
Logistic Regression is a simple, and commonly used, well-performing classifier.
The SVM is trained using a radial basis function kernel as provided by Lib-
SVM [3]. For LibSVM, RBF kernel worked better in comparison to other kernels


                                       11
8       Gizem Gezici, Berrin Yanikoglu, Dilek Tapucu, and Yücel Saygın

on our dataset. Afterwards, we performed grid-search on validation dataset for
parameter optimization.

4.3    Experimental Results
In order to evaluate our sentiment classification system, we used binary classifi-
cation with two classifiers, namely SVMs and Logistic Regression. The reviews
with star rating bigger than 2 are positive reviews and the rest are negative
reviews in our case, since we focused on binary classification of reviews. Apart
from this, we also looked at the importance of the features. The importance of
the features will be stated with the feature ranking property of WEKA [20] as
well as the gradual accuracy increase, as we add a new feature to the existing
subset of features.
    For these results, we used grid search on validation set. Then, by these opti-
mum parameters, we trained our system on training set and tested it on testing
set.

          Table 8. The Effects of Feature Subsets on TripAdvisor Dataset

      Feature Subset                                      Accuracy Accuracy
                                                          (SVM) (Logistic)
      Basic (F1,F2)                                       79.20% 79.35%
      Basic (F1,F2) + ∆T F ∗ IDF (F6,F7)                  80.50% 80.30%
      Basic (F1,F2) + ∆T F ∗ IDF (F6,F7) + ...
          Freq. of Subj. Words (F3)                       80.80%   80.05%
      Basic (F1,F2) + ∆T F ∗ IDF (F6,F7) + ...
          Freq. of Subj. Words (F3) + Punctuation (F8,F9) 80.20%   79.90%
      Basic (F1,F2) + ∆T F ∗ IDF (F6,F7) + ...
          Occur. of Subj. Words (F3-F5)                   80.15%   79.00%
      All Features (F1-F19)                               80.85%   81.45%


Table 9. Comparative Performance of Sentiment Classification System on TripAdvisor
Dataset

                   Previous Work        Dataset F-measure Error Rate
                Gindl et al (2010) [6]    1800    0.79         -
               Bespalov et al (2011) [2] 96000      -        7.37
                Peter et al (2011) [10] 103000    0.82         -
               Grabner et al (2012) [7] 1000      0.61         -
                 Our System (2012)        6000    0.81         -


   The results for the best performing feature combinations described in Table 1,
are given in Table 8. As can be seen in this table, using sentence level features
bring improvements over the best results, albeit small.


                                       12
                   Lecture Notes in Computer Science: Authors’ Instructions        9

4.4   Discussion

As can be seen in the experiments section, our system with the newly proposed
features obtains one of the best results obtained so far, except for [2]. Although
[2] obtains the best result on a large TripAdvisor dataset, its main drawback is
that topic models learned by methods such as LDA requires re-training when
a new topic comes. In contrast, our system uses word polarities; therefore it is
very simple and fast. For this reason, it is more fair to compare our system with
similar systems in the literature.


5     Conclusions and Future Work

In this work, we tried to bridge the gap between word-level polarities and review-
level polarity through an intermediate step of sentence level analysis of the re-
views. We formulated new features for sentence level sentiment analysis by an
in-depth analysis of the sentences. We implemented the proposed features and
evaluated them on the TripAdvisor dataset to see the effect of sentence level
features on polarity classification. We observed that the sentence level features
have an effect on sentiment classification, and therefore, we may conclude that
sentences do matter in sentiment analysis and they need to be explored for larger
and more diverse datasets such as blogs. For future work, we will evaluate each
feature set both in isolation and in groups, and work on improving the accu-
racy. Furthermore, we will switch to a regression problem for estimating the star
rating of reviews.
    Sentence level features have other uses since they can be exploited further to
identify the essential sentences in the review. We plan to incorporate sentence
level features for highlighting the important sentences and review summarization
in our open source sentiment analysis system SARE which may be accessed
through http://ferrari.sabanciuniv.edu/sare.


Acknowledgements. This work was partially funded by European Commis-
sion, FP7, under UBIPOL (Ubiquitous Participation Platform for Policy Mak-
ing) Project (www.ubipol.eu).


References
 1. Ahmed, A., Hsinchun, C., Arab, S.: Sentiment analysis in multiple languages: Fea-
    ture selection for opinion classification in Web forums. ACM Transactions on In-
    formation Systems 26, 1–34 (2008)
 2. Bespalov, D., Bai, B., Qi, Y., Shokoufandeh, A.: Sentiment classification based
    on supervised latent n-gram analysis. In: ACM Conference on Information and
    Knowledge Management (CIKM) (2011)
 3. Chang, C.C., Lin, C.J.: Libsvm: a library for support vector machines (2001)
 4. Denecke, K.: How to assess customer opinions beyond language barriers? In:
    ICDIM. pp. 430–435. IEEE (2008)


                                        13
10      Gizem Gezici, Berrin Yanikoglu, Dilek Tapucu, and Yücel Saygın

 5. Esuli, A., Sebastiani, F.: Sentiwordnet: A publicly available lexical resource for
    opinion mining. In: In Proceedings of the 5th Conference on Language Resources
    and Evaluation (LREC06. pp. 417–422 (2006)
 6. Gindl, S., Weichselbraun, A., Scharl, A.: Cross-domain contextualization of senti-
    ment lexicons. Media (2010)
 7. Grbner, D., Zanker, M., Fliedl, G., Fuchs, M.: Classification of customer reviews
    based on sentiment analysis. Social Sciences (2012)
 8. Hatzivassiloglou, V., Mckeown, K.R.: Predicting the semantic orientation of ad-
    jectives. In: Proceedings of ACL-97, 35th Annual Meeting of the Association for
    Computational Linguistics. pp. 174–181. Association for Computational Linguis-
    tics (1997)
 9. Kim, S.m., Hovy, E., Rey, M.: Automatic detection of opinion bearing words and
    sentences pp. 61–66
10. Lau, R.Y.K., Lai, C.L., Bruza, P.B., Wong, K.F.: Leveraging web 2.0 data for
    scalable semi-supervised learning of domain-specific sentiment lexicons. In: Pro-
    ceedings of the 20th ACM international conference on Information and knowledge
    management. pp. 2457–2460. CIKM ’11, ACM, New York, NY, USA (2011)
11. Mao, Y., Lebanon, G.: Isotonic conditional random fields and local sentiment flow.
    In: Advances in Neural Information Processing Systems (2007)
12. Martineau, J., Finin, T.: Delta tfidf: An improved feature space for sentiment
    analysis. In: Adar, E., Hurst, M., Finin, T., Glance, N.S., Nicolov, N., Tseng, B.L.
    (eds.) ICWSM. The AAAI Press (2009)
13. Mcdonald, R., Hannan, K., Neylon, T., Wells, M., Reynar, J.: Structured models
    for fine-to-coarse sentiment analysis. Computational Linguistics (2007)
14. Meena, A., Prabhakar, T.V.: Sentence level sentiment analysis in the presence of
    conjuncts using linguistic analysis. Symposium A Quarterly Journal In Modern
    Foreign Literatures (2), 573–580 (2007)
15. Pang, B., Lee, L.: A sentimental education : Sentiment analysis using subjectivity
    summarization based on minimum cuts. Cornell University Library (2004)
16. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? sentiment classification using
    machine learning techniques. In: Proceedings of EMNLP. pp. 79–86 (2002)
17. Taboada, M., Brooke, J., Tofiloski, M., Voll, K., Stede, M.: Lexicon-based methods
    for sentiment analysis. Comput. Linguist. 37(2), 267–307
18. The TripAdvisor website. http://www.tripadvisor.com (2011), [TripAdvisor
    LLC]
19. Wang, H., Lu, Y., Zhai, C.: Latent aspect rating analysis on review text data: A
    rating regression approach. Proceedings of the 16th ACM SIGKDD international
    conference on Knowledge discovery and data mining pp. 783–792 (2010)
20. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Tech-
    niques. Morgan Kaufmann (2005)
21. Yu, H.: Towards answering opinion questions: Separating facts from opinions and
    identifying the polarity of opinion sentences. Proceeding EMNLP 03 Proceedings
    of the 2003 conference on Empirical methods in natural language processing (2003)
22. Zhai, Z., Liu, B., Xu, H., Jia, P.: Grouping product features using semi-supervised
    learning with soft-constraints. In: Huang, C.R., Jurafsky, D. (eds.) COLING 2010,
    23rd International Conference on Computational Linguistics, Proceedings of the
    Conference, 23-27 August 2010, Beijing, China. pp. 1272–1280. Tsinghua Univer-
    sity Press (2010)
23. Zhang, E., Zhang, Y.: Ucsc on rec 2006 blog opinion mining. In: TREC (2006)


                                         14
                   Lecture Notes in Computer Science: Authors’ Instructions         11

24. Zhao, J., Liu, K., Wang, G.: Adding redundant features for crfs-based sentence sen-
    timent classification. In: Proceedings of the 2008 Conference on Empirical Methods
    in Natural Language Processing. pp. 117–126 (2008)


                                         15