<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>MSM</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A new ANEW: Evaluation of a word list for sentiment analysis in microblogs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Finn Årup Nielsen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DTU Informatics, Technical University of Denmark</institution>
          ,
          <addr-line>Lyngby</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <volume>1</volume>
      <fpage>93</fpage>
      <lpage>98</lpage>
      <abstract>
        <p>Sentiment analysis of microblogs such as Twitter has recently gained a fair amount of attention. One of the simplest sentiment analysis approaches compares the words of a posting against a labeled word list, where each word has been scored for valence, a “sentiment lexicon” or “affective word list.” There exist several affective word lists, e.g., ANEW (Affective Norms for English Words), developed before the advent of microblogging and sentiment analysis. I wanted to examine how well ANEW and other word lists perform for the detection of sentiment strength in microblog posts in comparison with a new word list specifically constructed for microblogs. I used manually labeled postings from Twitter scored for sentiment. Using a simple word matching I show that the new word list may perform better than ANEW, though not as well as the more elaborate approach found in SentiStrength.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Sentiment analysis has become popular in recent years. Web services, such as
socialmention.com, may even score microblog postings on Identi.ca and Twitter
for sentiment in real-time. One approach to sentiment analysis starts with labeled
texts and uses supervised machine learning trained on the labeled text data to
classify the polarity of new texts [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Another approach creates a sentiment
lexicon and scores the text based on some function that describes how the words
and phrases of the text match the lexicon. This approach is, e.g., at the core
of the SentiStrength algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
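      <p>As a concrete illustration of the supervised approach, the following is a
minimal sketch assuming scikit-learn and a few hypothetical labeled texts; it is
not the method of any of the cited systems. The lexicon approach is illustrated
further below.</p>
      <preformat>
# Minimal supervised polarity classifier (sketch; scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled texts; 1 marks positive polarity, 0 negative.
texts = ["what a great day", "this is awful", "lovely weather", "I hate delays"]
labels = [1, 0, 1, 0]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["great weather today"]))  # expected: [1]
      </preformat>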
      <p>
        It is unclear what the best way to build a sentiment lexicon is. There
exist several word lists labeled with emotional valence, e.g., ANEW [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], General
Inquirer, OpinionFinder [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], SentiWordNet and WordNet-Affect as well as the
word list included in the SentiStrength software [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These word lists differ by the
words they include, e.g., some do not include strong obscene words and Internet
slang acronyms, such as “WTF” and “LOL”. The inclusion of such terms could
be important for reaching good performance when working with short informal
text found in Internet fora and microblogs. Word lists may also differ in whether
the words are scored with sentiment strength or just positive/negative polarity.
      </p>
      <p>
        I have begun to construct a new word list with sentiment strength and the
inclusion of Internet slang and obscene words. Although we have used it for
sentiment analysis on Twitter data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] we have not yet validated it. Data sets with
manually labeled texts can evaluate the performance of the different sentiment
analysis methods. Researchers increasingly use Amazon Mechanical Turk (AMT)
for creating labeled language data, see, e.g., [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Here I take advantage of this
approach.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Construction of word list</title>
      <p>My new word list was initially set up in 2009 for tweets downloaded for
online sentiment analysis in relation to the United Nations Climate Conference
(COP15). Since then it has been extended. The version termed AFINN-96,
distributed on the Internet
(http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=59819), has
1468 different words, including a few phrases. The newest version has 2477
unique words, including 15 phrases that were not used for this study. Like
SentiStrength (http://sentistrength.wlv.ac.uk/) it uses a scoring range from −5
(very negative) to +5 (very positive). For ease of labeling I only scored for
valence, leaving out, e.g., subjectivity/objectivity, arousal and dominance. The
words were scored manually by the author.</p>
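      <p>In code, such a word list can be represented as a simple mapping from
word to integer valence. The entries below are illustrative assumptions in the
spirit of the list, not quotes from it.</p>
      <preformat>
# AFINN-style sentiment lexicon (hypothetical example entries).
afinn = {
    "abandon": -2,   # most negative words sit around -2
    "awesome": 4,
    "lol": 3,        # Internet slang acronyms are included
    "nice": 3,
    "wtf": -4,       # strong obscene words are rated -4 or -5
}
# Scores range from -5 (very negative) to +5 (very positive); only
# valence is recorded, not arousal or dominance.
      </preformat>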
      <p>
        The word list was initiated from a set of obscene words [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] as well as a few
positive words. It was gradually extended by examining Twitter postings collected
for COP15, particularly the postings which scored high on sentiment using the
list as it grew. I included words from the public domain Original Balanced
Affective Word List (http://www.sci.sdsu.edu/CAL/wordlist/origwordlist.html)
by Greg Siegle. Later I added Internet slang by browsing the Urban Dictionary
(http://www.urbandictionary.com), including acronyms such as WTF, LOL and
ROFL. The most recent additions come from the large word list by Steven J.
DeRose, The Compass DeRose Guide to Emotion Words
(http://www.derose.net/steve/resources/emotionwords/ewords.html). The words
of DeRose are categorized but not scored for valence with numerical values.
Together with the DeRose words I browsed Wiktionary and the synonyms it
provided to further enhance the list. In some cases I used Twitter to determine
in which contexts a word appeared. I also used the Microsoft Web n-gram
similarity Web service (“Clustering words based on context similarity”,
http://web-ngram.research.microsoft.com/similarity/) to discover relevant
words. I do not distinguish between word categories, so to avoid ambiguities I
excluded words such as patient, firm, mean, power and frank. Words such as
“surprise”—with high arousal but with variable sentiment—were not included in
the word list.
      </p>
      <p>
        Most of the positive words were labeled with +2 and most of the negative
words with –2, see the histogram in Figure 1. I typically rated strong obscene
words, e.g., as listed in [
        <xref ref-type="bibr" rid="ref7">7</xref>
], with either –4 or –5. The word list has a bias towards
negative words (1598, corresponding to 65%) compared to positive words (878).
A single phrase was labeled with valence 0. The bias corresponds closely to the
bias found in the OpinionFinder sentiment lexicon (4911 (64%) negative and
2718 positive words).
      </p>
      <p>
        I compared the score of each word with the mean valence of ANEW. Figure 2
shows a scatter plot for this comparison, yielding a Spearman’s rank correlation
of 0.81 when words are directly matched, including only words in the
intersection of the two word lists. I also tried to match entries in ANEW and my
word list by applying Porter word stemming (on both word lists) and WordNet
lemmatization (on my word list) as
implemented in NLTK [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The results
did not change significantly.
      </p>
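      <p>A sketch of this comparison, assuming NLTK and SciPy and two small
hypothetical word-valence dictionaries in place of the full lists:</p>
      <preformat>
# Match my list against ANEW via stems/lemmas and correlate the valences.
# May require: nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer
from scipy.stats import spearmanr

stem = PorterStemmer().stem
lemma = WordNetLemmatizer().lemmatize

afinn = {"abandoned": -2, "admires": 3, "alert": -1, "win": 4}    # my scale: -5..5
anew = {"abandon": 2.5, "admire": 7.1, "alert": 6.2, "win": 8.4}  # ANEW means: 1..9

anew_by_stem = {stem(w): v for w, v in anew.items()}
pairs = [(v, anew_by_stem[stem(lemma(w))])
         for w, v in afinn.items() if stem(lemma(w)) in anew_by_stem]
mine, theirs = zip(*pairs)
rho, _ = spearmanr(mine, theirs)
print(f"Spearman rank correlation: {rho:.2f}")
      </preformat>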
      <p>When splitting the ANEW at valence 5 and my list at valence 0 I find a few
discrepancies: given, mischief, ennui, hard, silly, alert, mischiefs, noisy. Word
stemming generates a few further discrepancies, e.g., alien/alienation,
affection/affected, profit/profiteer.</p>
      <p>Fig. 1. Histogram of my valences.</p>
      <p>
        Apart from ANEW I also examined the General Inquirer and
OpinionFinder word lists. As these word lists report positive/negative polarity,
I associated the positive words with the valence +1 and the negative words
with –1. I furthermore obtained the sentiment strength from SentiStrength via
its Web service (http://sentistrength.wlv.ac.uk/) and converted its positive and
negative sentiments to one single value by selecting the one with the numerically
largest value and zeroing the sentiment if the positive and negative sentiment
magnitudes were equal.
      </p>
      <p>Fig. 2. Correlation between sentiment word lists: ANEW valence plotted
against my list. Pearson correlation = 0.90, Spearman correlation = 0.81,
Kendall correlation = 0.63.</p>
      <p>
        For evaluating and comparing the word list with ANEW, General Inquirer,
OpinionFinder and SentiStrength, a data set of 1,000 tweets labeled with AMT
was used. These labeled tweets were collected by Alan Mislove for the
Twittermood/“Pulse of a Nation” study
(http://www.ccs.neu.edu/home/amislove/twittermood/) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Each tweet was rated ten times to get
a more reliable estimate of the human-perceived mood, and each rating was a
sentiment strength with an integer between 1 (negative) and 9 (positive). The
average over the ten values represented the canonical “ground truth” for this
study. The tweets were not used during the construction of the word list.
      </p>
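      <p>The two conversions just described are small enough to state directly in
code; a sketch, with function names of my own choosing:</p>
      <preformat>
def collapse_sentistrength(pos, neg):
    # Combine SentiStrength's positive (1..5) and negative (-1..-5)
    # scores: keep the numerically largest value, zero on equal magnitudes.
    if abs(pos) == abs(neg):
        return 0
    return max(pos, neg, key=abs)

def polarity_to_valence(word, positive_words, negative_words):
    # General Inquirer and OpinionFinder entries carry polarity only,
    # so positive words are mapped to +1 and negative words to -1.
    if word in positive_words:
        return 1
    if word in negative_words:
        return -1
    return 0

print(collapse_sentistrength(3, -4))  # -4
print(collapse_sentistrength(2, -2))  # 0
      </preformat>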
      <p>To compute a sentiment score of a tweet I identified words and found the
valence for each word by lookup in the sentiment lexicons. The sum of the
valences of the words divided by the number of words represented the combined
sentiment strength for a tweet. I also tried a few other weighting schemes: the
sum of valences without normalization for the number of words, normalizing the
sum with the number of words with non-zero valence, choosing the most extreme
valence among the words, and quantizing the tweet valences to +1, 0 and –1. For
ANEW I also applied a version with matching using the NLTK WordNet
lemmatizer.</p>
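      <p>A minimal sketch of this scoring, assuming a simple regular-expression
tokenization (the paper does not specify the tokenizer):</p>
      <preformat>
import re

def tweet_valence(tweet, lexicon):
    # Sum of the word valences divided by the number of words.
    words = re.findall(r"[\w'-]+", tweet.lower())
    if not words:
        return 0.0
    return sum(lexicon.get(w, 0) for w in words) / len(words)

print(tweet_valence("What a great day, LOL", {"great": 3, "lol": 3}))  # 1.2
      </preformat>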
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>My tokenization identified 15,768 word tokens with 4,095 unique words in
the 1,000 tweets. My word list with its 2,477 words labeled 339 of the unique
words with non-zero sentiments; for ANEW 398 of its 1034 words were found in
our Twitter corpus, for General Inquirer 358, and for OpinionFinder this number
was 562 from a total of 6442, see Table 1 for a scored example tweet.</p>
      <p>Fig. 3. Scatter plot of sentiment strengths for 1,000 tweets with AMT
sentiment plotted against sentiment found by application of my word list
(Pearson correlation = 0.564, Spearman correlation = 0.596).</p>
      <p>I found my list to have a higher correlation with the labeling from the
AMT (Pearson correlation: 0.564, Spearman’s rank correlation: 0.596, see the
scatter plot in Figure 3) than ANEW had (Pearson: 0.525, Spearman: 0.544). In
my application of the General Inquirer word list it did not perform well, having
a considerably lower AMT correlation than my list and ANEW (Pearson: 0.374,
Spearman: 0.422). OpinionFinder with its 90% larger lexicon performed better
than General Inquirer but not as well as my list and ANEW (Pearson: 0.458,
Spearman: 0.491). The SentiStrength analyzer showed superior performance with
a Pearson correlation of 0.610 and a Spearman correlation of 0.616, see
Table 2.</p>
      <p>Table 2. Pearson correlations between sentiment strength detection
methods on 1,000 tweets. AMT: Amazon Mechanical Turk, GI: General Inquirer,
OF: OpinionFinder, SS: SentiStrength.</p>
      <preformat>
        My    ANEW   GI     OF     SS
AMT    .564   .525   .374   .458   .610
My            .696   .525   .675   .604
ANEW                 .592   .624   .546
GI                          .705   .474
OF                                 .512
      </preformat>
      <p>I saw little effect of the different tweet sentiment scoring approaches: for
ANEW the four different Pearson correlations were in the range 0.522–0.526. For
my list I observed correlations in the range 0.543–0.581, with the extreme scoring
as the lowest and sum scoring without normalization as the highest. With
quantization of the tweet scores to +1, 0 and –1 the correlation only dropped to
0.548. For the Spearman correlation the sum scoring with normalization for the
number of words appeared as the one with the highest value (0.596).</p>
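      <p>The weighting schemes compared above can be summarized as follows; the
exact formulas are my reading of the text:</p>
      <preformat>
import math

def sign(x):
    # Quantize a score to +1, 0 or -1.
    return 0 if x == 0 else int(math.copysign(1, x))

def tweet_scores(valences):
    # valences: the lexicon valences of the words of one tweet
    # (zeros for words not in the lexicon).
    nonzero = [v for v in valences if v != 0]
    total = sum(valences)
    return {
        "mean": total / max(1, len(valences)),         # normalized by all words
        "sum": total,                                  # no normalization
        "mean_nonzero": total / max(1, len(nonzero)),  # normalized by matched words
        "extreme": max(valences, key=abs, default=0),  # most extreme valence
        "quantized": sign(total),                      # +1, 0 or -1
    }

print(tweet_scores([0, 0, 3, 0, 3]))
      </preformat>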
      <p>
        To examine whether the difference in performance between the application
of ANEW and my list is due to a different lexicon or a different scoring I looked
at the intersection between the two word lists. With a direct match this
intersection consisted of 299 words. Building two new sentiment lexicons with
these 299 words, one with the valences from my list, the other with valences
from ANEW, and applying them on the Twitter data I found that the Pearson
correlations were 0.49 and 0.52 to ANEW’s advantage.
      </p>
      <p>Fig. 4. Evolution of word list performance: Pearson correlation (upper
panel) and Spearman rank correlation (lower panel) with the AMT labels as a
function of word list size (100 to 2477 words).</p>
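      <p>The intersection experiment amounts to restricting both lexicons to their
shared words before scoring; a sketch reusing tweet_valence from the earlier
snippet, with afinn, anew_rescaled, tweets and amt_scores as hypothetical
inputs:</p>
      <preformat>
from scipy.stats import pearsonr

# Restrict both lexicons to the directly matched words (299 in the paper).
shared = set(afinn).intersection(anew_rescaled)
lex_mine = {w: afinn[w] for w in shared}
lex_anew = {w: anew_rescaled[w] for w in shared}

for name, lex in [("my valences", lex_mine), ("ANEW valences", lex_anew)]:
    predicted = [tweet_valence(t, lex) for t in tweets]
    r, _ = pearsonr(predicted, amt_scores)
    print(name, round(r, 2))
      </preformat>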
      <p>
        On the simple word list approach for sentiment analysis I found my list
performing slightly ahead of ANEW. However, the more elaborate sentiment
analysis in SentiStrength showed the overall best performance with a correlation
to the AMT labels of 0.610. This figure is close to the correlations reported in
the evaluation of the SentiStrength algorithm on 1,041 MySpace comments (0.60
and 0.56) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Even though General Inquirer and OpinionFinder have the largest word
lists, I found I could not make them perform as well as SentiStrength, my list
and ANEW for sentiment strength detection in microblog postings. These two
lists both score words on polarity rather than strength, which could explain the
difference in performance.</p>
      <p>Is the difference between my list and ANEW due to better scoring or more
words? The analysis of the intersection between the two word lists indicated that
the ANEW scoring is better. The slightly better performance of my list with the
entire lexicon may be due to its inclusion of Internet slang and obscene words.</p>
      <p>
        Newer methods, e.g., as implemented in SentiStrength, use a range of
techniques: detection of negation, handling of emoticons and spelling variations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The present application of my list used none of these approaches and might
have benefited from them. However, the SentiStrength evaluation showed that valence switching
at negation and emoticon detection might not necessarily increase the
performance of sentiment analyzers (Tables 4 and 5 in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]).
      </p>
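      <p>Valence switching at negation can be illustrated with a simple rule that
flips the sign of a scored word when the preceding word is a negation; this is an
illustration of the idea, not SentiStrength's actual rule set:</p>
      <preformat>
NEGATIONS = {"not", "no", "never", "cannot"}

def valence_with_negation(words, lexicon):
    # Flip the valence of a word when the preceding word negates it.
    total = 0
    for i, w in enumerate(words):
        v = lexicon.get(w, 0)
        if i and words[i - 1] in NEGATIONS:
            v = -v
        total += v
    return total

print(valence_with_negation("this is not good".split(), {"good": 3}))  # -3
      </preformat>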
      <p>The evolution of the performance (Figure 4) suggests that the addition of
words to my list might still improve its performance slightly.</p>
      <p>Although my list comes out slightly ahead of ANEW in Twitter sentiment
analysis, ANEW is still preferable for scientific psycholinguistic studies as its
scoring has been validated across several persons. Also note that ANEW’s
standard deviation was not used in the scoring; it might have improved its
performance.</p>
      <p>Acknowledgment. I am grateful to Alan Mislove and Sune Lehmann for
providing the 1,000 tweets with the Amazon Mechanical Turk labels and to Steven
J. DeRose and Greg Siegle for providing their word lists. Mislove, Lehmann and
Daniela Balslev also provided input to the article. I thank the Danish Strategic
Research Councils for generous support to the ‘Responsible Business in the
Blogosphere’ project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Opinion mining and sentiment analysis</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          <volume>2</volume>
          (
          <issue>1-2</issue>
          ) (
          <year>2008</year>
          )
          <fpage>1</fpage>
          -
          <lpage>135</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Thelwall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paltoglou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kappas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Sentiment strength detection in short informal text</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>61</volume>
          (
          <issue>12</issue>
          ) (
          <year>2010</year>
          )
          <fpage>2544</fpage>
          -
          <lpage>2558</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bradley</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lang</surname>
            ,
            <given-names>P.J.:</given-names>
          </string-name>
          <article-title>Affective norms for English words (ANEW): Instruction manual and affective ratings</article-title>
          .
          <source>Technical Report C-1</source>
          , The Center for Research in Psychophysiology, University of Florida (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiebe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoffmann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Recognizing contextual polarity in phrase-level sentiment analysis</article-title>
          .
          <source>In: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing</source>
          , Stroudsburg, PA, USA, Association for Computational Linguistics (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>L.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arvidsson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>F.Å.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colleoni</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etter</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Good friends, bad news - affect and virality in Twitter</article-title>
          .
          <source>Accepted for The 2011 International Workshop on Social Computing, Network, and Services (SocialComNet 2011)</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Akkaya</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conrad</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiebe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Amazon Mechanical Turk for subjectivity word sense disambiguation</article-title>
          .
          <source>In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk</source>
          , Association for Computational Linguistics (
          <year>2010</year>
          )
          <fpage>195</fpage>
          -
          <lpage>203</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Baudhuin</surname>
            ,
            <given-names>E.S.:</given-names>
          </string-name>
          <article-title>Obscene language and evaluative response: an empirical study</article-title>
          .
          <source>Psychological Reports</source>
          <volume>32</volume>
          (
          <year>1973</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sapolsky</surname>
            ,
            <given-names>B.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shafer</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaye</surname>
            ,
            <given-names>B.K.</given-names>
          </string-name>
          :
          <article-title>Rating offensive words in three television program contexts</article-title>
          .
          <source>BEA</source>
          <year>2008</year>
          , Research Division (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Natural Language Processing with Python</article-title>
          . O'Reilly, Sebastopol, California (
          <year>June 2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Biever</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Twitter mood maps reveal emotional states of America</article-title>
          .
          <source>The New Scientist</source>
          <volume>207</volume>
          (
          <issue>2771</issue>
          ) (
          <year>July 2010</year>
          )
          <fpage>14</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>