<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Potential and Limitations of Commercial Sentiment Detection Tools</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mark Cieliebak</string-name>
          <email>ciel@zhaw.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oliver Dürr</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fatih Uzdilli</string-name>
          <email>uzdi@zhaw.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Author names in alphabetic order</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Zurich University of Applied Sciences Winterthur</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <author-notes>
        <fn fn-type="other">
          <p>Author names in alphabetic order.</p>
        </fn>
      </author-notes>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>In this paper, we analyze the quality of several commercial tools for sentiment detection. All tools are tested on nearly 30,000 short texts from various sources, such as tweets, news, and reviews. In addition to the quality analysis (measured by various metrics), we also investigate the effect of increasing text length on performance. Finally, we show that combining all tools using machine learning techniques increases the overall performance significantly.</p>
      </abstract>
      <kwd-group>
        <kwd>Sentiment Detection</kwd>
        <kwd>Opinion Mining</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Classification</kwd>
        <kwd>Corpus Analytics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        How good is the state-of-the-art in sentiment detection? If you look at scientific
literature, there exist numerous approaches to the topic and many of them have been
proven in experiments to perform very well, both in precision and recall. For instance,
basic text-based sentiment detection seems to be “solved”, in the sense that precision
and recall of current algorithms are typically above 80% [
        <xref ref-type="bibr" rid="ref18">14</xref>, <xref ref-type="bibr" rid="ref25">22</xref>
        ]. On the other hand,
if one looks at real-world applications that use or include sentiment detection, the
picture changes dramatically. In fact, there exist various blog posts on the web that
state something like this: “More often than not, a positive comment will be classified
as negative or vice-versa” [
        <xref ref-type="bibr" rid="ref20">16</xref>
        ]. Is there really such a large gap between research and
real-life systems?
      </p>
      <p>In this paper, we will tackle this question by evaluating the performance of several
commercial sentiment detection tools. More precisely, we will explore how well
existing tools perform on different sentence-based test corpora. This will allow us to
identify the potential for improvements, and to indicate relevant directions for future
research on sentiment detection. We then combine all tools using machine learning
techniques (Random Forest) to unlock performance beyond that of any single tool.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Sentiment Detection in General</title>
        <p>For the purpose of this paper, “sentiment detection” means to find the polarity
(positive, negative, or neutral) of a given text. The texts are single sentences or very
short texts from a single source (“sentence-based”). This includes the special case of
Twitter documents.</p>
        <p>
          There exist several other types and tasks in the realm of sentiment detection, e.g.
emotion detection (is a text emotional or not?), document-based sentiment detection,
target-specific sentiment detection (e.g. for a product), or rating prediction, where the
number of stars for product reviews is predicted from the text. For a good overview of
sentiment detection and its variants in general, see e.g. [
          <xref ref-type="bibr" rid="ref16">12</xref>
          ], [<xref ref-type="bibr" rid="ref25">22</xref>], or [
          <xref ref-type="bibr" rid="ref19">15</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Comparison of Tools and Algorithms</title>
        <p>
          We are not aware of any scientific study on commercial sentiment detection tools that
tackles questions as presented in this paper. However, there exist several comparison
studies on sentiment detection algorithms, which have a somewhat different focus. In
the following, we briefly summarize some of these studies. On the one hand, there
exist scientific survey papers that explore the abilities of different algorithmic
approaches to sentiment detection. Padmaja et al. list the results of 19 sentiment analysis
papers and categorize each approach by its machine learning algorithm. Typical
accuracy of the approaches is about 80% [
          <xref ref-type="bibr" rid="ref18">14</xref>
          ]. Cui et al. analyze the performance of
different machine learning algorithms on a large test set of product reviews for
predicting the number of “stars”. Precision, recall and F1 score are above 85% for most
algorithms they tested, reaching up to 90% [
          <xref ref-type="bibr" rid="ref10">6</xref>
          ]. Annett et al. compare basic sentiment
analysis techniques on movie blog entries. They show that lexical methods are
50-60% accurate, while machine learning approaches are between 66% and 77% [1].
On the other hand, there are several comparisons of sentiment detection tools that
focus on business needs. These studies are mostly done by companies or agencies,
targeted for the non-scientific reader, and aim at guiding users to select appropriate
tools. For instance, Bitext.com compares 10 sentiment APIs, using a negative
sentence, a comparative sentence and a conditional sentence. They conclude that most of
the APIs have problems with polarity modifiers or intensifiers and conditional
sentences. They also argue that most APIs do not show multiple opinions found in some
sentences [4]. Hawksey analyzes the performance of two sentiment APIs using only
tweets. The precision for polar text is around 20% [
          <xref ref-type="bibr" rid="ref13">9</xref>
          ].
        </p>
        <p>
          Sentiment detection is an integral part of social media monitoring tools. For this
reason, comparisons of social media monitoring tools typically also explore their
sentiment detection abilities. Freshnetworks.com’s comparison of 7 social media
monitoring tools shows that on average they coded positive and negative sentiment
correctly for about 30% of the texts [
          <xref ref-type="bibr" rid="ref12">8</xref>
          ]. Toptenreviews.com provides a ranking of
social media monitoring tools by different aspects, including sentiment analysis [<xref ref-type="bibr" rid="ref24">21</xref>].
Sponder compares social media monitoring tools on sentiment analysis features [
          <xref ref-type="bibr" rid="ref23">19</xref>
          ].
Finally, Kmetz describes how to evaluate sentiment analysis and presents advice for
choosing a sentiment analysis tool for analyzing social media content [
          <xref ref-type="bibr" rid="ref15">11</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Setup</title>
      <p>Our basic question in this experiment is simple: How good are commercial
sentiment detection tools? To answer this question, we evaluated the quality and
performance of nine commercial sentiment detection tools on a test set of annotated texts. The
texts were from different media sources (news, reviews, Twitter, etc.); however, no
context information about the texts was provided to the tools during the evaluation.
We implemented a uniform evaluation framework to submit all documents to the
tools’ APIs and evaluate the responses automatically.</p>
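      <p>As an illustration, such an evaluation loop can be sketched as follows; the per-tool classify callables are hypothetical placeholders, since every commercial API has its own request and response format:</p>
      <preformat>
# Minimal evaluation-loop sketch. The tool interface is a hypothetical
# placeholder; each commercial API has its own request/response format.
from collections import Counter

def evaluate(tools, corpus):
    """tools: dict mapping tool name to a callable(text) that returns
    'positive', 'negative', or 'other'.
    corpus: list of (text, gold_label) pairs."""
    correct = Counter()
    for text, gold in corpus:
        for name, classify in tools.items():
            if classify(text) == gold:
                correct[name] += 1
    # Accuracy per tool over the whole corpus
    return {name: correct[name] / len(corpus) for name in tools}
      </preformat>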
      <sec id="sec-3-1">
        <title>Test Data</title>
        <p>For the evaluation, we searched for publicly available test corpora that contained
annotated short texts from different media sources. We found 7 appropriate corpora,
which contained in total 28,653 texts. Most of these corpora have already been used in
other research and experiments. Each text is either a complete short document or a
single sentence. We used the annotations provided by the corpora to classify each text
as “positive”, “negative”, or “other” (e.g. for neutral or mixed sentiment). For more
details on the test corpora, see Table 1.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Test corpora used in the evaluation, with polar text ratios (pos/neg/oth).</p></caption>
          <table>
            <thead>
              <tr><th>Corpus Name</th><th>Text Type</th><th>Reference</th><th># of Texts</th><th>pos</th><th>neg</th><th>oth</th><th>Average Word Count</th></tr>
            </thead>
            <tbody>
              <tr><td>DAI_tweets</td><td>Tweets</td><td>[<xref ref-type="bibr" rid="ref17">13</xref>]</td><td>4093</td><td>19%</td><td>13%</td><td>67%</td><td>14</td></tr>
              <tr><td>JRC_quotations</td><td>Speech Quotations</td><td>[<xref ref-type="bibr" rid="ref2">2</xref>]</td><td>1290</td><td>15%</td><td>18%</td><td>67%</td><td>30</td></tr>
              <tr><td>TAC_reviews</td><td>Product Review Sentences</td><td>[20]</td><td>2689</td><td>34%</td><td>49%</td><td>17%</td><td>20</td></tr>
              <tr><td>SEM_headlines</td><td>News Headlines</td><td>[<xref ref-type="bibr" rid="ref21">17</xref>]</td><td>1250</td><td>14%</td><td>25%</td><td>61%</td><td>6</td></tr>
              <tr><td>HUL_reviews</td><td>Product Review Sentences</td><td>[<xref ref-type="bibr" rid="ref14">10</xref>]</td><td>3945</td><td>27%</td><td>16%</td><td>57%</td><td>18</td></tr>
              <tr><td>DIL_reviews</td><td>Product Review Sentences</td><td>[<xref ref-type="bibr" rid="ref11">7</xref>]</td><td>4275</td><td>31%</td><td>18%</td><td>51%</td><td>16</td></tr>
              <tr><td>MPQ_news</td><td>News Sentences</td><td>[23]</td><td>11111</td><td>14%</td><td>30%</td><td>55%</td><td>23</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Technical Remarks: Sizes of corpora might differ slightly from their original sizes,
since we skipped texts in our evaluation for which no proper sentiment annotation
was available. As DAI_tweets and JRC_quotations provided several annotations per
text, we used only those texts where all annotations were identical. For TAC_reviews,
the categories MIX (for “mixed sentiment”) and NEU (for “neutral sentiment”) were
merged, and texts with category NR (for “not relevant”) were not used.
SEM_headlines uses numeric annotations; in accordance with its documentation, we
used positive sentiment for texts with value &gt;= 50, other for values from -49 to 49,
and negative for values &lt;= -50. HUL_reviews, DIL_reviews and MPQ_news annotate
features and chunks within a text; we aggregated these annotations as follows: if there
were only positive annotations in a text, the entire text was labeled positive;
analogously, texts with only negative annotations were labeled negative; all other texts
were labeled other.</p>
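        <p>As an illustration, the two non-trivial mappings can be sketched as follows (a sketch under the assumptions stated above; function names are ours):</p>
        <preformat>
def headline_label(score):
    """SEM_headlines: map a numeric annotation to a label."""
    if score &gt;= 50:
        return "positive"
    if score &lt;= -50:
        return "negative"
    return "other"

def aggregate_chunk_labels(chunk_labels):
    """HUL_reviews, DIL_reviews, MPQ_news: map chunk-level
    annotations to a text-level label."""
    labels = set(chunk_labels)
    if labels == {"positive"}:
        return "positive"
    if labels == {"negative"}:
        return "negative"
    return "other"  # mixed or neutral annotations
        </preformat>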
      </sec>
      <sec id="sec-3-2">
        <title>Tools</title>
        <p>For the evaluation, we used commercial state-of-the-art tools for automatic
sentiment detection. There exist literally hundreds of such tools. In order to obtain
comparable results, the tools had to fulfill the following criteria: stand-alone sentiment
detection tool (i.e., not part of a larger system, such as social media monitoring
systems); ability to analyze arbitrary texts (i.e., not specialized on single text types like
tweets); API access; free-of-charge access for the purpose of this evaluation. Based on
these criteria, we selected nine tools, as shown in Table 2.</p>
        <sec id="sec-3-2-1">
          <title>Tool</title>
        </sec>
        <sec id="sec-3-2-2">
          <title>AlchemyAPI</title>
        </sec>
        <sec id="sec-3-2-3">
          <title>Lymbix</title>
        </sec>
        <sec id="sec-3-2-4">
          <title>ML Analyzer</title>
        </sec>
        <sec id="sec-3-2-5">
          <title>Repustate</title>
        </sec>
        <sec id="sec-3-2-6">
          <title>Semantria</title>
        </sec>
        <sec id="sec-3-2-7">
          <title>Sentigem</title>
        </sec>
        <sec id="sec-3-2-8">
          <title>Skyttle</title>
        </sec>
        <sec id="sec-3-2-9">
          <title>Textalytics</title>
        </sec>
        <sec id="sec-3-2-10">
          <title>Text-processing</title>
        </sec>
        <sec id="sec-3-2-11">
          <title>Short Name URL alc lym</title>
          <p>mla
rep
sma
sen
sky
tex
txp
www.alchemyapi.com
www.lymbix.com
www.mashape.com/mlanalyzer/ml-analyzer
www.repustate.com
www.semantria.com
www.sentigem.com
www.skyttle.com
core.textalytics.com
www.text-processing.com</p>
        <p>Technical Remarks: Repustate returns values between -1 and 1, indicating negative
to positive sentiment. We asked the tool provider for appropriate threshold values and
used the thresholds -0.05 and 0.05 to separate negative, other, and positive sentiment,
respectively. Skyttle returns the categories POS and NEG for chunks within the text. We
aggregated these data to entire texts as follows: if there were only positive chunks in
the text, the result was “positive”; if there were only negative chunks, the result was
“negative”; in all other cases, the result was “other” (similar to the adaptation of the
corpus annotations).</p>
        <p>We also had access to webknox.com, which we had to remove from our test because it only
provides positive and negative classes, and this did not fit our experimental setup.</p>
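        <p>A sketch of these two normalizations (the exact boundary handling at the thresholds is our assumption; function names are ours):</p>
        <preformat>
def repustate_label(score, lo=-0.05, hi=0.05):
    """Map Repustate's score in [-1, 1] to a label, using the
    thresholds recommended by the provider."""
    if score &lt; lo:
        return "negative"
    if score &gt; hi:
        return "positive"
    return "other"

def skyttle_label(chunk_categories):
    """Aggregate Skyttle's POS/NEG chunk categories to a text label."""
    cats = set(chunk_categories)
    if cats == {"POS"}:
        return "positive"
    if cats == {"NEG"}:
        return "negative"
    return "other"
        </preformat>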
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Number of Texts</title>
        <p>Text Type
Ratio of Positive Text
Ratio of Negative Text
Ratio of Other Text</p>
      </sec>
      <sec id="sec-4-2">
        <title>Average Accuracy</title>
        <p>Maximum Accuracy
Average F1 Score
Average Precision: Pos
Average Precision: Neg
Average Precision: Oth
Average Recall: Pos
Average Recall: Neg
Average Recall: Oth
Average F1 Score: Pos
Average F1 Score: Neg
Average F1 Score: Oth</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Key Findings</title>
      <sec id="sec-5-1">
        <title>Tools are Wrong for Almost 50% of All Documents</title>
        <p>We found that average accuracy of all tools on all documents is 54%. This means
that if you pick a random tool and submit any of the documents, you have to expect a
wrong result for almost every second document.</p>
        <p>Of course, there are tools that have better average accuracy. But even the tool with
maximum accuracy over all documents, sky, achieves only an accuracy of 60%.
Hence, even with this tool, 4 out of 10 documents will be classified wrongly.</p>
        <p>It is very likely that the commercial classifiers have not been trained with the test
corpora we used. If they were, the accuracy figures could potentially be much different,
and even match the accuracies reported in the scientific literature.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Tweets are Easier than All Other Text Types</title>
        <p>Figure 1 shows that commercial tools can achieve maximum accuracy for tweets
(corpus DAI_tweets). Here, the best tools achieve an accuracy of 76%. For all other
text types, best accuracy is approx. 60% or even lower.</p>
        <p>[Figure 1: Accuracy per corpus (DAI_tweets, JRC_quotations, TAC_reviews, SEM_headlines, HUL_reviews, DIL_reviews, MPQ_news).]</p>
      </sec>
      <sec id="sec-5-3">
        <title>Longer Texts are Harder to Classify</title>
        <p>How is sentiment detection performance affected by text length? To answer that
question, we first have to define what we understand by “performance”. Since the
focus of this study is more on general trends than on the individual performance of the
tools, we measure performance p as the number of tools (0-9) classifying a given text
correctly. We found that p can be modeled by linear regression using p = a*x + b,
with x being the square root of the text length (data not shown). In Figure 2 we
display the slope a for all corpora. A positive value of a indicates that performance
increases with increasing text length.</p>
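        <p>A minimal sketch of this fit, assuming text length is measured in words and that the per-text counts of correct tools are already available:</p>
        <preformat>
import numpy as np

def length_slope(texts, n_correct):
    """Fit p = a*x + b, where x is the square root of the text length
    and p is the number of tools (0-9) classifying the text correctly;
    return the slope a."""
    x = np.sqrt([len(t.split()) for t in texts])  # sqrt of word count
    a, b = np.polyfit(x, n_correct, 1)            # degree-1 (linear) fit
    return a
        </preformat>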
        <p>We observe a slope a &lt; 0 for All Texts (dotted line); thus, longer texts are in
general harder to classify. However, this effect is governed by texts with “other”
sentiment: for all corpora, the performance in detecting “other” sentiment is negatively affected
by text length. For texts with positive or negative sentiment, we find both slightly
increasing and slightly decreasing performance for longer texts. The only exception is corpus
SEM_headlines, where we find a strong increase in performance for longer texts. The
latter might be due to the fact that headlines are very short texts (typically between 4-8
words), and longer headlines give better indications of their sentiment.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Corpus Annotations are Error-Prone</title>
        <p>In NLP research, one usually uses the annotations of test corpora as a "gold standard", in
the sense that they provide a ground truth about the texts. Whenever a tool differs
from this annotation, it is counted as wrong. But our results imply that a non-negligible fraction
of the annotations might be wrong: for 9.2% of all texts, at least 7 of the tools agree on the
tonality, but the corpus annotation is different (see Table 4). That is, seven or more of the
nine tools think a text is, say, positive, but the annotation is negative or other. For one
corpus, this value reaches up to 15%.</p>
        <sec id="sec-5-2-1">
          <title>DAI_tweets</title>
          <p>JRC_quotations
TAC_reviews
SEM_headlines
HUL_reviews
DIL_reviews
MPQ_news</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>ALL Texts</title>
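        <p>This statistic can be computed with a sketch like the following, assuming the nine tool votes per text are available:</p>
        <preformat>
from collections import Counter

def suspicious_fraction(votes_per_text, gold_labels, min_agree=7):
    """Fraction of texts where at least min_agree of the nine tools
    agree on a label that differs from the corpus annotation."""
    suspicious = 0
    for votes, gold in zip(votes_per_text, gold_labels):
        label, count = Counter(votes).most_common(1)[0]
        if count &gt;= min_agree and label != gold:
            suspicious += 1
    return suspicious / len(gold_labels)
        </preformat>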
        <p>Of course, it is possible that all these tools are wrong; but manual
inspection of sample texts showed that we, the authors, would often agree with the tools.
Hence, there is a good chance that the annotations in the test corpora are erroneous.</p>
        <p>
          One explanation might be that good corpus annotations are not easy to obtain: it is
a well-known fact that human agreement on sentiment is far from perfect [24, <xref ref-type="bibr" rid="ref5">3</xref>].
Moreover, not all human annotators are equally qualified: Snow et al. have shown that
it takes on average four non-expert annotators to match the accuracy of one
expert annotator [
          <xref ref-type="bibr" rid="ref22">18</xref>
          ].
        </p>
        <p>It is beyond the scope of this paper to investigate the reasons and implications of
this issue in detail; nevertheless, it will be an interesting and important research
question.</p>
        <p>For the purpose of this paper, we use the corpus annotations “as-is”, since their
impact on our findings is only marginal: some measurements might need to be adapted
slightly due to errors in the corpora, but our main results on the quality of
commercial sentiment analysis tools remain unchanged.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Combined Forces</title>
      <p>Our results above show that many tools perform reasonably well on most of the
corpora. But no tool excels on all corpora. Even more importantly,
maximum accuracy is only about 75% even for the best tools, which is far from perfect.
But what if we combine the tools to build a “meta-tool”? Will we get better results?
We explore this idea next and analyze the potential of two different approaches.</p>
      <sec id="sec-6-1">
        <title>Majority Classifier</title>
        <p>Our first approach is a majority classifier: each input document is submitted to all
nine tools for analysis. Each tool returns a vote for “positive”, “negative”, or “other”.
These votes are collected, and the sentiment that received the most votes is chosen. If
several sentiments tie for the highest number of votes, one of them is
picked at random.</p>
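        <p>A minimal sketch of this voting scheme:</p>
        <preformat>
import random
from collections import Counter

def majority_vote(votes):
    """Return the sentiment with the most votes; ties are broken
    uniformly at random."""
    counts = Counter(votes)
    top = max(counts.values())
    candidates = [label for label, c in counts.items() if c == top]
    return random.choice(candidates)
        </preformat>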
      </sec>
      <sec id="sec-6-2">
        <title>Random Forest Classifier</title>
        <p>A more advanced approach to predict the sentiment from the votes of the tools is
to use a random forest classifier [5]. More precisely, we use the random forest
implementation of the R package randomForest with default settings. For each corpus, we
train the classifier using the votes (negative, other, positive) encoded as the numerical
values (-1, 0, 1), respectively. In Figure 3, accuracy is reported, as usual, as one minus the
out-of-bag error.</p>
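        <p>A sketch of an equivalent pipeline in Python, substituting scikit-learn's RandomForestClassifier and its out-of-bag score for the R implementation used above:</p>
        <preformat>
from sklearn.ensemble import RandomForestClassifier

VOTE_VALUE = {"negative": -1, "other": 0, "positive": 1}

def rf_accuracy(votes_per_text, gold_labels):
    """Train a random forest on the nine tool votes (encoded as
    -1/0/1) and report out-of-bag accuracy (1 minus the OOB error)."""
    X = [[VOTE_VALUE[v] for v in votes] for votes in votes_per_text]
    clf = RandomForestClassifier(oob_score=True, random_state=0)
    clf.fit(X, gold_labels)
    return clf.oob_score_
        </preformat>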
        <p>[Figure 3: Accuracy per corpus for the maximum accuracy of a single tool, the average accuracy of all tools, the majority classifier, and the random forest classifier.]</p>
      </sec>
      <sec id="sec-6-3">
        <title>Result: Random Forest &gt;&gt; Best Single Tool ≈ Majority</title>
      <p>In this work, we evaluated the quality of 9 state-of-the-art commercial sentiment
detection tools on approx. 30,000 short texts (tweets, news headlines,
reviews, etc.). The best tools have an accuracy of 75% for some document types
(tweets), but the average accuracy over all documents is at best 60%. Surprisingly,
accuracy decreases as texts get longer, which is due to the decline in the ability to
detect “other” sentiment. As an aside, we observed that existing sentiment corpora
are prone to error, with error rates of up to 15% per corpus.</p>
      <p>Combining all tools with a meta-classifier can help to improve analysis results. In
fact, using a random forest classifier can improve accuracy by up to 9 percentage points
in comparison to the best single tools.</p>
        <p>Our work gives rise to several interesting directions of future research. A first
direction would be to explore the quality of existing sentiment corpora. How good are
these corpora in reality? Our classification method could be used to find suspicious
texts within a corpus which need further manual verification. This could, on one hand,
lead to better “gold standard” data; on the other hand, we might have to re-analyze
some of the results that are based on such corpora.</p>
        <p>Our main motivation, as mentioned in the introduction, is to explore and
understand the gap between commercial and scientific algorithms for sentiment detection.
We saw that accuracy for commercial tools is only mediocre; on the other hand,
scientific papers often claim excellent accuracy rates. Hence, our next step will be to
apply up-to-date scientific algorithms and prototypes to all test corpora, and compare
these results. From this, we expect interesting insights on how to further improve
existing sentiment detection systems.</p>
      <p>Finally, we want to explore smarter ensemble methods for combining tools beyond
random forest. One could use other ensemble approaches, such as bagging and
boosting, to build new meta-classifiers on top of existing tools. Furthermore, other
features, such as text length or text type, could be used to further improve analysis
results. Since we have already shown that such approaches can improve analysis
quality significantly, it will be interesting to see what level of quality can ultimately
be achieved.</p>
      </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We would like to thank all tool providers for giving us the opportunity to test and
evaluate their systems for free, and for their excellent support. Further we would like
to thank Thilo Stadelmann for carefully reading the manuscript and Andreas
Ruckstuhl for comments and suggestions on the statistical methods.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>nadian Conference on Artificial Intelligence</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Analysis in the News</article-title>
          .
          <source>In: Proceedings of the 7th International Conference on</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Language</given-names>
            <surname>Resources</surname>
          </string-name>
          and
          <article-title>Evaluation (LREC'</article-title>
          <year>2010</year>
          ), pp.
          <fpage>2216</fpage>
          -
          <lpage>2220</lpage>
          . Valletta,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Malta</surname>
          </string-name>
          ,
          <fpage>19</fpage>
          -
          <lpage>21</lpage>
          (May
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>ment for Opinion Retrieval</article-title>
          . In: SIGIR'
          <fpage>09</fpage>
          ,
          <string-name>
            <surname>July</surname>
            <given-names>19</given-names>
          </string-name>
          <source>-23</source>
          ,
          <year>2009</year>
          , Boston, Massa-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>chusetts</surname>
          </string-name>
          , USA (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Sentiment API</surname>
          </string-name>
          <article-title>Market comparison</article-title>
          , http://www.bitext.com/
          <year>2013</year>
          /08
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>/comparing-apis-example</article-title>
          .
          <source>html</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Leo</surname>
            <given-names>Breiman: Random</given-names>
          </string-name>
          <string-name>
            <surname>Forests</surname>
          </string-name>
          .
          <source>Machine Learning 45(1)</source>
          ,
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Hang</given-names>
            <surname>Cui</surname>
          </string-name>
          , Vibhu Mittal, and Mayur Datar:
          <article-title>Comparative Experiments on Sentiment Classification for Online Product Reviews</article-title>
          .
          <source>In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-2006)</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Xiaowen</given-names>
            <surname>Ding</surname>
          </string-name>
          , Bing Liu, and Philip S.
          <article-title>Yu: A Holistic Lexicon-Based Appraoch to Opinion Mining</article-title>
          .
          <source>In: Proceedings of First ACM International Conference on Web Search and Data Mining (WSDM-2008)</source>
          , Stanford University, Stanford, California, USA (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          8.
          <article-title>Social media monitoring report - Turning conversations into insights</article-title>
          , http://www.freshnetworks.com/files/freshnetworks/FINA L%
          <article-title>20FreshNetworks%20version_0</article-title>
          .
          <string-name>
            <surname>pdf</surname>
          </string-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Hawksey</surname>
          </string-name>
          :
          <article-title>Sentiment Analysis of tweets: Comparison of ViralHeat and Text-Processing Sentiment APIs</article-title>
          , http://mashe.hawksey.info/
          <year>2011</year>
          /11/sentiment
          <article-title>-analysis-of-tweets-comparison-ofviralheat-and-text-processing-sentiment-api/ (</article-title>
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Minqing</given-names>
            <surname>Hu</surname>
          </string-name>
          and Bing Liu:
          <article-title>Mining and summarizing customer reviews</article-title>
          .
          <source>In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining (KDD-2004</source>
          , full paper), Seattle, Washington, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          11.
          <string-name>
            <surname>Jackie</surname>
          </string-name>
          <article-title>Kmetz: Measuring Social Sentiment: Assessing and Scoring Opinion in Social Media</article-title>
          , http://www.visibletechnologies.com/ resources/white-papers/measuring-sentiment/ (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          12. Bing Liu:
          <article-title>Sentiment Analysis and Opinion Mining</article-title>
          . Morgan &amp; Claypool Publishers (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          13.
          <string-name>
            <surname>Sascha</surname>
            <given-names>Narr</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Hülfenhaus</surname>
          </string-name>
          , and Sahin Albayrak:
          <article-title>Language-Independent Twitter Sentiment Analysis</article-title>
          .
          <source>In: Knowledge Discovery and Machine Learning (KDML)</source>
          ,
          <source>LWA</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          14.
          <string-name>
            <given-names>S.</given-names>
            <surname>Padmaja</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Sameen</surname>
          </string-name>
          <article-title>Fatima: Opinion Mining and Sentiment Analysis - An Assessment of Peoples' Belief: A Survey</article-title>
          .
          <source>International Journal of Ad hoc, Sensor &amp; Ubiquitous Computing (IJASUC)</source>
          Vol.
          <volume>4</volume>
          , No.
          <volume>1</volume>
          (
          <year>February 2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          15.
          <article-title>Bo Pang and Lillian Lee: Opinion Mining and Sentiment Analysis</article-title>
          . Now Publishers Inc. (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          16.
          <string-name>
            <surname>Matt</surname>
          </string-name>
          <article-title>Rhodes: The problem with automated sentiment analysis</article-title>
          , http://www.freshnetworks.com/blog/2010/05/theproblem
          <article-title>-with-automated-sentiment-analysis/ (</article-title>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          17.
          <string-name>
            <given-names>SemEval</given-names>
            <surname>Corpus</surname>
          </string-name>
          .
          <source>4th Internation Workshop on Semantic Evaluations</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          18.
          <string-name>
            <surname>Rion</surname>
            <given-names>Snow</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brendan O'Connor</surname>
            ,
            <given-names>Daniel</given-names>
          </string-name>
          <string-name>
            <surname>Jurafsky</surname>
          </string-name>
          , and Andrew Y. Ng:
          <article-title>Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks</article-title>
          .
          <source>Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>254</fpage>
          -
          <lpage>263</lpage>
          , (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          19. Marshall Sponder:
          <source>Comparing Social Media Monitoring Platforms on Sentiment Analysis about Social Media Week NYC 10</source>
          , http://www.webmetricsguru.com/archives/2010/01/compar ing-social
          <article-title>-media-monitoring-platforms-on-sentimentanalysis-about-social-media-</article-title>
          <string-name>
            <surname>week-</surname>
          </string-name>
          nyc-
          <volume>10</volume>
          / (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>