<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning to Analyze Relevancy and Polarity of Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rianne Kaptein</string-name>
          <email>rianne@oxyme.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Oxyme Amsterdam</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes the participation of Oxyme in the profiling task of the RepLab workshop. We use a machine learning approach to predict the relevancy and polarity for reputation. The same classifier is used for both tasks. Features used include query dependent features, relevancy features, tweet features and sentiment features. An important component of the relevancy features is the set of manually provided positive and negative feedback terms. Our best run uses a Naive Bayes classifier and reaches an accuracy of 41.2% on the profiling task. Relevancy of tweets is predicted with an accuracy of 80.9%. Predicting polarity for reputation turns out to be more difficult: the best polarity run achieves an accuracy of 38.1%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <sec id="sec-1-1">
        <title>1. Relevancy: Is the tweet related to the company?</title>
        <p>2. Polarity for Reputation: Does the tweet have positive or negative
implications for the company's reputation?
Concerning relevancy there are some di erences compared to standard
information retrieval tasks such as web search :
{ Standing queries</p>
        <p>In most search scenarios the users create queries on the y according to their
information need at that moment. Queries might be adjusted according to
the search results retrieved by their initial queries. For this task however the
information need is clear, i.e. all tweets about the company in question. The
same query is used to retrieve results over a longer period of time.
{ Binary relevancy decisions</p>
        <p>Although relevancy of a search result is usually a binary decision, it is either
relevant or non-relevant, the output of search systems is often a ranking
of documents with the most relevant documents on top. For this task no
ranking is generated, only the binary annotation relevant or non-relevant is
assigned to each tweet.</p>
        <p>Determining the polarity for reputation has some differences in comparison
with standard sentiment analysis, but in our approach, which learns from the
training data, this should not be an issue. The biggest challenges for the polarity
for reputation analysis are:</p>
        <p>- Dealing with two different languages: English and Spanish.</p>
        <p>- The companies in the training data and the test data are different.</p>
        <p>The main goals of our participation in this task are to:</p>
        <p>- Explore explicit relevance feedback: In web search it has always been difficult to extract more information from
the user than the keyword query. In this type of search, however, queries are
used for an extended period of time to retrieve many results, so the pay-off
of explicit relevance feedback is higher.</p>
        <p>- Devise a transparent method to analyse polarity for reputation:
Most Twitter sentiment analysis tools are far from perfect, and reach an
accuracy anywhere between 40% and 70%. We do not expect our approach
to work perfectly either, so what is important in the interaction with users is
being able to explain why a tweet is tagged with a certain sentiment.</p>
        <p>In the next section we describe our approach to determining relevancy and
polarity of tweets. Section 3 describes the experimental set-up and results. Finally,
in Section 4 the conclusions are presented.</p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>To determine the relevancy and the polarity for reputation we use a machine
learning approach which is the same for both tasks. We use standard machine
learning algorithms, including Naive Bayes, Support Vector Machines and
Decision Trees. The features used in the machine learning algorithms are a
combination of features found in related work.</p>
      <sec id="sec-2-1">
        <title>Query Dependent Features</title>
        <p>
          The first group of features we use to determine the relevancy of a tweet are features
that depend on the query only, and not on specific tweets, as suggested by [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
In our approach we use the following features:
- Wikipedia Disambiguation: This feature has 4 possible values; the higher the
value, the more ambiguous the query:
0: There is no disambiguation page for the entity name
1: There is a disambiguation page, but the page with the entity name
leads directly to the entity
2: There is a disambiguation page, and the page with the entity name
leads to this disambiguation page
3: The page with the entity name leads to another entity
- Is the query also a common first or last name?
- Is the query also a dictionary entry?
- Is the query identical to the entity name? Here we disregard corporation
types such as `S.A.'. Abbreviations such as `VW', and partial queries such as
`Wilkinson' for the entity `Wilkinson Sword', are examples where the query
is not identical to the entity name.
- The number of negative feedback terms. This feature has 4 possible values:
0: No negative feedback terms
1: 1 to 3 negative feedback terms
2: 4 to 10 negative feedback terms
3: More than 10 negative feedback terms
- Query difficulty: this feature is a combination of all features above. The
higher the value of this feature, the more difficult it will be to retrieve relevant
results.
        </p>
        <p>All of these features are language dependent, since, for example, a common name
in English does not have to be a common name in Spanish.</p>
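        <p>As an illustration, the sketch below shows how these query dependent features could be computed. It is not our actual implementation: the disambiguation level and the common-name and dictionary lookups are assumed to be available from elsewhere, and since the exact combination behind the query difficulty feature is not spelled out above, a plain sum is used here.</p>
        <preformat>
# Illustrative sketch (Python) of the query dependent features; not the actual implementation.
# disambiguation_level, is_common_name and is_dictionary_word are assumed inputs
# coming from Wikipedia and word-list lookups done elsewhere.

def negative_feedback_level(negative_terms):
    n = len(negative_terms)
    if n > 10:
        return 3
    if n > 3:
        return 2
    if n > 0:
        return 1
    return 0

def query_features(query, entity_name, negative_terms,
                   disambiguation_level, is_common_name, is_dictionary_word):
    features = {
        "wikipedia_disambiguation": disambiguation_level,  # 0-3, see the list above
        "is_common_name": int(is_common_name),
        "is_dictionary_entry": int(is_dictionary_word),
        "query_equals_entity": int(query.lower() == entity_name.lower()),
        "negative_feedback_level": negative_feedback_level(negative_terms),
    }
    # Query difficulty combines the features above; a plain sum is an assumption
    # made for illustration only.
    features["query_difficulty"] = (
        features["wikipedia_disambiguation"]
        + features["is_common_name"]
        + features["is_dictionary_entry"]
        + (1 - features["query_equals_entity"])
        + features["negative_feedback_level"]
    )
    return features
        </preformat>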
      </sec>
      <sec id="sec-2-2">
        <title>Relevancy Features</title>
        <p>
          The second group of features is based on relevancy. We use the language
modeling approach to determine the relevancy of the content of a tweet. Besides the
search term we also use manual relevance feedback. For each query in both
languages we generate a list of positive and negative relevance feedback terms. To
generate the list of feedback terms we make use of the background corpus that is
provided. From the background corpus that consists of 30,000 tweets crawled per
company name we extract the most frequently used terms and visualize these in
wordclouds using a wordcloud generator tool [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. For each query we create two
wordclouds, one wordcloud from the tweets that contain the search term and
one wordcloud from the tweets that do not contain the search term.
        </p>
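        <p>A minimal sketch of how the term statistics behind these two wordclouds could be gathered is shown below; the tokenisation is a simplified assumption, and the rendering of the clouds themselves is done by the wordcloud generator tool.</p>
        <preformat>
# Minimal sketch (Python): term counts for the two wordclouds, one from tweets
# that contain the search term and one from tweets that do not.
# The tokenisation below is a simplified assumption.

import re
from collections import Counter

def wordcloud_counts(tweets, search_term, top_n=50):
    with_term, without_term = Counter(), Counter()
    for tweet in tweets:
        tokens = re.findall(r"[#@]?\w+", tweet.lower())
        target = with_term if search_term.lower() in tokens else without_term
        target.update(tokens)
    return with_term.most_common(top_n), without_term.most_common(top_n)
        </preformat>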
        <p>The tweets that do contain the search term can still contain positive and
negative feedback terms. In Figure 2.2 the wordcloud for `Nivea' is shown. By
clicking on a term in the cloud, the tweets containing that term are displayed.
This allows us to quickly explore the large amount of tweets. In the example
we clicked on the term `song'. When we inspect the associated tweets, it turns
out `Nivea' is also the name of a band. We therefore add this term to the
negative feedback terms, as well as other words related to the band Nivea, such as
the song titles `Complicated' and `Don't mess with my men'. Positive feedback
terms include `body', `cream', `care', etc. There are also a number of general words in
the wordcloud which could appear in both relevant and
non-relevant tweets, such as `follow' and `lol'. These words are not added to either
of the feedback term sets.</p>
        <p>In Figure 2.2 the wordcloud of the tweets that do not contain the search term
as a separate word is shown. This can happen for example when the search term
is part of a username. In this case we inspect the tweets to see if the username
belongs to a Twitter account about Nivea only, or owned by Nivea. If not, we add
the username to the negative feedback terms. Also, in case of doubt the public
Twitter user profile can be checked. In Figure 2.2 we clicked on the username
`@nivea_mariee'. From the displayed tweets we conclude this account is not
relevant for Nivea, so we add `@nivea_mariee' to the negative feedback terms.
The other terms are treated the same way as the terms in the previous wordcloud,
so for each term we decide whether to add it to the positive or negative feedback
terms or neither. There is actually quite some overlap between the two clouds,
since all of the tweets are search results returned by Twitter for the query `Nivea'.
If we had some training data available for this company we could look
at the words which are used more often in positively or negatively rated tweets.</p>
        <p>To calculate the relevancy score we use the following formula:</p>
        <p>log P(R|D,Q) = log P(R|D)
+ Σ_{t ∈ Q_pos} ( P(t|Q) log P(t|D) − P(t|Q) log P(t|C) )
− Σ_{t ∈ Q_neg} ( P(t|Q) log P(t|D) − P(t|Q) log P(t|C) )</p>
        <p>where P(R|D,Q) stands for the probability that a document D is relevant given query
Q, t stands for a term in a document or a query, and C stands for the background
collection. The positive query terms Q_pos consist of the search terms plus the
positive feedback terms. The document probabilities P(t|D) are smoothed as
follows to avoid taking the log of 0:</p>
        <p>P(t|D) = λ P(t|D) + (1 − λ) P(t|C)</p>
        <p>where we use a value of 0.1 for the smoothing parameter λ.</p>
        <p>Q_neg contains the negative feedback terms. All probabilities are calculated
in log space, so that the numbers do not become too small. P(R|D) is the
prior probability of a document being relevant. Here we use a length prior based
on the number of characters in a tweet to favour longer tweets.</p>
        <p>The relevancy scores of each query are normalised using min-max
normalization, where the minimum score is simulated by a document that contains only
the negative relevance feedback terms, and the maximum score is simulated by a
document that contains only the search words and the positive feedback terms.</p>
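        <p>A minimal sketch of this computation, including the smoothing and the min-max normalisation, is given below. It assumes that P(t|Q) is uniform over the query and feedback terms and that term probabilities are simple counts; both are illustrative assumptions rather than the exact choices of our system.</p>
        <preformat>
# Minimal sketch (Python) of the relevancy score, smoothing and min-max normalisation.
# Assumes uniform P(t|Q) over the positive/negative term sets and count-based statistics.

import math
from collections import Counter

LAMBDA = 0.1  # smoothing parameter

def term_prob(term, counts, total):
    return counts.get(term, 0) / total if total else 0.0

def smoothed_doc_prob(term, doc_counts, doc_len, coll_counts, coll_len):
    # P(t|D) = lambda * P(t|D) + (1 - lambda) * P(t|C)
    return (LAMBDA * term_prob(term, doc_counts, doc_len)
            + (1 - LAMBDA) * term_prob(term, coll_counts, coll_len))

def relevancy_score(doc_tokens, q_pos, q_neg, coll_counts, coll_len, log_prior=0.0):
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = log_prior  # log P(R|D), e.g. a length prior favouring longer tweets
    for terms, sign in ((q_pos, 1.0), (q_neg, -1.0)):
        if not terms:
            continue
        p_tq = 1.0 / len(terms)  # assumed uniform P(t|Q)
        for t in terms:
            p_td = smoothed_doc_prob(t, doc_counts, doc_len, coll_counts, coll_len)
            p_tc = term_prob(t, coll_counts, coll_len)
            if p_td > 0 and p_tc > 0:
                score += sign * p_tq * (math.log(p_td) - math.log(p_tc))
    return score

def min_max_normalise(score, min_score, max_score):
    # min_score / max_score come from the simulated all-negative and
    # all-positive documents described above.
    if max_score == min_score:
        return 0.0
    return (score - min_score) / (max_score - min_score)
        </preformat>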
        <p>The machine learning algorithm uses the following relevancy features:
- Relevancy score
- Tweet content contains a query term
- Tweet content contains a positive feedback term
- Tweet content contains a negative feedback term
- Username equals the query
- Username contains a negative feedback term</p>
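        <p>A sketch of how these relevancy features could be derived from a tweet is shown below; treating the query as a single joined string for the username comparison is an illustrative assumption.</p>
        <preformat>
# Illustrative sketch (Python) of the binary relevancy features; the normalised
# relevancy score comes from the formula given earlier in this section.

def relevancy_features(tweet_text, username, query_terms, pos_terms, neg_terms, rel_score):
    content = tweet_text.lower()
    user = username.lower()
    return {
        "relevancy_score": rel_score,
        "contains_query_term": int(any(t.lower() in content for t in query_terms)),
        "contains_positive_term": int(any(t.lower() in content for t in pos_terms)),
        "contains_negative_term": int(any(t.lower() in content for t in neg_terms)),
        # Joining the query terms for the username comparison is an assumption.
        "username_equals_query": int(user == " ".join(query_terms).lower()),
        "username_contains_negative_term": int(any(t.lower() in user for t in neg_terms)),
    }
        </preformat>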
      </sec>
      <sec id="sec-2-3">
        <title>Tweet Features</title>
        <p>
The next group of features consists mostly of general tweet features. Similar features
have been used in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>- Length of tweet in characters
- Tweet contains a link
- Tweet contains a link to the domain of the entity
- Tweet is a direct message, i.e. starts with a username
- Tweet is a retweet
- Tweet is a question</p>
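        <p>The sketch below illustrates how these tweet features could be extracted; the entity_domain argument is an assumed input giving the entity's own web domain.</p>
        <preformat>
# Illustrative sketch (Python) of the tweet features listed above.
# entity_domain (e.g. "nivea.com") is an assumed input, not part of the dataset.

import re

def tweet_features(text, entity_domain=None):
    urls = re.findall(r"https?://\S+", text)
    return {
        "length_chars": len(text),
        "contains_link": int(bool(urls)),
        "links_to_entity_domain": int(
            entity_domain is not None
            and any(entity_domain.lower() in url.lower() for url in urls)
        ),
        "is_direct_message": int(text.startswith("@")),
        "is_retweet": int(text.startswith("RT @") or " RT @" in text),
        "is_question": int("?" in text),
    }
        </preformat>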
      </sec>
      <sec id="sec-2-4">
        <title>Sentiment Features</title>
        <p>
          To determine the sentiment of a tweet we make use of the scores generated
by the SentiStrength algorithm [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. SentiStrength generates sentiment scores
based on predefined lists of words and punctuation with associated positive or
negative term weights. The rated training tweets are used to add words to the
dictionary and to optimize term weights. Each language has its own lists of words.
The tweets are scanned and all words bearing sentiment, negating words,
words boosting sentiment, question words, slang and emoticons are tagged. Using
positive and negative weights, which can be optimized using training data, the
classification of a tweet is determined.
        </p>
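        <p>For illustration only, the sketch below shows lexicon-based scoring in the spirit of SentiStrength. It is not the actual SentiStrength algorithm, and the tiny word lists are made-up examples; booster words, slang and question words are omitted for brevity.</p>
        <preformat>
# Illustrative sketch (Python) of lexicon-based sentence scoring in the spirit of
# SentiStrength; not the actual algorithm. The word weights below are made-up examples.

POSITIVE = {"thanks": 2, "love": 3, "best": 2}
NEGATIVE = {"rubbish": -3, "swear": -2, "hate": -4}
NEGATORS = {"not", "don't"}
EMOTICONS = {":)": 1, ":(": -1}

def sentence_strength(tokens):
    pos, neg = 1, -1  # SentiStrength-style scales: positive 1..5, negative -1..-5
    negate = False
    for tok in tokens:
        word = tok.lower()
        if word in NEGATORS:
            negate = True
            continue
        score = POSITIVE.get(word, 0) + NEGATIVE.get(word, 0) + EMOTICONS.get(tok, 0)
        if negate and score != 0:
            score = -score
        if score > 0:
            pos = max(pos, score)
        elif score != 0:
            neg = min(neg, score)
        negate = False
    return pos, neg

# Example: sentence_strength("I love this cream :)".split()) returns (3, -1)
        </preformat>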
        <p>
          Altogether we now have 21 features we can use to determine relevancy
and polarity for reputation of tweets. We use the data mining tool Weka [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for
training and testing the classifiers.
        </p>
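        <p>We run the actual experiments in Weka; purely as an illustration, an equivalent experiment could be set up with scikit-learn's Naive Bayes as sketched below, where each feature vector holds the features described in this section.</p>
        <preformat>
# Illustrative sketch only: the experiments in this paper use Weka, not scikit-learn.
# Feature vectors are lists of numeric values (the features from Section 2);
# labels are the relevancy or polarity annotations from the RepLab training data.

from sklearn.naive_bayes import GaussianNB

def train_and_test(train_vectors, train_labels, test_vectors):
    clf = GaussianNB()
    clf.fit(train_vectors, train_labels)
    return clf.predict(test_vectors)
        </preformat>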
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>The profiling task consists of two separate subtasks: filtering and determining
the polarity for reputation.</p>
      <sec id="sec-3-1">
        <title>Experimental Set-Up</title>
        <p>We have created and submitted 5 runs to the RepLab workshop. For each run
we made three choices:
1. The machine learning classifier to use: Naive Bayes (NB), Support Vector
Machine (SVM) or Decision Tree (J48).
2. Which attributes to use as features for the classifier. To determine the
relevancy we always use all attributes as described in the previous section. To
determine polarity for reputation we use either all attributes, or only the
SentiStrength scores.
3. Whether to train two separate classifiers for English tweets and Spanish
tweets (separate) or train only one classifier that handles both Spanish and
English tweets (merged).</p>
        <p>The overall results of the profiling task, combining filtering and polarity for
reputation, are presented in Table 1. Our run OXY 2, using a Naive Bayes classifier
and all possible features in a single classifier for both English and Spanish
tweets, performs best. On average over the topics, 41.2% of the tweets are
annotated correctly. Of all runs submitted to the workshop this is the best score.
In the remainder of this section we take a closer look at the results of the two
subtasks: filtering and polarity.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Results Filtering</title>
        <p>The results of the filtering task are shown in Table 2. Our best run, OXY 2,
is the Naive Bayes classifier using all of the attributes described in Section 2.
Comparing the scores to the overall workshop results, we see that all our runs
outperform all other runs when we look at the accuracy measure, but we score
mediocre on the F(R,S)-filtering measure. The reason for this is that the
F(R,S)-filtering measure does not reward.</p>
        <p>There are some Twitter specific issues that we take into account:
- The Twitter search returns not only results where the search terms are found
in the content of the tweet, but also results where the search terms are found as
part of the username, or in an external link. We do not use the information
contained in the external links. To calculate our relevancy score we only take
into account the content of the tweet. In fact, if the search term occurs in
the username, this can mean three things:
1. The Twitter account is owned by the company, e.g. `Lufthansa USA'.
Tweets from these accounts can all be considered relevant.
2. The company name is used in a different context; it could for example
also be a last name, e.g. `dave gillette'. Tweets from these accounts are mostly
non-relevant.
3. The username refers to the company name because the user likes to
be associated with the company, e.g. `Mr Bmw'. The number of relevant
tweets from these accounts varies: some only tweet about the company
they are named after, others hardly do.
It is relatively easy to manually retrieve the official Twitter accounts for
a company, and regard their tweets as relevant. It is harder to distinguish
between the other two cases, but luckily they can be treated in the same way.
That is, do not regard the occurrence of the company name in the username
on its own as a signal for relevancy, but check whether the content of the
tweet also contains relevant terms.</p>
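        <p>A minimal sketch of this check is given below; the set of official accounts and the positive feedback terms are assumed to be curated manually as described above.</p>
        <preformat>
# Minimal sketch (Python) of the username heuristic described above.
# official_accounts and positive_terms are assumed, manually curated inputs.

def username_relevancy_signal(username, tweet_text, company_name,
                              official_accounts, positive_terms):
    user = username.lower()
    if user in official_accounts:
        return True  # tweets from company-owned accounts are considered relevant
    if company_name.lower() in user:
        # The company name in the username alone is not enough;
        # require relevant terms in the tweet content as well.
        content = tweet_text.lower()
        return any(term.lower() in content for term in positive_terms)
    return False
        </preformat>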
        <p>For our submitted runs we tried to separate relevant and non-relevant
Twitter accounts by adding the usernames to the positive and negative relevance
feedback. Although these are high quality indicators of relevance, their
coverage in the dataset is small. Therefore, for example, we do not see them in
the generated decision trees.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Results Polarity for Reputation</title>
        <p>The results of the task to determine polarity for reputation can be found in
Table 3. No single run performs best on all evaluation measures.
The run OXY 3 performs best on accuracy, with an accuracy of 0.381,
but run OXY 4 performs best on the F(R,S)-filtering score, with a score
of 0.256. In general the scores are not very impressive. Classifying all tweets as
positive results in an accuracy of 0.438, which is better than any of
our runs.</p>
        <p>An advantage of the SentiStrength approach is that we can see why a tweet
is classified into a certain sentiment. Let's look at some examples. The following
tweet is correctly tagged as positive:
`@NIVEA_Australia love your products thanks[2] for following :) [1 emoticon]
[sentence: 2,-1]'
The word `thanks' is recognized as a positive term, as well as the happy emoticon
:). The word `love' however is not tagged as positive.</p>
        <p>Incorrectly classified as negative due to the term `swear' is the following
tweet:
`Best[2] smelling body wash known to man, I swear[-2]!!!![-0.6 punctuation
emphasis][sentence: 2,-3] #apricot #nivea http://instagr.am/p/Jv5o0UG8Gy
[sentence: 1,-1]'</p>
        <p>A problem is that words have different meanings in different contexts.
Apparently in our training data the word `love' occurs in both positive and negative
tweets, since it was not tagged as positive. A bigger training set might solve part
of this problem. When we train the classifier on the test data, the word `love'
gets a positive value of 3.</p>
        <p>Another problem occurs when other companies or products are also
mentioned in the tweet, e.g. in the tweet:
`This boots hand cream is just rubbish[-3][-1 booster word]!![-0.6 punctuation
emphasis][sentence: 1,-3] Gonna buy my nivea back today mchew [sentence:
1,1]'
At the moment we only calculate polarity over the whole tweet. When other
entities occur in the tweet, polarity should only be calculated over the part of
the tweet that deals with the entity that is the topic of the search.</p>
        <p>As we discussed earlier, a problem for our classifier is the lack of training
data. First of all, the amount of training data is small. Secondly, the companies
in the training and test set are different. When our classifier is trained on a
sufficiently large dataset for a company or an industry, such as the automotive
industry, and tested with data from the same company or industry, better results
will be obtained. The best submitted run in the workshop attains an accuracy
of 0.487, so we can conclude that classifying tweets on polarity for reputation is
indeed a difficult task.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>The tasks in the RepLab workshop allow us to work on some relatively new and
challenging problems, i.e. how to detect the relevancy and polarity for reputation
of tweets. In our submission we make use of manually selected feedback terms.
In contrast to web search tasks, it is more likely we will be able to obtain manual
feedback terms from users, since the queries remain the same for an extended
period of time. The effort of providing feedback terms is paid back by receiving
a better set of search results.</p>
      <p>We make use of a machine learning approach to predict the relevancy and
polarity of tweets. We include query dependent features, relevancy features, tweet
features, and sentiment features. The features are language independent, i.e. the
same 22 features are calculated for English and Spanish tweets.</p>
      <p>Our approach works very well to predict the relevancy of tweets. An average
accuracy of 80.9% is achieved using a Naive Bayes classifier. Predicting the
polarity for reputation turns out to be a much harder task. Our best run achieves
an accuracy of 38.1% using the J48 decision tree classifier. One of the main
problems with our machine learning approach is the limited amount of training
data. If the training and test data contained the same companies, the
results would already improve.</p>
      <p>Our best run on the profiling task, that is the combination of the filtering and
polarity tasks, uses a Naive Bayes classifier with
all possible features in a single classifier for both English and Spanish tweets. It
reaches an accuracy of 41.2%, which is the highest accuracy of all official RepLab
submissions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaptein</surname>
          </string-name>
          .
          <article-title>Using Wordclouds to Navigate and Summarize Twitter Search Results</article-title>
          .
          <source>In The 2nd European Workshop on Human-Computer Interaction and Information Retrieval (EuroHCIR)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Thelwall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Buckley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Paltoglou</surname>
          </string-name>
          .
          <article-title>Sentiment strength detection for the social web</article-title>
          .
          <source>JASIST</source>
          ,
          <volume>63</volume>
          (
          <issue>1</issue>
          ):
          <fpage>163</fpage>
          -
          <lpage>173</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Tsagkias</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          . The University of Amsterdam at WePS3. In M. Braschler,
          <string-name>
            <given-names>D.</given-names>
            <surname>Harman</surname>
          </string-name>
          , and E. Pianta, editors,
          <source>CLEF</source>
          (Notebook Papers/LABs/Workshops),
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          , E. Frank, and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hall</surname>
          </string-name>
          .
          <source>Data Mining: Practical Machine Learning Tools and Techniques</source>
          . Morgan Kaufmann, Amsterdam, 3rd edition,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>M.</given-names>
            <surname>Yoshida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Matsushima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sato</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Nakagawa</surname>
          </string-name>
          .
          <article-title>ITC-UT: Tweet Categorization by Query Categorization for On-line Reputation Management</article-title>
          . In M. Braschler,
          <string-name>
            <given-names>D.</given-names>
            <surname>Harman</surname>
          </string-name>
          , and E. Pianta, editors,
          <source>CLEF</source>
          (Notebook Papers/LABs/Workshops),
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>