          Sentiment Estimation on Twitter

     Giambattista Amati, Marco Bianchi, and Giuseppe Marcone

                   Fondazione Ugo Bordoni, Rome, Italy
             gba@fub.it, mbianchi@fub.it, gmarcone@fub.it



Abstract We study the classifier quantification problem in the context of
topical opinion retrieval, which consists in estimating the proportions of
sentiment categories in the result set of a topic. We propose a methodology
that circumvents individual classification, allowing real-time sentiment analysis
of huge volumes of data. After discussing existing approaches to quantification,
the proposed methodology is applied to Microblogging Retrieval and
provides statistically significant estimates of sentiment category proportions.
Our solution modifies Hopkins and King's approach in order to remove manual
intervention, making sentiment analysis feasible in real time. Evaluation
is conducted with a test collection made up of about 3.2M tweets.


1    Introduction
Sentiment analysis for social networks is one of the most popular and mature
research problems in Information Retrieval and Machine Learning. Several
companies already provide services aimed at detecting, estimating, and summarizing
the opinions of social network users on different topics. As a consequence, a "gold
rush" has started on finding scalable solutions to perform real-time analysis on
huge volumes of data.
In general, real-time content analysis should accomplish different tasks: distilling
high-quality samples of the populations under investigation, labeling
samples by concepts and taxonomies, and providing accurate estimates of the
sizes and proportions of category populations. Although accurate classification
of single documents and high-precision retrieval are important desiderata
of real-time content analytics, decision making often requires quantitative analysis,
which consists in accurately computing estimates of population sizes
or of the proportion of individuals that fall into a predefined category. For example,
what is important in sentiment analysis or topical opinion retrieval [1]
is how the entire population distributes across sentiment polarities relative to
different topics, possibly at a certain instant of time or showing temporal trends
for these distributions.
We propose a methodology that provides statistically significant estimates
of sentiment category proportions in social science, and in particular in
Microblogging Retrieval, that is, for the result set of tweets of a topic. In
particular, we revisit the classifier quantification problem in the context of
topical opinion retrieval. Classifier quantification was introduced as early as
the 1970s in epidemiology [2], an area in which - similarly to social science -
the quantities of interest are at the aggregate level. Classifier quantification was
later reconsidered in Machine Learning [3].


It is worth noting that, in principle, classifier quantification can be done by
applying two components in sequence: after retrieval, the result set of the
query is passed to a sentiment classifier; then, after classification, one can
count the individuals in each category set. However, according to Pang et al. [4],
who compared more than twenty methods for sentiment analysis, classification
performance does not go beyond 83% accuracy, which is not enough to
compute a statistically significant estimate of category proportions.
To remedy classifier inaccuracy and predict the actual opinion category size,
Forman introduced a "classify and count" (CC) strategy [3,5], which consists
in counting the individuals of each category set after classification (|D̂_k|),
and then adjusting the observed proportions with the classifier error rates
P(D̂_j|D_k) with j ≠ k, which are obtained after training the classifier:

    P(\hat{D}_j) = \sum_k P(\hat{D}_j \mid D_k) \cdot P(D_k)

The actual category proportions P(D_k) are then the solutions of all k (adjusted-
CC) linear equations. Since a sentiment classifier is trained independently
of queries, and the collection C contains millions of documents, P(D̂_j|D_k)
can be observed only on a small sample C′ of the collection. To accept these
adjusted-CC solutions as correct sentiment category proportions among a subset
R_q of relevant documents for a query q, we need to assume that the P(D̂_j|D_k)
are conservative on the subset of relevant documents of the query, even if the
intersection R_q ∩ C′ might be empty. Alternatively, we need to introduce
relevance as an extra category (actually irrelevance as a mutually exclusive
extra category), and train the classifier on each query. The conservativeness
assumption is unrealistic, because there is always a bias of sentiment distribution
between relevant and non-relevant information, and sentiment analysis
quantifies exactly such a bias. In the other case, training the classifier on each query
would be time consuming and would require manual intervention.
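As an illustration, the adjusted-CC correction amounts to solving a small linear
system. The following minimal sketch (ours, in Python with NumPy; the rates and
proportions are invented for the example) recovers the true category proportions
from a misclassification matrix estimated on a training sample:

    import numpy as np

    # M[j, k] = P(D̂_j | D_k): misclassification rates estimated on a
    # held-out training sample (rows: predicted, columns: true category).
    M = np.array([[0.83, 0.10, 0.07],
                  [0.12, 0.80, 0.08],
                  [0.05, 0.10, 0.85]])

    # p_hat[j] = P(D̂_j): proportions observed after classifying and
    # counting the result set of the query.
    p_hat = np.array([0.50, 0.30, 0.20])

    # Solve p_hat = M · p for the true proportions p = P(D_k).
    p = np.linalg.solve(M, p_hat)

    # Clip negatives and renormalize, since the linear solution is not
    # guaranteed to be a proper probability vector.
    p = np.clip(p, 0.0, None)
    print(p / p.sum())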
Notice that, since the adjusted-CC approach uses the misclassification proportions
P(D̂_j|D_k), it works with any classifier accuracy. What matters for a statistically
significant prediction is how large the random sample of the query is.
Therefore, if the training set is a large random sample of the query population,
then a query-by-query approach becomes de facto a manual evaluation
of sentiment. Although the retrieval process is fast, the classification task is
instead time consuming, so that the adjusted-CC approach is not feasible for
real-time analytics.
In topical opinion retrieval, the strategy of filtering the relevance ranking by
sentiment always degrades retrieval performance with respect to other reranking
strategies [6,7], due to the negative effects of the removal of misclassified
relevant information [8].
We instead follow more closely the model proposed by Hopkins & King [9]
and adapt it to retrieval. Hopkins & King get rid of the classifier and propose
a "word profiles counting" approach on a random sample of evaluated
documents. Among the S = 2^V possible subsets (profiles) of the set of words,
they choose a random sample S′ ⊂ S and count the occurrences of the profiles
s ∈ S′ in the true classification D_k:

    P(s \in S') = \sum_j P(s \in S' \mid D_j) \cdot P(D_j)    (1)



Then the P(D_j) are obtained as the regression coefficients of a set of linear
regression equations.
The advantage of Hopkins & King's methodology is that of estimating P(D_j)
without the intermediate step of computing the individual classifications. Unfortunately,
for each query a manual classification is required for a quite large
number of individuals of the population. Hopkins & King's methodology is
thus manual and provides only an automatic smoothing technique to make
the manually labeled proportions statistically significant.
We now adapt Hopkins and King's basic model to topical opinion retrieval.
The main difference with Hopkins and King's methodology is the use of a set
S^k of learned (biased) features for the set S′ (and not a random word profile
sample), one for each category D_k. Also, we do not count the features in the
true category distribution of a query, but we use an information-theoretic
approach that instead counts the number of bits necessary to code all the
features of S^k that occur in the result set of a query. The code is obtained with
respect to the true distribution of a query-independent training set. We then
assume that all these aggregated numbers of bits are linearly correlated with
the proportions P(D̂_k|D_j, q). In such a way we avoid the classification step as
in Hopkins and King's, but we also avoid using a manual evaluation for the
specific query q.
In the next sections we present this approach in order to handle arbitrary
queries without training on single queries.


2    Revisiting Hopkins and King’s Model
First we exploit the theorem of total probability:

    P(\hat{D}_k \mid q) = \sum_j L(\hat{D}_k \mid D_j, q) \cdot \pi(D_j \mid q)    (2)

where D̂_k is the k-th category set according to the chosen classifier, D_j is the true
j-th category set, q is the query, L(D̂_k|D_j, q) is the likelihood, and π(D_j|q) is
the unknown prior on the set of categories (the actual category proportions).
The non-diagonal elements L(D̂_i|D_j, q) of the matrix L, with i ≠ j, are the
errors due to all misclassified individuals.
     The estimation problem can thus be restated as finding estimates
     π̂(D_k|q) of π(D_k|q) that predict the proportions of the categories
     D_k in the population.
We now define population size prediction in terms of the features of a textual
classifier. A textual classifier D̂ can be defined as a set of features S (or a
weighting function from D̂ to S) that assigns individuals to single categories.
For text classifiers D̂, S in particular is a set of words, each word having a
probability L(s ∈ S|D_i, q) of occurrence in the i-th category. S can thus be
chosen as a set of words spanning the population Ω of individuals, and
thus S is simply a function of the classifier D̂. For a weighting function w_{s,i}^q,
such as the SVM classifier, or IR weighting models such as the standard tf-idf
model, or another topic-specific lexicon [10,11], we can make the assumption:

    L(s \in S \mid D_i, q) = \alpha \cdot w_{s,i}^q    (3)


where the parameter α is just a normalizing factor. The theorem of total
probability, spanning over a set S of features, is

    P(s \in S \mid q) = \sum_{i=1}^{n} L(s \in S \mid D_i, q) \, \pi(D_i \mid q)    (4)

where L(s ∈ S|D_i, q) is the likelihood and π(D_i|q) is the prior over the set of
categories.
We may rewrite Equality (4) in matrix form:

    \underset{|S| \times 1}{P(S \mid q)} = \underset{|S| \times |D|}{L(S \mid D, q)} \cdot \underset{|D| \times 1}{\pi(D \mid q)}
Our problem can then be restated as finding the best fitting values π̂(D_i|q)
for the vector π(D|q) in a set of linear equations.
To learn the likelihood matrix L(D̂|D, q) on the entire population, one can
instead use a likelihood matrix L_{X_q} over a sample X_q of tweets retrieved
with respect to the query q, with error ε (L ∼ L_{X_q}), that is

    P(\hat{D} \mid q) = L_{X_q}(\hat{D} \mid D, q) \cdot \pi(D \mid q)    (5)

Passing through the set S of spanning features,

    \underset{|S| \times 1}{P(S \mid q)} = \underset{|S| \times |D|}{L_{X_q}(S \mid D, q)} \cdot \underset{|D| \times 1}{\pi(D \mid q)}

we get the latent regression matrix L′_{X_q} (of size |D| × |S|) such that

    \underset{|D| \times 1}{\pi(D \mid q)} = \Big( \underset{|D| \times |S|}{L'_{X_q}} \cdot \underset{|S| \times |D|}{L_{X_q}(S \mid D, q)} \Big)^{-1} \cdot \underset{|D| \times |S|}{L'_{X_q}} \cdot \underset{|S| \times 1}{P(S \mid q)}
We may use linear regression with the two matrices L_{X_q} and P. Therefore
the estimates π̂(D_k|q) of π(D_k|q) become the coefficients for the k-th category
of the linear regression and provide the estimated proportion of the category
D_k.
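As an illustrative sketch (ours, not part of the original model; the matrices are
invented for the example), the least-squares solution above can be computed
directly with NumPy, given the feature likelihood matrix and the observed
feature probabilities in the result set:

    import numpy as np

    # L[s, k] = L(s ∈ S | D_k, q): likelihood of feature s in category k,
    # estimated on the evaluated sample X_q (|S| rows, |D| columns).
    L = np.array([[0.30, 0.05, 0.10],
                  [0.05, 0.35, 0.10],
                  [0.10, 0.10, 0.05],
                  [0.02, 0.03, 0.20]])

    # p_S[s] = P(s ∈ S | q): observed feature probabilities in the result set.
    p_S = np.array([0.15, 0.18, 0.08, 0.09])

    # π(D|q) = (L'L)^{-1} L' P(S|q): lstsq is a numerically stable
    # equivalent of forming the normal equations explicitly.
    pi, *_ = np.linalg.lstsq(L, p_S, rcond=None)

    # Clip and renormalize to read the coefficients as proportions.
    pi = np.clip(pi, 0.0, None)
    print(pi / pi.sum())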
The main computational cost of this methodology is that S should be learned
by the chosen classifier D̂, that is, S is a function of D̂ and the topic q.
A sentiment textual classifier has at least the following categories:

    \{S^+, S^-, S^{NO}, S^{Mix}, S^{NR}\}

In the presence of a query or a set of queries q, S^{NR} contains all non-relevant
elements, S^{NO} the relevant elements without opinions, S^+ (S^-) the relevant
and strictly positive (negative) opinions, whilst the individuals containing mixed
opinions fall into the remaining category S^{Mix}. To learn S, and thus to build
L_{X_q}, a manual inspection is required, and this is the main drawback of this
methodology for performing any real-time analytics. Hopkins & King suggest using
a random sample of word profiles and manually annotating about 500 blog posts
to achieve 99% statistical significance of the estimates π̂(D_i|q).
In the case of an adjusted-CC approach with individual classifiers, such as SVM,
there is a highly time-intensive tuning phase. Hopkins and King report that
each run of their SVM-based estimator took 60 seconds of computer time,
and a total of five hours for 300 bootstrapped runs, on a collection of 4k blog
posts with a total of 47k tokens.


3     Real time analytics
We now want to get statistics on the sentiment polarity of the entire population
relative to an arbitrary query q at retrieval time.
L_{X_q}(s_i ∈ S|D_k, q) can be interpreted as the fraction of times the sentiment
term s_i occurs in the set X_q of evaluated tweets in D_k retrieved by the
query q, and P(s_i ∈ S|q) is the fraction of times the sentiment term s_i occurs
in the retrieved set. Once we have the evaluation on the query q, the input
matrices L_{X_q} and P can be computed quite fast. A sentiment dictionary,
even when containing many thousands of elements, can be submitted to the system
as a query, and it would be feasible to build the input matrices in seconds, or
much less within a distributed system. However, the regression algorithm
requires additional computation time, even though this remains reasonable
in the presence of a relatively small result set.
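A minimal sketch of this matrix construction (our illustration; function and
variable names are hypothetical) could look as follows: the retrieved tweets are
scanned once, and the occurrence fractions of each dictionary term are
accumulated per evaluated category and overall:

    from collections import Counter

    def build_inputs(retrieved, evaluated, dictionary, categories):
        """retrieved: list of token lists for the result set of q.
        evaluated: list of (token_list, category) pairs for the sample X_q.
        Returns (L, p): per-category and overall term occurrence fractions."""
        # P(s ∈ S | q): fraction of retrieved tweets containing term s.
        counts = Counter(t for tweet in retrieved for t in set(tweet))
        p = [counts[s] / len(retrieved) for s in dictionary]

        # L_{X_q}(s ∈ S | D_k, q): same fraction, within each category.
        L = []
        for s in dictionary:
            row = []
            for k in categories:
                in_k = [tw for tw, cat in evaluated if cat == k]
                row.append(sum(s in set(tw) for tw in in_k) / max(len(in_k), 1))
            L.append(row)
        return L, p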


3.1     Cumulative scoring assumption
We now describe the approach in detail. In the training phase we pool all
relevant documents, irrespective of the query by which they were retrieved, and
then we select the opinionated documents from the rest. We therefore form
two collections: the sentiment sample, included in a document sample. The
reason we use only relevant information is to work with a fully evaluated collection,
and thus to reject from the sentiment dictionary the words that occur randomly
in both collections. The collection of relevant documents provides a prior
distribution for the terms s (see π_s below). We then use the features s ∈ S^i
of a classifier learned from the sentiment category D_i. We apply the linear
regression algorithm of Equation (5) and learn the regression coefficients.
Each coefficient will then be multiplied by the sum of the sentiment scores of the
retrieved documents to obtain the number of tweets in a category for any new
query q.


Training phase In the training phase we make the following assumptions:
 – (Multiple binary dictionaries reduction) We first assume a binary classifier
   for each category k. In particular, for sentiment analysis we learn from
   only the positive and negative result sets, producing two distinct
   dictionaries S^+ and S^-. We thus have two Equations (4), each
   restricted to one of the two chosen categories. The likelihood for
   category k is thus L_X(s_i ∈ S^k|D_k, q).
 – We use information-theoretic arguments to compute L_X(S^k|D_k, q_j) (but
   this is not necessarily a theoretical limit, since any other score weighting
   formula or classifier can be used here in the same way). We pool the result
   sets D_Q^k for a category k from a set of training queries Q, extract the
   term frequencies for that category S^k, and build the classifier for the
   k-th category

       I(s \in S^k) = -\log P(s \mid \pi_s, D_Q^k)

   which is given by a binomial P with prior distribution π_s and frequencies
   observed in D_Q^k.


 – The proportion of documents L_{X_q}(D̂_k|D_k, q) falling in a given category
   with respect to the result set of a query q is proportional to the amount of
   sentiment information for that category in the result set of that query [12],
   that is

       I(S^k \mid q) = \sum_{s \in S^k,\, d \in X_q} -\log P(s \mid \pi_s, D_Q^k)

 – We train and test the classifier with linear regression on a set of queries
   Q and establish how many queries are necessary to learn the estimate
   π̂(D_k);
 – the misclassifications L_X(S ≠ S^k|D_k, q_j) and L_X(S^k|D ≠ D_k, q_j) are not
   computed explicitly, but are recovered by learning distinct regression
   coefficients α̂ · π̂(D_k):

       P(\hat{D}_k \mid q) = \alpha \cdot I(S^k \mid q) \cdot \pi(D_k) \quad \text{for a set of queries } q \in Q    (6)

   where P(D̂_k|q) is the fraction of observed individuals in the category D_k
   with respect to the query q. The errors in prediction between observed
   and predicted values are given by the residuals P(D̂_k|q) − α̂ · I(S^k|q) · π̂(D_k).
   A sketch of this training step is given below.
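To make the training step concrete, here is a minimal sketch (ours; the scoring
helper and the data layout are illustrative assumptions) that computes the
cumulative information-theoretic score I(S^k|q) per training query and fits the
regression coefficient of Equation (6) by least squares:

    import math
    import numpy as np

    def score(result_set, dictionary, term_prob):
        """I(S^k|q): total bits needed to code the dictionary terms that
        occur in the result set, under the training-set distribution."""
        return sum(-math.log2(term_prob[t])
                   for tweet in result_set for t in tweet
                   if t in dictionary)

    def fit_coefficient(train_queries, dictionary, term_prob):
        """Least-squares fit of observed category counts against the
        cumulative scores, over the training queries (Equation 6)."""
        x = np.array([score(rs, dictionary, term_prob)
                      for rs, _ in train_queries])
        y = np.array([obs for _, obs in train_queries])  # evaluated counts
        # Single regression coefficient α̂·π̂(D_k), no intercept.
        return float(x @ y / (x @ x))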


Retrieval phase In the retrieval phase we assume that the number of
tweets in the category k with respect to any query q is:

    |D_k| = \hat{\alpha} \cdot I(S^k \mid q) \cdot \hat{\pi}(D_k)    (7)

and the proportions are given by |D_k| / |C_q|, where |C_q| is the result set
size of the query.
Notwithstanding that the ranking will contain both irrelevant and misclassified
tweets for the k-th category with a positive score I(s ∈ S^k|q), the
predicted number of relevant tweets falling into the k-th category is very close
to the actual number.
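At retrieval time the prediction then reduces to a single multiplication per
category, e.g. (continuing the sketch above; all variable names are illustrative):

    # Learned offline, once per category, on the training queries.
    coef_pos = fit_coefficient(train_queries_pos, dict_pos, term_prob_pos)

    # For a new query: one pass over the result set, no classification.
    size_pos = coef_pos * score(result_set, dict_pos, term_prob_pos)  # |D_k|
    prop_pos = size_pos / len(result_set)                             # |D_k|/|C_q|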


4     Experimentation
4.1     Evaluation measures
The evaluation of category sizes and category proportions for sentiment analysis
is not trivial. Forman suggests using the Kullback-Leibler distance
between the category distribution of the CC-adjusted values and the true values
in the test set to measure the effectiveness of the prediction, irrespective of the
population size. This measure, however, though very simple, does not comply with
the evaluation of quantification in social science or in statistics, where the residuals
between predicted and observed values are measured. Moreover, KL measures the
distance between two distributions irrespective of the test set size. More precisely,
to validate classifier effectiveness for individual classification, an average
over different test sets smaller than their training sets is usually used, for
example through a k-fold cross validation, so that the accuracy (which is the
percentage of successful decisions of the classifier) may not be statistically
significant with respect to the actual population size. For example, if a query has
millions of relevant tweets and the evaluation is done on a few hundred of
them, then the true accuracy can fall into a very large confidence interval of
the observed accuracy at a 95% or higher confidence level. As an example, the
third row of Table 2 shows a prediction of 45% positives within a population
of 251K tweets, but the manual inspection of a sample shows 41%, which
falls into the confidence interval [37.1%, 52%] at the 95% confidence level. In
such a case, more tweets need to be evaluated to reject or accept the
proportion prediction.
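Such intervals can be computed with a standard normal approximation to the
binomial proportion; a small sketch (ours, with an illustrative sample size)
follows:

    import math

    def proportion_ci(p_obs, n, z=1.96):
        """95% normal-approximation confidence interval for an observed
        proportion p_obs estimated from a sample of n evaluations."""
        half = z * math.sqrt(p_obs * (1 - p_obs) / n)
        return p_obs - half, p_obs + half

    # e.g. 41% positives observed on a sample of ~170 evaluated tweets
    print(proportion_ci(0.41, 170))  # roughly (0.336, 0.484)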
In addition, when quantification is not performed with a variant of a CC
approach, as in our method, which is not based on individual classification,
we cannot compute the accuracy of the classifier but may only study the
residuals between the predicted and the observed values.
Here, we use the R-squared regression analysis (based on the sum of the squares
of the residuals) to assess the goodness of fit between the observed and the
estimated category sizes, the number of queries (minus 1) being the degrees of
freedom. Notice that, when an individual classifier is used in quantification, the
residual of the aggregated statistics is

    Obs^+ - Pred^+ = tp + fp - (tp + fn) = fp - fn = -(Obs^- - Pred^-)

Therefore, the misclassification errors in quantification, or in aggregated
classification, are less severe than in individual classification, because the missed
count of the false negatives is partially balanced by the count of the
false positives.
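For instance, with tp = 50, fp = 10 and fn = 7 (our illustrative numbers), the
classifier counts 60 positives against 57 true positives: the aggregate error is
fp − fn = 3, much smaller than the fp + fn = 17 individual misclassifications.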
This observation does not exhaust the evaluation issues. Relevance and sentiment
polarity cannot be separated in the evaluation. Therefore the true distribution
of a test set can only be given for a sample of the population of a specific
user query. However, the sentiment polarity of a non-relevant retrieved document
always contributes to the size prediction, and for difficult queries there are
many non-relevant documents. How the difficulty of a query impacts sentiment
category size prediction must be studied, though we may conjecture
that the sentiment noise brought by irrelevant data maximizes the entropy of
the probabilities over the sentiment vocabulary in the result set. The entropy
maximization smooths the predicted sizes of a polarity-skewed subset of a large
result set towards the polarity means in the collection, that is, the predicted sizes
converge to milder values.
As a final remark, in classification it is also assumed that the test set was
randomly built from the entire population. In general this is not the case, especially
when imbalance between the positive and negative category sizes has been
removed. On the contrary, in IR we easily have imbalance due to the sparsity of
the terms over the collection. We may be accurate on some queries and less on
others, so that the set of predictions x_q, q ∈ Q, must be assessed by standard
statistical tests, that is, by studying the distribution of the residuals to show how
good the fit was (R-squared) or to single out possible query outliers (Cook's
distance, Normal Q-Q plots, etc.).
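These diagnostics are standard for ordinary least squares; a compact sketch
(ours, in plain NumPy rather than a statistics package) that computes residuals,
leverages, and Cook's distances from a design matrix is:

    import numpy as np

    def ols_diagnostics(X, y):
        """Residuals, leverages and Cook's distances for an OLS fit
        y ~ X, with X an (n × p) design matrix."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        # Hat matrix diagonal: leverage of each query.
        lev = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
        p = X.shape[1]
        mse = resid @ resid / (len(y) - p)
        # Cook's distance: influence of each point on the fitted values.
        cooks = resid**2 / (p * mse) * lev / (1 - lev)**2
        return resid, lev, cooks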
As a consequence of the above considerations, we have to distinguish proportion
estimates from population size estimates, because they can vary largely.
When the result set of a query is very small, proportions can be meaningless
and not significant, whilst for large populations with small evaluated
samples the population size prediction can fall into a very large confidence interval.


[Figure 1 (diagnostic plots): Residuals vs Fitted, Scale-Location, Normal Q-Q, and Residuals vs Leverage (with Cook's distance contours) for the positive-category regression; labeled points are Sanremo, Ballaro and TheVoiceOfItaly.]

Figure 1. The linear regression coefficient estimate of the positive category is statistically
significant (p-value < 2e-16). Multiple R-squared is 0.9199, and the F-statistic has a p-value
< 2.2e-16. Positive outliers are "Sanremo", "Ballarò" and "The Voice of Italy".




       4.2    Benchmark description
The TREC Twitter Collection Tweets2011 is the only publicly available very large
collection for testing the retrieval performance of models, but there is not yet a
test collection on Twitter for conducting sentiment analysis. Obviously, nothing
exists for the Italian language. Moreover, Twitter's policy allows the
redistribution of tweet IDs (or user IDs) only, not their content,
and this restriction, together with the limit of a few hundred API requests per
hour (by the GET statuses/show/:id or users/lookup :id methods), makes the
actual distribution of a very large collection prohibitive. In order to test our
methodology, especially for the Italian language, we have thus conducted a
sentiment analysis campaign on Italian TV broadcasting. From December 2012
to April 2013, we collected about 3.2 million tweets related to 30 TV programs.
More than 6300 tweets were evaluated by a team of 5 assessors. For each TV
program a random selection of tweets was manually annotated in terms of:
Relevance, that is:
 – highly relevant (R+), if the main topic of the tweet was the TV program
   itself;


[Figure 2 (diagnostic plots): Residuals vs Fitted, Scale-Location, Normal Q-Q, and Residuals vs Leverage (with Cook's distance contours) for the negative-category regression; labeled points are ServizioPubblico, Ballaro and CheTempoCheFa.]



Figure 2. The linear regression coefficient estimate of the negative category is statistically
significant (p-value < 2e-16). Multiple R-squared is 0.9781, and the F-statistic has a p-value
< 2.2e-16. Negative outliers are "Servizio Pubblico", "Ballarò" and "Che tempo che fa".




 – relevant (R), if the main topic of the tweet was related to the TV program,
   such as a guest or a topic discussed during the TV program;
 – non-relevant (NR), otherwise.
Opinion. Each relevant tweet was then evaluated as:
 – positive (O+), if containing a positive opinion;
 – negative (O−), if containing a negative opinion;
 – mixed (Mix), if containing both positive and negative opinions;
 – neutral (NO), if not containing opinions;
 – other (NC), otherwise.
The results of the annotation activity are summarized in Table 1.


       4.3    Results and discussion
       Although, the test phase is still preliminary, because of the small number
       of training queries and number of evaluated tweets for each query to learn
       dictionaries and test the classifiers, the 5-fold cross validation worked well (see
       Figure 3). To notice that the amount of overall positive information in the
       training set is much less than the negative one, and therefore a number of 30


              Relevance                          Sentiment
    R+    R     NR    Total     O+    O−    Mix   NO    NC   Total
    787   5518  3910  10215     1358  2293  382   1959  313  6305


Table 1. Negativity prevails in TV tweet comments. We have an imbalanced collection of
data from which to learn the sentiment dictionaries.


[Figure 3 (scatter plots): predicted vs. observed category sizes for the positive (POS_SCORE vs. POS_OSS_T) and negative (NEG_SCORE vs. NEG_OSS_T) data; small symbols show the cross-validation predicted values for Folds 1-5.]


Figure 3. 5-fold cross-validation for positive and negative data. The ANOVA table of the
data has the statistics F = 3018 for negative and F = 322 for positive, with 28 degrees of
freedom and p-value Pr(>F) = 2e-16, which indicates a statistically significant predictive
model.




The quantile plots of Figures 1 and 2 (the Q-Q plots) show a normal distribution
around a line, with some outliers in the right upper corner. These outliers
are caused by difficult queries, that is, queries with many relevant documents.
For such queries the cumulative sentiment score becomes a less precise
indicator for predicting the size of the category population, especially
when there is a large bias of the positive (negative) opinions from their mean.
Indeed, the outliers of Figures 1 and 2 are those that receive the highest
number of tweets with respect to the rest of the training queries. They also
have a relatively larger number of negative and positive tweets, respectively,
with respect to the others (the TV talk show "Servizio Pubblico" has a very
small percentage of positive tweets, whilst the song contest "Sanremo" has
a percentage of positive tweets larger than the mean of 27.5%). In
our opinion, big national and international events (when the result set is very
large) may require a specific classifier, and not necessarily a linear regression
classifier. Further investigation on how relevance (the number of irrelevant
tweets in the result set) affects the performance of sentiment predictions is thus
required.


    Figure 4. Comparison of the number of positive and negative tweets at run-time.







       5    Conclusions

We have shown that sentiment estimation can be conducted in real time. For
an arbitrary query, the plot in Figure 4 shows the projected number of positive
and negative tweets at different instants of time. At the moment, the two
classifiers (positive and negative) are independent and do not interact. It can thus
happen that sentiment quantification produces inconsistent statistics when
category proportions are compared. Relaxing the multiple binary dictionaries
reduction assumption, which handles positive and negative classifiers separately,
is thus required.


                  Ret      POS (pred)  POS (obs)                NEG (pred)  NEG (obs)
 Sanremo          825,596  23.2%*      [29.3%, 38.4%] (33.7%)   33.4%       [34.9%, 44.3%] (39.5%)
 Servizio Pubb.   659,264  18.7%*      [21.0%, 31.3%] (25.9%)   50.5%*      [62.0%, 72.9%] (67.7%)
 VoiceOfItaly     251,296  45.7%       [37.1%, 52.3%] (41.1%)   26.7%*      [28.5%, 43.1%] (30.7%)
 Ballarò          344,683  19.9%*      [8.4%, 15.3%]  (11.4%)   43.1%*      [44.1%, 54.8%] (49.4%)
 CheTempoCheFa    134,455  27.3%       [28.8%, 38.6%] (33.5%)   34.3%       [29.5%, 39.4%] (36.6%)


Table 2. Big events require evaluating a very large sample of relevant tweets by sentiment
to reduce the confidence interval (within square brackets). Outliers (denoted by a *) have
indeed very large result sets.




Acknowledgment
Work carried out under Research Agreement with Almawave.


References
 1. Ounis, I., De Rijke, M., Macdonald, C., Mishne, G., and
    Soboroff, I. Overview of the TREC-2006 Blog Track. In Proceedings
    of the Text REtrieval Conference (TREC 2006) (2006), National Institute
    of Standards and Technology.
 2. Levy, P. S., and Kass, E. H. A three population model for sequential
    screening for bacteriuria. American Journal of Epidemiology 91 (1970),
    148–154.
 3. Forman, G. Counting positives accurately despite inaccurate classification.
    In ECML (2005), pp. 564–575.
 4. Pang, B., Lee, L., and Vaithyanathan, S. Thumbs up?: sentiment
    classification using machine learning techniques. In EMNLP '02: Proceedings
    of the ACL-02 Conference on Empirical Methods in Natural Language
    Processing (Morristown, NJ, USA, 2002), Association for Computational
    Linguistics, pp. 79–86.
 5. Forman, G. Quantifying counts and costs via classification. Data Min.
    Knowl. Discov. 17, 2 (2008), 164–206.
 6. He, B., Macdonald, C., He, J., and Ounis, I. An effective statistical
    approach to blog post opinion retrieval. In Proceedings of the 17th ACM
    Conference on Information and Knowledge Management (New York, NY,
    USA, 2008), CIKM ’08, ACM, pp. 1063–1072.
 7. Huang, X., and Croft, W. B. A unified relevance model for opinion
    retrieval. In CIKM '09: Proceedings of the 18th ACM Conference on
    Information and Knowledge Management (New York, NY, USA, 2009),
    ACM, pp. 947–956.
 8. Amati, G., Amodeo, G., Capozio, V., Gaibisso, C., and Gambosi,
    G. On performance of topical opinion retrieval. In SIGIR (2010), F. Crestani,
    S. Marchand-Maillet, H.-H. Chen, E. N. Efthimiadis, and J. Savoy,
    Eds., ACM, pp. 777–778.
 9. Hopkins, D., and King, G. A method of automated nonparametric
    content analysis for social science. American Journal of Political Science
    54, 1 (2010), 229–247.
10. Esuli, A., and Sebastiani, F. SentiWordNet: A publicly available
    lexical resource for opinion mining. In Proceedings of LREC-06, the 5th
    Conference on Language Resources and Evaluation (2006).
11. Jijkoun, V., de Rijke, M., and Weerkamp, W. Generating focused
    topic-specific sentiment lexicons. In Proceedings of the 48th Annual
    Meeting of the Association for Computational Linguistics (Stroudsburg,
    PA, USA, 2010), ACL '10, Association for Computational Linguistics,
    pp. 585–594.
12. Amati, G., Ambrosi, E., Bianchi, M., Gaibisso, C., and Gambosi,
    G. Automatic construction of an opinion-term vocabulary for ad
    hoc retrieval. In ECIR (2008), C. Macdonald, I. Ounis, V. Plachouras,
    I. Ruthven, and R. W. White, Eds., vol. 4956 of Lecture Notes in
    Computer Science, Springer, pp. 89–100.

