        A scalable approach to near real-time
        sentiment analysis on social networks

      G. Amati, S. Angelini, M. Bianchi, L. Costantini, G. Marcone

   Fondazione Ugo Bordoni, Viale del Policlinico 147, 00161 Roma, Italy
 {gba,sangelini,mbianchi,lcostantini,gmarcone}@fub.it


Abstract. This paper reports on results collected during the development of a scalable Information Retrieval system for near real-time analytics on social networks. More precisely, we present the end-user functionalities provided by the system, we introduce the main architectural components, and we report on the performance of our multi-threaded implementation. Since the sentiment analysis functionalities are based on techniques for estimating document category proportions, we also report on a comparative experimentation aimed at analysing the effectiveness of such techniques.


1   Introduction

The development of platforms for near real-time analytics on social networks poses very challenging research problems to the Artificial Intelligence and Information Retrieval communities. In this context, sentiment analysis is a tricky task. In fact, sentiment analysis for social networks can be defined as a search-and-classify task, that is, a pipeline of two processes: retrieval and classification [12], [14]. The accuracy of a search-and-classify task thus suffers from the multiplicative effect of the independent errors produced by both retrieval and classification.
The search-and-classify task, however, is just an instance of a more general problem of near real-time analytics. Near real-time analytics is actually based on five main tasks: the retrieval of a preliminary set (the posting lists of the query terms), the assignment of a retrieval score to these documents, the application of binary filters (for example, selecting documents by period of time and opinion polarity), the mining of hidden entities, and, finally, the final sort to display statistical outcomes and to decorate the document pages of results.
All these functionalities must ultimately be designed to handle big data, such as that of Twitter, which generates unbounded streams of data. Moreover, near real-time sentiment analysis for social networks includes end-user functionalities that are typical of either data warehouses or real-time big data analytics platforms. For example, the topic of interest is often represented as a large query to be processed in batch mode, and several search tools must support the query specification phase. On the other hand, systems need to continuously index a huge flow of data generated by multiple data sources, to make new data available as soon as possible, and to support the reactive detection of incoming events of interest.
In this scenario we report the experience acquired in the development of a system specialized in near real-time analytics for the Twitter platform.
In Section 2 we describe our system. More precisely, we present the functionalities allowing end-users to search, classify and estimate category proportions for real-time analytics. The implementation of these functionalities relies on some architectural components derived from the analysis of a typical retrieval process performed by a search engine. As a consequence, we show how all the functionalities can be implemented according to a single retrieval process, and how to scale up by multi-threaded parallelization, or to scale out by means of the distribution of processes on different computational nodes. We conclude the section by reporting the results of an experimentation aimed at assessing the performance of our multi-threaded implementation. The assessment of the distributed version of the system is still in progress. Although the system is not yet optimized, the experimentation validates the viability of our solution.
Among all the implemented functionalities, in Section 3 we focus on the estimation of category proportions for sentiment analysis, since quantification for sentiment analysis is particularly complex to accomplish in near real-time analysis. It is indeed an example of a complex task that requires many Information Retrieval and Machine Learning processing steps to be performed. Because of this, we describe several techniques for category proportion estimation and we provide a comparison of them. Section 4 concludes the paper.


2   A scalable system for near real-time sentiment analysis

In order to identify the requirements for a system enabling near real-time analysis of phenomena occurring on social networks, we took into consideration two kinds of end-users: social scientists and data scientists.
Broadly speaking, a social scientist is a user interested in finding answers to questions such as: what are the most relevant/recent tweets, how many tweets convey a positive/negative opinion, what are the concepts related to a given topic, what is the trend of a given topic, what are the most important topics, and so on. In general, social scientists interact with the system by submitting several queries formalizing their information needs, and they empirically evaluate the quality of the answers provided by the system. The role of social scientist can be played by any user interested in studying or reporting phenomena of social networks that can be connected to scientific disciplines such as sociology, psychology, economics, political science, and so on. By contrast, a data scientist is interested in developing and improving functionalities for social scientists. More precisely, data scientists implement machine learning processes and they monitor the quality of the answers provided by the system by means of statistical analyses. Furthermore, they are in charge of defining and developing new functionalities for reporting, charting, summarizing, etc.
The following Section presents the end-user functionalities provided by the system. They are the result of a user-requirement analysis jointly conducted by social scientists, data scientists and software engineers.


2.1   End-user functionalities for analytics and sentiment analysis
From the end-user perspective, a system for near real-time analytics and senti-
ment analysis should provide three main classes of functions: search, count and
mining functionalities.
Given a query, the search functionalities consist in a suite of operations useful to find: the most relevant tweets (topic retrieval); the most recent tweets in any interval of time (topical timeline); a representative sample of tweets conveying opinions about the topic (topical opinion retrieval); a representative sample of tweets conveying positive or negative opinions about the topic (polarity-driven topical opinion retrieval); any mixture of tweets resulting from the combination of the relevance, time and opinion search dimensions. Search functionalities are used by social scientists to explore the tweets indexed by the system, to detect emerging topics, and to discover new keywords or accounts to be tracked on Twitter; on the other hand, they are used by data scientists to empirically assess the effectiveness of the system.
Count functionalities quantify the result-set size of a given query. As a consequence, they are useful to quantify, for example, the number of positive tweets related to a given topic. The system offers two main methods for counting: the exact count, a database-like function returning the exact number of tweets matching the query, and the estimated count, which statistically estimates the number of tweets belonging to a given result-set. As described in Section 3.1, there are different strategies to perform the estimated count: for the sake of exposition, we anticipate that the two main approaches are classify-and-count and category size estimation.
Finally, a suite of mining functionalities is available: trending topics, query-related
concept mining, geographic distribution of tweets, most representative users for
a topic, and so on.
Both count and mining functionalities are mainly used by social scientists for their study and reporting purposes.
In the next Section we show how the above-mentioned functionalities can be implemented by adopting an Information Retrieval approach.


2.2   A search engine based system architecture
The functionalities presented in the previous Section can be implemented by a system based on a search engine, specifically extended for this purpose. In fact, the classic index structures have to be properly configured to host some additional information about tweets. Among others, an opinion score, a positive opinion score and a negative opinion score, computed at indexing time and stored in the index, enable the implementation of the sentiment analysis functionalities. These scores can be computed by using a dictionary-based approach, as proposed in [1], or by means of an automatic classifier, such as an SVM or a Bayesian classifier. As described in Section 3, these scores can be used at query time to implement functionalities such as exact and estimated counting.
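To make the indexing-time scoring concrete, the following is a minimal sketch of a dictionary-based scorer in the spirit of [1]; the lexicons and their weights are hypothetical placeholders, since the actual dictionaries in [1] are learned, information-theoretic term weights.

```python
# Minimal sketch of dictionary-based opinion scoring at indexing time.
# The lexicons and weights below are hypothetical placeholders.

POSITIVE_LEXICON = {"great": 1.2, "love": 1.5, "win": 0.8}
NEGATIVE_LEXICON = {"awful": 1.4, "hate": 1.6, "lose": 0.9}

def opinion_scores(tweet_text):
    """Return (opinion, positive, negative) scores to store in the index."""
    tokens = tweet_text.lower().split()
    pos = sum(POSITIVE_LEXICON.get(t, 0.0) for t in tokens)
    neg = sum(NEGATIVE_LEXICON.get(t, 0.0) for t in tokens)
    # One simple choice for the overall opinion score: the sum of the
    # two polarity scores, regardless of sign.
    return pos + neg, pos, neg

# The three scores are computed once per tweet at indexing time and stored
# in the index, so that query-time filters only need to read them back.
print(opinion_scores("I love this great team"))   # (2.7, 2.7, 0.0)
```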
Furthermore, due to the scalability requirement, the index data structures have to support mechanisms for document or term partitioning [13]. In the first case, documents are partitioned into several sub-collections and are separately indexed; in the second case, all documents are indexed as a single collection, and then some data structures (i.e. the lexicon and the posting lists) are partitioned. Although the term partitioning approach has some advantages in query processing (e.g. making the routing of queries easier and thus resulting in a lower utilization of resources [13]), it does not scale well: because of this, we adopt a document partitioning approach.
Once the partitioning approach has been selected, it becomes crucial to define a proper document partitioning strategy. We opt for partitioning tweets simply on the basis of their timestamps: this implies that each index contains all the tweets generated during a certain period of time. In our case this strategy is more convenient than others [2], [4], [9], [11], [15], since it is suitable in the presence of an unbounded stream of tweets delivered in chronological order; moreover, it enables the optimization of the query process when a time-based constraint is specified for the query.
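As an illustration, here is a minimal sketch of the timestamp-based routing, assuming daily shards named after their calendar date (the naming scheme is our own, not prescribed by the system):

```python
from datetime import datetime, timezone

def shard_id(tweet_timestamp_ms: int) -> str:
    """Route a tweet to the daily shard covering its creation time.

    One shard per day mirrors the partitioning strategy of Section 2.2:
    each index holds all tweets generated during a fixed time period.
    """
    dt = datetime.fromtimestamp(tweet_timestamp_ms / 1000, tz=timezone.utc)
    return dt.strftime("shard-%Y-%m-%d")

def shards_for_interval(start_ms, end_ms, all_shards):
    """A time-constrained query only touches the shards whose period
    intersects the query interval, skipping all the others."""
    lo, hi = shard_id(start_ms), shard_id(end_ms)
    return [s for s in sorted(all_shards) if lo <= s <= hi]
```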
Finally, we have to decide whether to implement a solution that scales up or one that scales out in terms of the number of indexed tweets. In the first case, a multi-index composed of several shards can be created, updated and used on a single machine: as a consequence, the time needed to resolve a query depends on the computing capacity and on the main memory available on the machine. In the second case, each machine of a computer cluster is responsible for a sub-collection and acts as an indexing and query server: with respect to the time needed to resolve a query, this solution (referred to as a distributed index in the following) exploits the computing capacity of the entire cluster, but introduces some latency due to network communications. Interestingly, in both cases it is possible to define a common set of software components that allows an efficient implementation of the functionalities presented in Section 2.1. These components, briefly described here, can be used to develop an application based on either a multi-index or a distributed index:
  1. Global Statistics Manager (GSM). As soon as new incoming tweets are indexed, the GSM has to update some global statistics, such as the total number of tweets and tokens. For both the multi-index and the distributed index solution, the update operation can simply be performed either at query time or when the collection changes.
  2. Global Lexicon Manager (GLM). The lexicon data structure contains the list and statistics of all the terms in the collection. Both multi-indexes and distributed indexes require a manager providing information about terms with respect to the entire collection. The GLM can rely on a serialized data structure to be updated every time the collection changes (i.e. a global lexicon), or it can compute at query time just the global information needed to resolve the submitted query.
  3. Score Assigner (SA). Any document containing at least one query term is a candidate to be added to the final result-set. The SA assigns a ranking score to each document to quantify its degree of relevance with respect to the query. Using the information provided by the GSM and the GLM, the scores of documents indexed in different shards, or by different query servers, are comparable because they are computed using global statistics. It is worth noting that the opinion scores needed by the sentiment analysis functionalities are computed once and for all at indexing time, and just have to be read from the indexes. In fact, we assume that the classifier model for sentiment analysis does not change over time: as a consequence, any change to the global statistics of the collections does not affect the already computed sentiment scores, and thus the sentiment classifications.
  4. Global Sorter (S). Top-N results are sorted in descending order of score.
  5. Post Processing Retriever (PPR): a second-pass retrieval, such as query expansion, can follow the retrieval phase, or a document score modifier can be applied, such as a mixture of relevance, time and sentiment models.
  6. Post Processing Filter and Entity Miner (EM): some post-processing operations can be performed in order to filter the final result-set by time, country, etc., or by sentiment category membership constraints. If the direct index (i.e. the posting lists of the terms occurring in each document) or other additional data structures are available, text mining operations can also be applied to the result-set, for example the extraction of relevant and trending concepts, mentions, or entities related to the query.
  7. Decorator (D): once the result-set is determined and ordered, some efficient DB-like operations can be performed in order to make the results ready for presentation to the final user (e.g. posting records are decorated with metadata such as title, timestamp, author, text, etc.).

Table 1. Mapping examples of user functionalities over information retrieval processes.

Functionalities                GSM  GLM  SA  S  PPR  EM  D
Query result set count
Classify and count                   X               X
Category estimation             X    X               X
Ranking                         X    X    X   X             X
Trending topics                 X    X           X   X
Query-related concept mining    X    X    X   X   X  X
Table 1 shows which components are involved in the implementation of some exemplifying end-user functionalities. To obtain an efficient implementation of these functionalities, it is crucial to design and implement the listed components in as decoupled a way as possible. It is worth noting that the Query result set count functionality does not depend on any listed component, since it only needs local posting retrieval operations.
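To illustrate how the components cooperate, the following is a minimal sketch of the single retrieval process behind Table 1; all class and method names are illustrative, not the actual API of our implementation.

```python
# Minimal sketch of the retrieval process of Section 2.2; the collaborator
# objects (gsm, glm, scorer, ...) are hypothetical placeholders.

class Pipeline:
    """Composes the components of Section 2.2 into one retrieval process."""

    def __init__(self, gsm, glm, scorer, sorter,
                 ppr=None, miner=None, decorator=None):
        self.gsm = gsm                # Global Statistics Manager
        self.glm = glm                # Global Lexicon Manager
        self.scorer = scorer          # Score Assigner
        self.sorter = sorter          # Global Sorter
        self.ppr = ppr                # Post Processing Retriever (optional)
        self.miner = miner            # Post Processing Filter / Entity Miner
        self.decorator = decorator    # Decorator

    def run(self, query, shards, top_n=100):
        stats = self.gsm.global_statistics()
        lexicon = self.glm.lookup(query.terms)
        # Scores computed in different shards are comparable because they
        # all use the same global statistics and lexicon.
        scored = [self.scorer.score(doc, stats, lexicon)
                  for shard in shards
                  for doc in shard.postings(query.terms)]
        results = self.sorter.top(scored, top_n)
        if self.ppr is not None:
            results = self.ppr.rescore(query, results)
        if self.miner is not None:
            results = self.miner.filter_and_mine(query, results)
        if self.decorator is not None:
            results = self.decorator.decorate(results)
        return results
```

The same composition works for both the multi-index case (shards read by threads on one machine) and the distributed case (shards served by remote query servers).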


    2.3   Assessing the performance of a multi-index implementation
We have developed a multi-index based implementation of the system, adding new data structures to the Terrier framework1 . The current version takes advantage of the multi-threading paradigm to parallelize, as much as possible, the reading operations from the shards.
In order to assess the efficiency of our solution, we use a collection containing more than 153M tweets, written in English, concerning the FIFA 2014 World Cup (up to mid July 2014) and football news (up to mid September 2014). From June 14 to September 14, a new shard was created daily and added to the multi-index, independently of the number of tweets downloaded in the last 24 hours. The final index contains 76 shards, unbalanced in terms of the number of contained tweets, as shown in Figure 1 (each shard contains on average about 2M tweets). We have focused our assessment on the ranking functionality: more precisely, we have used 2127 queries, retrieving an average of about 44,361 tweets each. Table 2 reports the processing time of each component involved in the functionality under testing. In general, the observed performance meets our expectations; however, we identify a potential bottleneck in the decoration phase. The decorator component will have to be carefully developed in the new version of the system based on a distributed index.
1 http://www.terrier.org
Table 2. Processing time of each component involved in the ranking functionality. The GSM is not reported, since it has a negligible processing time. Times are in milliseconds, averaged over queries. We have run the system on a machine with a quad-core i3 CPU clocked at 3.07 GHz and 8GB of RAM. The system being written in the Java programming language, we have allocated about 4GB to the Java Virtual Machine.

        # shards       # docs          GLM              SA              S               D
          25        56,304,653        3.76 ms        0.07 ms         7.23 ms      0.05 ms/doc
          50        99,912,639        6.86 ms        0.13 ms        9.84 ms       0.06 ms/doc
          76        153,137,302       8.20 ms        0.16 ms        10.87 ms      0.07 ms/doc




Fig. 1. Number of documents in the daily shards used for assessing the performance of the multi-index implementation.


3   Comparing techniques for category proportion estimation

On Twitter, time and sentiment polarity can be as important as relevance for ranking documents. Since sentiment polarity detection is a classification task, the IR system needs to perform both classification and search tasks in one single shot. In order to obtain a near real-time classification for large data streams, we need to make some computational approximations and to recover the approximation error by introducing a supplementary model able to correct the results, for example by resizing the proportions by means of estimates of such classification errors [5, 6]. Finally, we correct the number of misclassified items by a linear regression model previously learned on a set of training queries, as presented in [1], or using an adjusted classify & count approach (Section 3.1). At query time we just combine scores, either to aggregate them into estimates of retrieval category sizes, or to select and sort documents by time, relevance and sentiment polarity.
In this Section we report the results of an experimental comparison we conducted on different techniques for category proportion estimation.


       3.1     Category proportion estimation
Let D = {D_1, ..., D_n} be a set of mutually exclusive sentiment categories over the set of tweets Ω, and let q be a topic (story). The problem of size or proportion estimation of the sentiment categories for a story consists in specifying the distribution of the categories P(D_i|q) over the result set of the story q.
Such an estimation is similar to that conducted within a typical statistical problem of the social sciences, macroeconomics or epidemiological studies. In general, if an unbiased sample of the population can be selected, then it can be used to estimate the population categories together with a statistical error due to the size of the sample. For example, Levy & Kass [10] use the Theorem of Total Probability on the observed event A to decompose this event over a set of predefined categories. In Information Retrieval, the observed event A can be, for example, the set of the posting lists of a story. We also assume that P(A|D_i) is estimated from a sample A' of A, that is, by P(A'|D_i). The problem of estimating the category proportions P(D_i) then amounts to determining these probabilities on a sample of observations A' ⊂ A:
$$P(A') = \sum_{i=1}^{n} P(A' \mid D_i)\, P(D_i).$$

If we monitor the event A' as the aggregated outcome of all the observable items in the sample, then we may easily rewrite the Theorem of Total Probability in matrix form as a set of linear equations:

$$P(A') = \underbrace{P(A' \mid D)}_{1 \times |D|} \cdot \underbrace{P(D)}_{|D| \times 1}.$$

We simply derive the category proportions P(D_i) by solving a system of |D| linear equations in |D| variables. From now on we denote all probabilities by P(·|q), to recall the dependence of the observables on the result set of the current query q.
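For illustration, a minimal sketch of the resulting linear system, with invented numbers for a three-category case:

```python
import numpy as np

# Hypothetical three-category example: columns of the matrix are the true
# categories D_i, rows are the observed events, and each column sums to 1.
P_A_given_D = np.array([[0.8, 0.1, 0.1],
                        [0.1, 0.7, 0.2],
                        [0.1, 0.2, 0.7]])
P_A = np.array([0.5, 0.3, 0.2])   # observed event proportions

# Solve the |D| x |D| linear system P(A') = P(A'|D) . P(D).
P_D = np.linalg.solve(P_A_given_D, P_A)
print(P_D)   # approx [0.571, 0.314, 0.114]: the category proportions P(D)
```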
When the assignment of the documents of A, or more generally of the observables for A, to categories is not performed manually but automatically, then it is not only the size of the selected sample A that matters: both the type I and type II errors (false positives and false negatives) produced by misclassification become equally significant. In other words, the accuracy of the classifier also needs to be known for a correct estimation of all the P(D_i|q). If the two types of errors come out to be similar in size, then the final counting outcomes for the category proportions may produce a correct answer. More generally, if the observations are given by a set X of observable variables for the document sample A, then the observables, and their proportions P(X|D), may be used as a set of training data for a linear classifier to derive P(D|q):

$$\underbrace{P(X \mid q)}_{|X| \times 1} = \underbrace{P(X \mid D, q)}_{|X| \times |D|} \cdot \underbrace{P(D \mid q)}_{|D| \times 1}.$$

These equations can thus be resolved, for example, by linear regression. The set of observable variables X can be defined according to several approaches.
  – The classify and count methodology: X is the set of predicted categories D̂_j of a classifier D̂. Misclassification errors are given by the conditional probabilities P(D̂_k|D_j) when k ≠ j. Counting the errors of the classifier on the training data set, and using these measures to correct the category proportions, is at the basis of the adjusted classify and count approach [10, 5, 6].
  – The profile sampling approach: X is a random subset of word profiles S_j, where a profile is a subset of words occurring in the collection. This approach is at the basis of Hopkins & King's method [7].
  – The cumulative approach: X is a set of weighted features f_j of a trained classifier (a weighted sentiment dictionary) [1]. The classifier model can then be used to score each document in the collection. Differently from Hopkins & King's method, which counts the occurrences of an unbiased covering set of profiles for a topic, the classifier approach correlates a cumulative category score with the category proportions for a topic.


Adjusted Classify and Count   The observations A are obtained by a classifier D̂ for the categories D:

$$P(\hat{D}_j \mid q) = \sum_{i=1}^{n} P(\hat{D}_j \mid D_i, q)\, P(D_i \mid q), \qquad j = 1, \ldots, n.$$

We pool the query results, that is, P(D̂'_j|D'_i, q) = P(D̂'_j|D'_i), on a training data set D' and a set of queries. The estimates are derived from this pooling set (i.e. P(D_i) = P(D'_i)) by solving a simple linear system of |D| equations in |D| variables:

$$\underbrace{P(A \mid q)}_{|D| \times 1} = \underbrace{P(A' \mid D')}_{|D| \times |D|} \cdot \underbrace{P(D \mid q)}_{|D| \times 1}.$$

The methodology is automatic and supervised, and therefore does not need to start over at each query. The accuracy of the classifier does not matter much, since the misclassification errors are used for the estimation of the category sizes. On the other hand, not being based on a query-by-query learning model, it does not achieve as high a precision as the manual evaluation of Hopkins & King's method.
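A minimal sketch of the adjustment step, assuming the misclassification probabilities P(D̂_j|D_i) have already been estimated on the pooled training queries:

```python
import numpy as np

def adjusted_classify_and_count(predicted_labels, confusion):
    """Correct raw classify & count proportions with the training-set
    confusion matrix (rows: predicted category, columns: true category,
    each column summing to 1)."""
    n_cat = confusion.shape[0]
    counts = np.bincount(predicted_labels, minlength=n_cat)
    p_hat = counts / counts.sum()           # raw classify & count estimate
    p = np.linalg.solve(confusion, p_hat)   # solve P(Dhat|q) = C . P(D|q)
    p = np.clip(p, 0.0, None)               # clip negatives, then renormalize
    return p / p.sum()

# Hypothetical two-category example: the classifier predicts 70% positives,
# but after adjusting for its known error rates the estimate becomes 80%.
C = np.array([[0.8, 0.3],
              [0.2, 0.7]])
labels = np.array([0] * 7 + [1] * 3)
print(adjusted_classify_and_count(labels, C))   # [0.8, 0.2]
```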


Hopkins & King's method   Let S' be a sample of word profiles over the vocabulary V, that is, S' ⊂ S = 2^V, able to cover the space of events well enough, and let A be the set of relevant documents for a topic q. Let us assess the sentiment polarities of a sample A' of A. About 500 evaluated documents suffice for a statistically significant test. The partition of A' over the categories D yields the statistics for the occurrences of S' in each category, and these proportions are used to estimate P(A|D, q). P(A), instead, is estimated by P(S'), that is, the total number of occurrences of the word profiles of S' in the sample A' with respect to all the word profiles occurring in A'.
The category proportions P(D|q) are estimated as the coefficients of the linear regression model

$$\underbrace{P(A \mid q)}_{|A| \times 1} = \underbrace{P(A \mid D, q)}_{|A| \times |D|} \cdot \underbrace{P(D \mid q)}_{|D| \times 1}.$$

This is not a supervised methodology, as it would be with an automated classifier: it is based on counting word profiles from a covering sample. The advantage is a statistically significant, high accuracy (almost 99%, see Table 3). However, there are many drawbacks. The methodology needs to start over at each query and, to achieve such a high accuracy, a long and costly activity of human evaluation of documents is required. The word profile counting is in any case complex, since profiles are arbitrary subsets of a very large dictionary, and data are very sparse in Information Retrieval. Moreover, the query-by-query linear regression learning model is also time consuming. In conclusion, this method is not based on a supervised learning model, but is essentially driven by a manual process, and linear regression and word profile counting are just used to smooth the maximum likelihood category estimators.
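A minimal sketch of the regression step, assuming the profile proportions have already been computed from the manually labelled sample; we use non-negative least squares as one simple way to enforce the probability constraints, whereas Hopkins & King use their own estimator:

```python
import numpy as np
from scipy.optimize import nnls

def hopkins_king(X, y):
    """X: the |S'| x |D| matrix P(S'|D, q) measured on the labelled sample;
    y: the |S'| vector P(S'|q) measured on the whole result set.
    Returns the category proportions as a probability distribution."""
    beta, _ = nnls(X, y)        # non-negative least squares
    return beta / beta.sum()    # renormalize to sum to one

# Hypothetical example with three profiles and two categories.
X = np.array([[0.6, 0.1],
              [0.3, 0.2],
              [0.1, 0.7]])
y = np.array([0.35, 0.25, 0.40])
print(hopkins_king(X, y))       # [0.5, 0.5]
```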
Cumulative approach   The cumulative approach is a supervised learning technique that consists in the use of linear regression to predict and smooth a sentiment category size on the basis of a cumulative score of the documents [1]. The approach is information-theoretic: for each category, the set F of features is made up of the most informative terms, that is, the terms with the highest information content in that category. Differently from the misclassification recovery models of Levy & Kass and Forman, there is no pipeline of computational processes to perform, namely classifying, then counting, and finally adjusting the category sizes with the number of estimated misclassified items. The technique of the cumulative approach simply correlates the category size with the total number of bits used to code the occurring category features. Since information is additive, the linear regression model is the natural choice to set up such a correlation over a set of features spanning a set of training queries. Similarly to the adjusted classify and count approach, the precision of this methodology is high, and it is reported in Section 3.2.
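A minimal sketch of the regression step of the cumulative approach, with invented training numbers; the actual features and weights in [1] are information-theoretic and learned:

```python
import numpy as np

def fit_cumulative_model(cumulative_scores, true_sizes):
    """Fit one (slope, intercept) pair per category by ordinary least
    squares, over per-query cumulative scores and true category sizes."""
    A = np.vstack([cumulative_scores, np.ones_like(cumulative_scores)]).T
    slope, intercept = np.linalg.lstsq(A, true_sizes, rcond=None)[0]
    return slope, intercept

def predict_size(model, cumulative_score):
    slope, intercept = model
    return max(0.0, slope * cumulative_score + intercept)

# Illustrative training data: cumulative negative-category scores and
# manually assessed negative-category sizes for a few training queries.
scores = np.array([120.0, 310.0, 55.0, 440.0])
sizes = np.array([40.0, 100.0, 20.0, 150.0])
model = fit_cumulative_model(scores, sizes)
print(predict_size(model, 200.0))   # roughly 67 negative tweets predicted
```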


3.2   Experimentation
To assess the effectiveness of classifier-based quantification, we have built an annotated corpus composed of 6305 tweets, manually classified on the basis of the opinion they contain. More precisely: 1358 tweets were classified as positive (i.e. containing a positive opinion), 2293 as negative (i.e. containing a negative opinion), 382 as mixed (i.e. containing both positive and negative opinions), 1959 as neutral (i.e. not containing opinions), and 313 as not classifiable.
We have run two sets of experiments. We have first applied a statistical technique to smooth the proportions obtained from a manual assessment of a document sample. This experiment is essentially manual, because it requires a training set for each query. For each query, instead of the word profiles used in the approach proposed by Hopkins & King, we have used two standard classifiers (Multinomial Naive Bayes, MNB, and SVM with a linear kernel), and the adjusted classify & count (ACC) as a maximum likelihood estimate smoothing technique. However, Hopkins & King's results are hardly reproducible, since the set of admissible profiles is generated by a complex feature selection, and a portion of the negative examples is also removed from the training set of the query. Indeed, these profiles are generated by an adaptation of the technique by King and Lu [8], which randomly chooses subsets of between approximately 5 and 25 words as admissible profiles. This number of words is determined empirically through cross-validation within the labelled set. Therefore, in Table 3 we show our results in comparison to their method as reported in their paper [7].
Table 4 shows that the supervised methods with the adjusted classify & count (ACC) technique achieve a very high precision (96.63%-97.86%), i.e. a Mean Absolute Proportion Error similar to that of Hopkins & King, with a supervised learning process that is not tailored to a single query only, but trained over a set of about 30 queries with a 5-fold cross-validation. The difference in Mean Absolute Proportion Error over 30 queries, produced by a search-like classification process, with respect to Hopkins & King's method on a single query, is minimal and not statistically significant.
These first outcomes, in Table 4, show that standard supervised classification methods can be effectively applied, and quickly implemented, for the quantification of sentiment analysis on new queries.
Table 3. Performance of the Hopkins & King Approach (HKA) [7] and of Support Vector Machines with linear and polynomial kernels.

                        Percent of Blog Posts Correctly Classified
              In-Sample   In-Sample            Out-of-Sample        Mean Absolute
              Fit         2-Cross-Validation   2-Cross-Validation   Proportion Error
HKA           -           -                    -                    1.2
Linear        67.6        55.2                 49.3                 7.7
Polynomial    99.7        48.9                 47.8                 5.3



Table 4. Performance of Classify & Count and Adjusted Classify & Count (ACC) for the MNB and SVM classifiers. Each query has its own training data set.

              # queries   Percent of Tweets Correctly Classified   Mean Absolute
                          (Out-of-Data-Sample Cross-Validation)    Proportion Error
SVM           30          78.82                                    2.01
MNB           30          81.12                                    4.25
ACC-SVM       30          78.82                                    3.37
ACC-MNB       30          81.12                                    2.14
HKA           1           -                                        1.2




The second experiment (Table 5) indeed shows that the ACC smoothing with the use of classifiers is a fully automated supervised method that performs on new queries as well as the manual classification of HKA does on a single query. The classifiers were trained using a set of about 30 queries and a 6-fold cross-validation, where each test set contains new documents coming from the result sets of the new queries (Out-of-Queries-Sample Cross-Validation). We also report the sample fit for each fold (In-Queries-Sample Fit Cross-Validation), which shows an almost perfect category counting with the SVM classifier.
Notice that the Classify & Count (CC) process is much less prone to error than the individual classification accuracy would suggest, because of a possible balancing effect between the error types (see Table 5). However, there is no correlation between the individual classification accuracy and the Mean Absolute Error Rate of the CC process, so the CC approach cannot in general be considered a reliable or statistically significant estimation.
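For reference, a minimal sketch of the evaluation measure used in Tables 3-5, assuming the Mean Absolute Proportion Error is the absolute difference between true and estimated category proportions, averaged over categories and queries and expressed in percentage points:

```python
import numpy as np

def mean_absolute_proportion_error(true_props, est_props):
    """Both arguments: n_queries x n_categories arrays of proportions."""
    true_props = np.asarray(true_props)
    est_props = np.asarray(est_props)
    return np.mean(np.abs(true_props - est_props)) * 100

true = [[0.20, 0.50, 0.30]]
est = [[0.25, 0.45, 0.30]]
print(mean_absolute_proportion_error(true, est))   # 3.33 percentage points
```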
Finally, the cumulative approach achieves high effectiveness (the Multiple R-squared is 0.9781 for the negative category, with a 5-fold cross-validation on the same set of queries) [1].

Table 5. Performance of Classify & Count and Adjusted Classify & Count (ACC) for the MNB and SVM classifiers. The data set is made up of 30 queries, divided into a test set and a training set of queries. The "Fit" and "Cross-Validation" columns report the percent of tweets correctly classified.

              In-Queries-Sample    Mean Absolute      Out-of-Queries-Sample   Mean Absolute
              Fit Cross-Validation Proportion Error   Cross-Validation        Proportion Error
SVM           99.85                0.05               74.76                   5.57
MNB           94.26                3.00               78.46                   3.95
ACC-SVM       99.85                0.03               74.76                   6.23
ACC-MNB       94.26                1.99               78.46                   8.91


4   Conclusion

This paper reported some of the experiences gained during the development of a scalable system for real-time analytics on social networks.
We have presented some architectural components, resulting from the analysis of a typical querying process, that can be used to implement several functionalities of the system. These components can be adopted both for developing a multi-index and a distributed index implementation of the system. We also identified a potential bottleneck in the decoration phase: the related component has to be carefully developed in the distributed version of the system.
Furthermore, we have shown how to estimate in real time the document category proportions for topical opinion retrieval on big data. The outcomes are produced either by a direct count, or by an estimation of the category sizes based on a supervised automated classification, with a smoothing technique to recover the number of misclassified documents. The use of MNB and SVM classifiers, or of information-based dictionaries, to estimate category proportions is highly effective, and achieves almost perfect accuracy if a training phase on the query is also performed.
Search, classification and quantification for analytics can thus be effectively conducted in real time.

        Acknowledgement: Work carried out under the Research Agreement between
        Almawave and Fondazione Ugo Bordoni.


        References
         [1] Giambattista Amati, Marco Bianchi, and Giuseppe Marcone. Sentiment
             estimation on twitter. In IIR, pages 39–50, 2014.
         [2] C. S. Badue, R. Baeza-Yates, B. Ribeiro-Neto, A. Ziviani, and N. Ziviani.
             Analyzing imbalance among homogeneous index servers in a web search
             system. Inf. Process. Manage., 43(3):592–608, 2007.
         [3] Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. Modern Information
Retrieval, volume 463. ACM Press, New York, 1999.
 [4] Jamie Callan. Distributed Information Retrieval. In Advances in Information Retrieval, pages 127–150. Kluwer Academic Publishers, 2000.
         [5] George Forman. Counting positives accurately despite inaccurate classifi-
cation. In João Gama, Rui Camacho, Pavel Brazdil, Alípio Jorge, and Luís
             Torgo, editors, ECML, volume 3720 of Lecture Notes in Computer Science,
             pages 564–575. Springer, 2005.
         [6] George Forman. Quantifying counts and costs via classification. Data Min.
             Knowl. Discov., 17(2):164–206, 2008.
 [7] Daniel Hopkins and Gary King. A method of automated nonparametric
     content analysis for social science. American Journal of Political Science,
54(1):229–247, 2010.
 [8] Gary King, Ying Lu, and Kenji Shibuya. Designing verbal autopsy studies.
     Population Health Metrics, 8(1), 2010.
 [9] Leah S. Larkey, Margaret E. Connell, and Jamie Callan. Collection Selec-
     tion and Results Merging with Topically Organized U.S. Patents and TREC
     Data. In CIKM 2000, pages 282–289. ACM Press, 2000.
[10] P S Levy and E H Kass. A three-population model for sequential screening
     for bacteriuria. American J. of Epidemiology, 91(2):148–54, 1970.
[11] Xiaoyong Liu and Bruce W. Croft. Cluster-based retrieval using language
     models. In SIGIR ’04: Proceedings of the 27th annual international ACM
     SIGIR conference on research and development in information retrieval,
     pages 186–193, New York, NY, USA, 2004. ACM Press.
[12] Craig Macdonald, Iadh Ounis, and Ian Soboroff. Overview of the TREC
     2007 blog track. In Ellen M. Voorhees and Lori P. Buckland, editors, TREC,
     volume Special Publication 500-274. National Institute of Standards and
     Technology (NIST), 2007.
[13] Alistair Moffat, William Webber, Justin Zobel, and Ricardo Baeza-Yates. A
     pipelined architecture for distributed text query evaluation. Inf. Retr.,
     10(3):205–231, June 2007.
[14] Iadh Ounis, Maarten de Rijke, Craig Macdonald, Gilad Mishne, and Ian
     Soboroff. Overview of the trec-2006 blog track. In Text Retrieval Confer-
     ence, 2006.
[15] Diego Puppin, Fabrizio Silvestri, and Domenico Laforenza. Query-driven
     document partitioning and collection selection. In InfoScale ’06: Proceed-
     ings of the 1st international conference on Scalable information systems.
     ACM Press, 2006.