<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A scalable approach to near real-time sentiment analysis on social networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>G. Amati</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S. Angelini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Bianchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>L. Costantini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>G. Marcone</string-name>
          <email>gmarconeg@fub.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Ugo Bordoni</institution>
          ,
          <addr-line>Viale del Policlinico 147, 00161 Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper reports results collected during the development of a scalable Information Retrieval system for near real-time analytics on social networks. More precisely, we present the end-user functionalities provided by the system, introduce its main architectural components, and report on the performance of our multi-threaded implementation. Since the sentiment analysis functionalities are based on techniques for estimating document category proportions, we also report on a comparative experiment that analyses the effectiveness of such techniques.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The development of platforms for near real-time analytics on social networks
poses very challenging research problems to the Artificial Intelligence and
Information Retrieval communities. In this context, sentiment analysis is a tricky task.
In fact, sentiment analysis for social networks can be defined as a
search-and-classify task, that is, a pipeline of two processes: retrieval and classification [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The accuracy of a search-and-classify task thus suffers from the multiplicative
effects of the independent errors produced by both retrieval and classification.
The search-and-classify task, however, is just one example of a more general
problem of near real-time analytics. Near real-time analytics is actually based on five
main tasks: the retrieval of a preliminary set (the posting lists of the query terms),
the assignment of a retrieval score to these documents, the application of binary
filters (for example, selecting documents by time period and opinion
polarity), the mining of hidden entities, and, finally, the sort that displays statistical
outcomes and decorates the result pages.
      </p>
      <p>All these functionalities must ultimately be designed to handle big data,
such as that of Twitter, which generates unbounded streams of data. Moreover, near
real-time sentiment analysis for social networks includes end-user functionalities that
are typical of either data warehouses or real-time big data analytics platforms.
For example, the topic of interest is often represented as a large query to be
processed in batch mode, and several search tools must support the query
specification phase. On the other hand, systems need to continuously index a huge flow of
data generated by multiple data sources, to make new data available as soon as
possible, and to enable the prompt detection of incoming events of interest.
In this scenario, we report the experience acquired in the development of a system
specialized in near real-time analytics for the Twitter platform.</p>
      <p>In Section 2 we describe our system. More precisely, we present the
functionalities that allow end-users to search, classify and estimate category
proportions for real-time analytics. The implementation of these functionalities relies
on a set of architectural components derived from the analysis of a typical
retrieval process performed by a search engine. As a consequence, we show how
all functionalities can be implemented according to a single retrieval process and
how to scale up through multi-threaded parallelization, or scale out by
distributing processes over different computational nodes. We conclude the section
by reporting the results of an experiment that assesses the performance of
our multi-threaded implementation. The assessment of the distributed version of
the system is still in progress. Even though the system is not yet optimized, the
experiment validates the viability of our solution.</p>
      <p>Among all implemented functionalities, in Section 3 we focus on the proportion
estimation of categories for sentiment analysis, since quantification for sentiment
analysis is particularly hard to accomplish in near real-time. It
is indeed an example of a complex task that requires many steps of Information
Retrieval and Machine Learning processing. For this reason, we
describe several techniques for category proportion estimation and
compare them. Section 4 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2 A scalable system for near real-time sentiment analysis</title>
      <p>In order to identify requirements for a system enabling near real-time analysis of
phenomena occurring on social networks, we took into consideration two kinds
of end-users: social scientists and data scientists.</p>
      <p>Broadly speaking, a social scientist is a user interested in finding answers to
questions such as: what are the most relevant or recent tweets, how many tweets convey
a positive or negative opinion, which concepts are related to a given topic, how is a
given topic trending, what are the most important topics, and so on. In general,
social scientists interact with the system by submitting several queries that
formalize their information needs, and they empirically evaluate the quality of the answers
provided by the system. The role of social scientist can be played by any user
interested in studying or reporting phenomena of social networks that can be
connected to scientific disciplines such as sociology, psychology, economics, political
science, and so on. In contrast, a data scientist is interested in developing
and improving functionalities for social scientists. More precisely, data scientists
implement machine learning processes and monitor the quality of the
answers provided by the system by means of statistical analyses. Furthermore,
they are in charge of defining and developing new functionalities for reporting,
charting, summarizing, etc.</p>
      <p>The following Section presents the end-user functionalities provided by the
system. They are the result of a user-requirement analysis jointly conducted
by social scientists, data scientists and software engineers.</p>
      <sec id="sec-2-1">
        <title>2.1 End-user functionalities for analytics and sentiment analysis</title>
        <p>From the end-user perspective, a system for near real-time analytics and
sentiment analysis should provide three main classes of functions: search, count and
mining functionalities.</p>
        <p>Given a query, search functionalities consist of a suite of operations used to find:
the most relevant tweets (topic retrieval); the most recent tweets in any interval
of time (topical timeline); a representative sample of tweets conveying opinions
about the topic (topical opinion retrieval); a representative sample of tweets
conveying positive or negative opinions about the topic (polarity-driven topical
opinion retrieval); any mixture of tweets resulting from the combination of the relevance,
time and opinion search dimensions. Search functionalities are used by social
scientists to explore the tweets indexed by the system, to detect emerging
topics, and to discover new keywords or accounts to be tracked on Twitter; on the other
hand, they are used by data scientists to empirically assess the effectiveness of
the system.</p>
        <p>Count functionalities quantify the result-set size of a given query. As a
consequence, they are useful to quantify, for example, the number of positive
tweets related to a given topic. The system offers two main methods for
counting: the exact count, a database-like function returning the exact number
of tweets matching the query, and the estimated count, which statistically estimates
the number of tweets belonging to a given result set. As described in Section 3.1,
there are different strategies to perform the estimated count: for the sake of
exposition, we anticipate that the two main approaches are classify-and-count and
category size estimation.</p>
        <p>Finally, a suite of mining functionalities is available: trending topics, query-related
concept mining, geographic distribution of tweets, most representative users for
a topic, and so on.</p>
        <p>Both count and mining functionalities are mainly used by social scientists for
their studies and reports.</p>
        <p>In the next Section we show how the above-mentioned functionalities can be
implemented by adopting an Information Retrieval approach.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 A search engine based system architecture</title>
        <p>
          The functionalities presented in the previous Section can be implemented by a system
based on a search engine, specifically extended for this purpose. In fact, classic
index structures have to be properly configured to host some additional
information about tweets. Among others, an opinion score, a positive opinion score
and a negative opinion score, computed at indexing time and stored in the index,
enable the implementation of sentiment analysis functionalities. These scores can
be computed by using a dictionary-based approach, as proposed in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], or by
means of an automatic classifier, such as an SVM or a Bayesian classifier. As
described in Section 3, these scores can be used at query time to implement
functionalities such as exact and estimated counting.
        </p>
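        <p>The idea of computing opinion scores once, at indexing time, and only reading them at query time can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two tiny weighted dictionaries, the in-memory index, and the function names index_tweet and exact_count are hypothetical and do not reproduce the data structures of the actual system.</p>
        <preformat>
```python
# Hypothetical weighted sentiment dictionaries; real weights would be learned or curated.
POSITIVE = {"great": 1.5, "win": 1.0, "love": 2.0}
NEGATIVE = {"awful": 2.0, "lose": 1.0, "hate": 2.0}


def index_tweet(index, tweet_id, text):
    """Compute the sentiment scores once, at indexing time, and store them with the tweet."""
    tokens = text.lower().split()
    pos = sum(POSITIVE.get(t, 0.0) for t in tokens)
    neg = sum(NEGATIVE.get(t, 0.0) for t in tokens)
    index[tweet_id] = {
        "tokens": tokens,
        "positive": pos,
        "negative": neg,
        "opinion": pos + neg,   # overall opinion score: any polarity counts
    }


def exact_count(index, term, polarity=None):
    """Database-like count of the tweets containing `term`.

    When `polarity` is "positive", "negative" or "opinion", only tweets whose
    stored score for that field is greater than zero are counted.
    """
    count = 0
    for entry in index.values():
        if term not in entry["tokens"]:
            continue
        if polarity is None or entry[polarity] > 0:
            count += 1
    return count
```
        </preformat>
        <p>At query time no classification is run: the count only reads the stored scores, which is what makes sentiment filters as cheap as any other binary filter in this sketch.</p>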
        <p>
          Furthermore, because the system must scale, index data structures have
to support mechanisms for document or term partitioning [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. In the first case,
documents are partitioned into several sub-collections that are indexed separately;
in the second case, all documents are indexed as a single collection, and then
some data structures (i.e. the lexicon and the posting lists) are partitioned. Even
though the term partitioning approach has some advantages in query processing (e.g.
it makes the routing of queries easier and thus results in a lower utilization of
resources [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]), it does not scale well: for this reason we adopt a document
partitioning approach.
        </p>
        <p>
          Once the partitioning approach has been selected, it becomes crucial to define
a proper document partitioning strategy. We opt for partitioning tweets solely on the
basis of their timestamps: this implies that each index contains all tweets generated
during a certain period of time. In our case this strategy is more convenient than
others [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ],[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ],[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ],[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ],[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], since it suits an unbounded stream
of tweets delivered in chronological order; moreover, it enables the optimization
of query processing when a time-based constraint is specified for the query.
        </p>
        <p>Finally, we have to decide whether to scale up or to scale out
in terms of the number of indexed tweets. In the first case, a multi-index
composed of several shards can be created, updated and used on a single machine:
as a consequence, the time needed to resolve a query depends on the computing
capacity and the main memory available on the machine. In the second case,
each machine of a computer cluster is responsible for a sub-collection and
acts as an indexing and query server: with respect to the time needed to resolve
a query, this solution (referred to as distributed index in the following) exploits the
computing capacity of the entire cluster, but introduces some latency
due to network communications. Interestingly, in both cases it is
possible to define a common set of software components that allow an efficient
implementation of the functionalities presented in Section 2.1. These components, briefly
described here, can be implemented to develop an application based on either a
multi-index or a distributed index:</p>
        <list list-type="order">
          <list-item><p>Global Statistics Manager (GSM). As soon as new incoming tweets are
indexed, the GSM has to update some global statistics, such as the total number
of tweets and tokens. For both the multi-index and the distributed index solution, the
update can simply be performed either at query time or when the
collection changes.</p></list-item>
          <list-item><p>Global Lexicon Manager (GLM). The lexicon data structure contains the list
and statistics of all terms in the collection. Both multi-indexes and distributed
indexes require a manager providing information about terms with respect
to the entire collection. The GLM can rely on a serialized data structure
that is updated every time the collection changes (i.e. a global lexicon), or
it can compute at query time just the global information needed to resolve the
submitted query.</p></list-item>
          <list-item><p>Score Assigner (SA). Any document containing at least one query term is
a candidate for the final result set. The SA assigns a ranking score to
each document to quantify its degree of relevance with respect to the query.
Using the information provided by the GSM and the GLM, the scores of documents
indexed in different shards, or by different query servers, are comparable
because they are computed using global statistics. It is worth noting that opinion scores,
needed by the sentiment analysis functionalities, are computed once and for all at
indexing time, and that they just have to be read from the indexes. In fact, we
assume that the classifier model for sentiment analysis does not change over
time: as a consequence, any change to the global statistics of the collection
does not affect the already computed sentiment scores, and thus their sentiment
classifications.</p></list-item>
          <list-item><p>Global Sorter (S). The top-N results are sorted in descending order of score.</p></list-item>
          <list-item><p>Post Processing Retriever (PPR). A second-pass retrieval, such as query
expansion, can follow the retrieval phase, or a document score modifier can be
applied, such as a mixture of relevance, time and sentiment models.</p></list-item>
          <list-item><p>Post Processing Filter and Entity Miner (EM). Some post-processing
operations can be performed in order to filter the final result set by time, country,
etc., or by sentiment category membership constraints. If the direct index, i.e.
the posting list of the terms occurring in each document, or other additional
data structures are available, text mining operations can also be applied to
the result set, for example the extraction of relevant and trendy concepts,
mentions, or entities related to the query.</p></list-item>
          <list-item><p>Decorator (D). Once the result set is determined and ordered, some efficient
DB-like operations can finally be performed in order to make the results
ready for presentation to the final user (e.g. posting records are decorated
with metadata such as title, timestamp, author, text, etc.).</p></list-item>
        </list>
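        <p>The interplay between time-based shards, global statistics and the final sort can be sketched as follows. This is only an illustration under stated assumptions: the Shard class is an in-memory stand-in for a real index, the TF-IDF-like weighting is a placeholder for whatever retrieval model the Score Assigner uses, and all names are hypothetical.</p>
        <preformat>
```python
import heapq
import math
from dataclasses import dataclass, field


@dataclass
class Shard:
    """One time-partitioned index shard (hypothetical in-memory stand-in)."""
    start: int                                  # first timestamp covered (inclusive)
    end: int                                    # last timestamp covered (inclusive)
    docs: dict = field(default_factory=dict)    # doc_id -> list of tokens

    def doc_freq(self, term):
        return sum(1 for tokens in self.docs.values() if term in tokens)


def global_stats(shards, query_terms):
    """GSM and GLM roles, computed at query time over the whole collection."""
    n_docs = sum(len(s.docs) for s in shards)
    df = {t: sum(s.doc_freq(t) for s in shards) for t in query_terms}
    return n_docs, df


def score(tokens, query_terms, n_docs, df):
    """SA role: toy TF-IDF score (an assumed weighting model) using global statistics."""
    total = 0.0
    for t in query_terms:
        tf = tokens.count(t)
        if tf and df[t]:
            total += tf * math.log(1 + n_docs / df[t])
    return total


def search(shards, query_terms, top_n, since=None, until=None):
    # A time constraint prunes whole shards before any document is scored.
    selected = [s for s in shards
                if (since is None or s.end >= since)
                and (until is None or until >= s.start)]
    # Statistics stay global (all shards), so scores remain comparable across shards.
    n_docs, df = global_stats(shards, query_terms)
    scored = []
    for s in selected:
        for doc_id, tokens in s.docs.items():
            sc = score(tokens, query_terms, n_docs, df)
            if sc > 0:
                scored.append((sc, doc_id))
    # Global Sorter role: top-N in descending order of score.
    return heapq.nlargest(top_n, scored)
```
        </preformat>
        <p>In this sketch the statistics are computed over all shards even when a time filter is active; restricting them to the selected shards would be an equally plausible design and is not prescribed by the text.</p>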
        <p>Table 1 shows which components are involved in the implementation of some
exemplary end-user functionalities. To obtain an efficient implementation of
these functionalities it is crucial to design and implement the listed components
to be as decoupled as possible. It is worth noting that the query result set count
functionality does not depend on any listed component, since it only needs local
postings retrieval operations.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Assessing the performance of a multi-index implementation</title>
        <p>We have developed a multi-index based implementation of the system by adding
new data structures to the Terrier framework. The current version takes
advantage of multi-threading to parallelize, as much as possible, the reading
operations from shards.</p>
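        <p>A minimal sketch of how shard reads might be parallelized with a thread pool is given below. It does not reproduce the Terrier-based implementation: shards are plain dictionaries, the score is a toy term count, and the function names are hypothetical.</p>
        <preformat>
```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query_terms, top_n):
    """Return the local top-N (score, doc_id) pairs of one shard.

    `shard` is a hypothetical dict doc_id -> token list; the score is a toy term count.
    """
    scored = []
    for doc_id, tokens in shard.items():
        s = sum(tokens.count(t) for t in query_terms)
        if s > 0:
            scored.append((s, doc_id))
    return heapq.nlargest(top_n, scored)


def parallel_search(shards, query_terms, top_n, max_workers=4):
    """Fan the query out to all shards in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(
            lambda shard: search_shard(shard, query_terms, top_n), shards))
    # The global top-N is always contained in the union of the local top-Ns.
    return heapq.nlargest(top_n, (hit for part in partials for hit in part))
```
        </preformat>
        <p>In CPython, threads of this kind mainly pay off when shard reads are I/O-bound; the sketch is meant to show the fan-out and merge structure rather than to predict speed-ups.</p>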
        <p>In order to assess the efficiency of our solution, we use a collection containing
more than 153M tweets, written in English, concerning the FIFA 2014 World
Cup (up to mid-July 2014) and football news (up to mid-September 2014). From
June 14 to September 14, a new shard was created daily and added to the
multi-index, regardless of the number of tweets downloaded in the previous 24
hours. The final index contains 76 shards that are unbalanced in terms of the number of
tweets they contain, as shown in Figure 1 (each shard contains about 2M
tweets on average). We have focused our assessment on the ranking functionality: more
precisely, we have used 2127 queries, retrieving an average of about 44,361 tweets
each. Table 2 reports the processing time for each component involved in the
functionality under test. In general, the observed performance meets our expectations;
however, we identify a potential bottleneck in the decoration phase. The decorator
component will have to be carefully developed in the new version of the system
based on a distributed index.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Comparing techniques for category proportion estimation</title>
      <p>
        On Twitter, time and sentiment polarity can be as important as relevance for
ranking documents. Since sentiment polarity detection is a classification task, the IR system
needs to perform both classification and search tasks in one single shot. In order
to obtain a near real-time classification for large data streams, we need to make
some computational approximations and to recover the approximation error by
introducing a supplementary model able to correct the results, for example by
resizing the proportions by estimates of such classification errors [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Finally, we
correct the number of misclassified items by a linear regression model previously
learned on a set of training queries, as presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], or by using an adjusted
classify &amp; count approach (Section 3.1). At query time we just combine scores,
either to aggregate them into estimates of retrieval category sizes or to select and sort
documents by time, relevance and sentiment polarity.
      </p>
      <p>In this Section we report the results of an experimental comparison we conducted on
different techniques for category proportion estimation.</p>
      <sec id="sec-3-1">
        <title>3.1 Category proportion estimation</title>
        <p>Let <inline-formula><tex-math>D = \{D_1, \ldots, D_n\}</tex-math></inline-formula> be a set of mutually exclusive sentiment categories over the
set of tweets W, and let q be a topic (story). The problem of size or proportion
estimation of sentiment categories for a story consists in specifying the distribution
of the categories <inline-formula><tex-math>P(D_i \mid q)</tex-math></inline-formula> over the result set of the story q.</p>
        <p>
          Such an estimation is similar to that conducted within a typical statistical problem
of social sciences, macroeconomics or epidemiological studies. In general, if an
unbiased sample of the population can be selected, then it can be used to estimate
the population categories together with a statistical error due to the size of the
sample. For example, Levy &amp; Kass [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] use the Theorem of Total Probability on
the observed event A to decompose this event over a set of predefined categories.
In Information Retrieval, the observed event A can be, for example, the set of the
posting lists of a story. We also assume that <inline-formula><tex-math>P(A \mid D_i)</tex-math></inline-formula> is obtained from a sample A′ of
A, that is by <inline-formula><tex-math>P(A' \mid D_i)</tex-math></inline-formula>. The problem of estimating the category proportions <inline-formula><tex-math>P(D_i)</tex-math></inline-formula>
is that of determining these probabilities on a sample of observations <inline-formula><tex-math>A' \subset A</tex-math></inline-formula>:
        </p>
        <disp-formula><tex-math>P(A') = \sum_{i=1}^{n} P(A' \mid D_i)\, P(D_i).</tex-math></disp-formula>
        <p>If we monitor the event A′ as the aggregated outcome of all observable items in the
sample, then we may easily rewrite the Theorem of Total Probability in matrix
form as a set of linear equations:</p>
        <disp-formula><tex-math>P(A') = \underbrace{P(A' \mid D)}_{1 \times |D|} \cdot \underbrace{P(D)}_{|D| \times 1}.</tex-math></disp-formula>
        <p>We simply derive the category proportions <inline-formula><tex-math>P(D_i)</tex-math></inline-formula> by solving a system of <inline-formula><tex-math>|D|</tex-math></inline-formula>
linear equations in <inline-formula><tex-math>|D|</tex-math></inline-formula> variables. From now on we denote all probabilities by
<inline-formula><tex-math>P(\cdot \mid q)</tex-math></inline-formula> to recall the dependence of the observables on the result set of the current query
q.</p>
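        <p>To make the last step concrete, the following sketch solves such a system for two categories with plain Gaussian elimination. The likelihood values, the assumption that there are as many observables as categories, and the function name solve_linear_system are illustrative only.</p>
        <preformat>
```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("singular system: observables do not separate the categories")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


# Rows are observables, columns are categories:
# entry [k][i] plays the role of P(A'_k | D_i) (made-up numbers).
likelihood = [[0.8, 0.3],
              [0.2, 0.7]]
observed = [0.5, 0.5]   # P(A'_k) measured on the sample
proportions = solve_linear_system(likelihood, observed)
```
        </preformat>
        <p>With these made-up numbers the two equations give proportions of about 0.4 and 0.6, which sum to one because each column of the likelihood matrix sums to one.</p>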
        <p>When the assignment of the documents of A, or more generally of the observables for
A, to categories is not performed manually but automatically, it is not only
the size of the selected sample A that matters: type I and type II errors
(false positives and false negatives) produced by misclassification become
equally significant. In other words, the accuracy of the classifier also needs to be
known for a correct estimation of all <inline-formula><tex-math>P(D_i \mid q)</tex-math></inline-formula>. If the two types of errors turn
out to be similar in size, then the final counts for category proportions
may still produce a correct answer. More generally, if the observations are given by a
set X of observable variables for the document sample A, then the observables,
and their proportions <inline-formula><tex-math>P(X \mid D)</tex-math></inline-formula>, may be used as a set of training data for a linear
classifier to derive <inline-formula><tex-math>P(D \mid q)</tex-math></inline-formula>:</p>
        <disp-formula><tex-math>\underbrace{P(X \mid q)}_{|X| \times 1} = \underbrace{P(X \mid D, q)}_{|X| \times |D|} \cdot \underbrace{P(D \mid q)}_{|D| \times 1}.</tex-math></disp-formula>
        <p>These equations can thus be solved, for example, by linear regression. The set
of observable variables X can be defined according to several approaches.</p>
        <list list-type="bullet">
          <list-item><p>The classify and count methodology: X is the set of predicted categories <inline-formula><tex-math>\hat{D}_j</tex-math></inline-formula>
of a classifier <inline-formula><tex-math>\hat{D}</tex-math></inline-formula>. Misclassification errors are given by the conditional
probabilities <inline-formula><tex-math>P(\hat{D}_k \mid D_j)</tex-math></inline-formula> when <inline-formula><tex-math>k \neq j</tex-math></inline-formula>. Counting the errors of the classifier on the
training data set, and using these measures to correct the category
proportions, is the basis of the adjusted classify and count approach [
          <xref ref-type="bibr" rid="ref10 ref5 ref6">10, 5, 6</xref>
          ].</p></list-item>
          <list-item><p>The profile sampling approach: X is a random subset of word profiles <inline-formula><tex-math>S_j</tex-math></inline-formula>, where
a profile is a subset of words occurring in the collection. This approach is
the basis of Hopkins &amp; King’s method [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].</p></list-item>
          <list-item><p>The cumulative approach: X is a set of weighted features <inline-formula><tex-math>f_j</tex-math></inline-formula> of a trained
classifier (a weighted sentiment dictionary) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The classifier model can then
be used to score each document in the collection. Unlike Hopkins
&amp; King’s method, which counts occurrences of an unbiased covering set of
profiles for a topic, the classifier approach correlates a cumulative category
score with category proportions for a topic.</p></list-item>
        </list>
        <p><bold>Adjusted classify and count.</bold> The observations A are obtained by a
classifier <inline-formula><tex-math>\hat{D}</tex-math></inline-formula> for the categories D:</p>
        <disp-formula><tex-math>P(\hat{D}_j \mid q) = \sum_{i=1}^{n} P(\hat{D}_j \mid D_i, q)\, P(D_i \mid q), \qquad j = 1, \ldots, n.</tex-math></disp-formula>
        <p>We pool the query results, that is, <inline-formula><tex-math>P(\hat{D}'_j \mid D'_i, q) = P(\hat{D}'_j \mid D'_i)</tex-math></inline-formula> on a training data set
D′ and a set of queries. The estimates derive from this pooled set (i.e. <inline-formula><tex-math>P(D_i) = P(D'_i)</tex-math></inline-formula>)
by solving a simple linear system of <inline-formula><tex-math>|D|</tex-math></inline-formula> equations in <inline-formula><tex-math>|D|</tex-math></inline-formula> variables:</p>
        <disp-formula><tex-math>\underbrace{P(A \mid q)}_{|D| \times 1} = \underbrace{P(A' \mid D)}_{|D| \times |D|} \cdot \underbrace{P(D \mid q)}_{|D| \times 1}.</tex-math></disp-formula>
        <p>The methodology is automatic and supervised, and therefore does not need to
start over at each query. The accuracy of the classifier does not matter, since the
misclassification errors are used for the estimation of category sizes. On the other
hand, not being based on a query-by-query learning model, it does not achieve as
high a precision as the manual evaluation of Hopkins &amp; King’s method.</p>
        <p><bold>Hopkins &amp; King’s method.</bold> Let S′ be a sample of profiles of words of the
vocabulary V, that is <inline-formula><tex-math>S' \subset S = 2^{V}</tex-math></inline-formula>, able to cover the space of events well enough,
and let A be the set of relevant documents for a topic q. Let us assess the sentiment
polarities of a sample A′ of A. About 500 evaluated documents will suffice for a
statistically significant test. The partition of A′ over the categories D yields
the statistics for the occurrences of S′ in each category, and these proportions
are used to estimate <inline-formula><tex-math>P(A \mid D, q)</tex-math></inline-formula>. P(A) will instead be estimated by P(S′), that is
the total number of occurrences of the word profiles of S′ in the sample A′ with
respect to all word profiles occurring in A′.</p>
        <p>The category proportions <inline-formula><tex-math>P(D \mid q)</tex-math></inline-formula> are estimated as the coefficients of the linear
regression model</p>
        <disp-formula><tex-math>\underbrace{P(A \mid q)}_{|A| \times 1} = \underbrace{P(A \mid D, q)}_{|A| \times |D|} \cdot \underbrace{P(D \mid q)}_{|D| \times 1}.</tex-math></disp-formula>
        <p>This is not a supervised methodology, as it would be with an automated
classifier. It is based on counting word profiles from a covering sample. The advantage
is a statistically significant, high accuracy (almost 99%, see Table 3). However,
there are many drawbacks. The methodology needs to start over at each query, and
to achieve such a high accuracy, a long and costly human evaluation
of documents is required. Word profile counting is also complex, since
profiles are arbitrary subsets of a very large dictionary, and data are very sparse
in Information Retrieval. Moreover, the query-by-query linear regression
learning model is time consuming. In conclusion, this method is not based on a
supervised learning model, but is essentially driven by a manual process, and
linear regression and word profile counting are just used to smooth the maximum
likelihood category estimators.</p>
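        <p>Returning to the adjusted classify and count approach, its correction step can be sketched for the binary case, where the linear system reduces to a single equation. The label names, the pooled training data and the function names are assumptions made for illustration; the multi-category case would solve the full system instead.</p>
        <preformat>
```python
def confusion_rates(true_labels, predicted_labels, positive="pos"):
    """Estimate P(pred=pos | true=pos) and P(pred=pos | true!=pos) on pooled training data."""
    pos_total = pos_hit = neg_total = neg_hit = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            pos_total += 1
            pos_hit += pred == positive
        else:
            neg_total += 1
            neg_hit += pred == positive
    return pos_hit / pos_total, neg_hit / neg_total


def adjusted_classify_and_count(predicted_labels, tpr, fpr, positive="pos"):
    """Binary ACC: invert observed = tpr * p + fpr * (1 - p) for the true proportion p."""
    observed = sum(pred == positive for pred in predicted_labels) / len(predicted_labels)
    if tpr == fpr:
        raise ValueError("classifier carries no information about the category")
    p = (observed - fpr) / (tpr - fpr)
    # Sampling noise can push the estimate slightly outside [0, 1], so clip it.
    return min(1.0, max(0.0, p))
```
        </preformat>
        <p>The plain classify and count estimate is the observed fraction itself; the adjustment rescales it with error rates measured once on the training pool, which is why nothing has to be relearned for a new query.</p>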
        <p>
          <bold>Cumulative approach.</bold> The cumulative approach is a supervised learning
technique that uses linear regression to predict and smooth a
sentiment category size on the basis of a cumulative score of documents [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The
approach is information theoretic: for each category, the set F of features
is made up of the most informative terms or, equivalently, the terms with the highest
coding cost in that category. Unlike Levy &amp; Kass-Forman’s
misclassification recovery model, there is no pipeline of computational processes to
perform, namely classifying, then counting, and finally adjusting the category
sizes with the number of estimated misclassified items. The
cumulative approach simply correlates the category size with the total number of
bits used to code the occurring category features. Since information is additive,
the linear regression model is the natural choice to set up such a correlation
over a set of features spanning a set of training queries. As with the
adjusted classify and count approach, the precision of this methodology is high and
is reported in Section 3.2.
        </p>
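        <p>The correlation between a cumulative score and a category size can be illustrated with a one-variable least-squares fit. The training pairs below are invented, and a single cumulative score per query is a simplification of the feature-based regression described above; the function names are hypothetical.</p>
        <preformat>
```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cumulative_score(doc_scores):
    """Sum the per-document category scores stored in the index at indexing time."""
    return sum(doc_scores)


# Hypothetical training queries: cumulative negative score of the result set
# (in bits) and the manually counted number of negative tweets.
train_bits = [120.0, 260.0, 410.0, 530.0]
train_sizes = [30.0, 65.0, 102.5, 132.5]
slope, intercept = fit_line(train_bits, train_sizes)

# Estimating the category size for a new query only needs the stored scores.
new_query_doc_scores = [1.5, 0.0, 3.0, 2.5, 1.0]
estimated_size = slope * cumulative_score(new_query_doc_scores) + intercept
```
        </preformat>
        <p>Once the line is fitted, the estimate for a new query costs one pass over stored scores, with no classification or counting step at query time.</p>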
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Experimentation</title>
        <p>To assess the effectiveness of the classifier-based quantification, we have built
an annotated corpus composed of 6305 tweets manually classified on the basis
of the opinion they contain. More precisely: 1358 tweets were classified as positive
(i.e. containing a positive opinion), 2293 as negative (i.e. containing a negative
opinion), 382 as mixed (i.e. containing both positive and negative opinions), 1959
as neutral (i.e. not containing opinions), and 313 as not classifiable.</p>
        <p>
          We have run two sets of experiments. We have first applied a statistical technique to smooth
the proportions obtained from a manual assessment of a document sample. This experiment is
essentially manual because it requires a training set for each query. For each query,
instead of the word profiles used in the method proposed by Hopkins &amp; King, we have
used two standard classifiers (Multinomial Naive Bayes, MNB, and SVM with a
linear kernel), and the adjusted classify &amp; count (ACC) as the maximum likelihood
estimate smoothing technique. However, Hopkins &amp; King’s results are hardly
reproducible, since the set of admissible profiles is generated by a complex feature
selection, and a portion of the negative examples is also removed from the
training set of the query. Indeed, these profiles are generated by an adaptation of the
technique by King and Lu [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which randomly chooses subsets of between
approximately 5 and 25 words as admissible profiles. This number of words is determined
empirically through cross-validation within the labeled set. Therefore, in Table 3 we compare
our results with those of their method only as reported in their
paper [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>Table 4 shows that the supervised methods with the adjusted classify &amp; count
(ACC) technique achieve a very high precision (96.63%-97.86%), i.e. a Mean
Absolute Proportion error similar to that of Hopkins &amp; King, with a supervised
learning process that is not tailored to a single query, but trained over a set
of about 30 queries with 5-fold cross validation. The difference between the Mean
Absolute Proportion error for 30 queries produced by a search-like classification
process and that of the Hopkins &amp; King method with a single query is minimal
and not statistically significant.</p>
        <p>
          These first outcomes in Table 4 show that standard supervised classification
methods can be effectively applied, and quickly implemented, for quantification in
sentiment analysis of new queries. The second experiment, in Table 5, indeed shows
that ACC smoothing with classifiers is a fully automated supervised
method that performs as well on new queries as the manual classification of
Hopkins &amp; King’s approach (HKA) does on a single query. The classifiers were trained using a set of about 30 queries
and 6-fold cross validation, where each test set has new documents coming from
the result sets of the new queries (Out-Query-Sample Cross Validation). We also
report the sample fit for each fold (In-Query-Sample Fit Cross-Validation), which
shows an almost perfect category counting with the SVM classifier.
Notice that the Classify &amp; Count process (CC) is much less prone to error than
the individual classification accuracy would suggest, because of a possible error type balancing
effect (see Table 5). However, there is no correlation between individual
classification accuracy and the Mean Absolute Error Rate of the CC process, so the CC
approach can never be considered a reliable or statistically significant estimation.
Finally, the cumulative approach achieves high effectiveness (Multiple R-squared
is 0.9781 for the negative category with 5-fold cross validation on the same set of
queries) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusion</title>
      <p>This paper reported some experiences gained during the development of a
scalable system for real-time analytics on social networks.</p>
      <p>We have presented some architectural components, resulting from the
analysis of a typical querying process, that can be used to implement several
functionalities of the system. These components can be adopted for developing both a
multi-index and a distributed-index implementation of the system. We also
identified a potential bottleneck in the decoration phase: the related component has to
be carefully developed in the distributed version of the system.</p>
      <p>Furthermore, we have shown how to estimate real-time document category
proportions for topical opinion retrieval for big data. Outcomes are produced either
by a direct count or by estimation of category sizes based on a supervised
automated classification with a smoothing technique to recover the number of
misclassified documents. The use of MNB and SVM classifiers, or of information-based
dictionaries, to estimate category proportions is highly effective and achieves
almost perfect accuracy if a training phase on the query is also performed.
Search, classification and quantification for analytics can thus be conducted effectively
in real time.</p>
      <p>Acknowledgement: Work carried out under the Research Agreement between
Almawave and Fondazione Ugo Bordoni.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Giambattista</given-names>
            <surname>Amati</surname>
          </string-name>
          , Marco Bianchi, and
          <string-name>
            <given-names>Giuseppe</given-names>
            <surname>Marcone</surname>
          </string-name>
          .
          <article-title>Sentiment estimation on twitter</article-title>
          .
          In <source>IIR</source>
          , pages
          <fpage>39</fpage>
          -
          <lpage>50</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Badue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ziviani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Ziviani</surname>
          </string-name>
          .
          <article-title>Analyzing imbalance among homogeneous index servers in a web search system</article-title>
          . Inf. Process. Manage.,
          <volume>43</volume>
          (
          <issue>3</issue>
          ):
          <fpage>592</fpage>
          -
          <lpage>608</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Berthier</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          , et al.
          <source>Modern Information Retrieval</source>
          , volume
          <volume>463</volume>
          . ACM Press, New York,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Callan</surname>
          </string-name>
          .
          <article-title>Distributed Information Retrieval</article-title>
          . In <source>Advances in Information Retrieval</source>, pages
          <fpage>127</fpage>
          -
          <lpage>150</lpage>
          . Kluwer Academic Publishers,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>George</given-names>
            <surname>Forman</surname>
          </string-name>
          .
          <article-title>Counting positives accurately despite inaccurate classification</article-title>
          .
          In João Gama, Rui Camacho, Pavel Brazdil, Alípio Jorge, and Luís Torgo, editors,
          <source>ECML</source>
          , volume
          <volume>3720</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>564</fpage>
          -
          <lpage>575</lpage>
          . Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>George</given-names>
            <surname>Forman</surname>
          </string-name>
          .
          <article-title>Quantifying counts and costs via classification</article-title>
          .
          <source>Data Min. Knowl. Discov.</source>
          ,
          <volume>17</volume>
          (
          <issue>2</issue>
          ):
          <fpage>164</fpage>
          -
          <lpage>206</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Hopkins</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gary</given-names>
            <surname>King</surname>
          </string-name>
          .
          <article-title>A method of automated nonparametric content analysis for social science</article-title>
          .
          <source>American Journal of Political Science</source>
          ,
          <volume>54</volume>
          (
          <issue>1</issue>
          ):
          <fpage>229</fpage>
          -
          <lpage>247</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Gary</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ying</given-names>
            <surname>Lu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kenji</given-names>
            <surname>Shibuya</surname>
          </string-name>
          .
          <article-title>Designing verbal autopsy studies</article-title>
          .
          <source>Population Health Metrics</source>
          ,
          <volume>8</volume>
          (
          <issue>1</issue>
          ),
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Leah S.</given-names>
            <surname>Larkey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Margaret E.</given-names>
            <surname>Connell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Callan</surname>
          </string-name>
          .
          <article-title>Collection Selection and Results Merging with Topically Organized U.S. Patents and TREC Data</article-title>
          .
          In <source>CIKM 2000</source>
          , pages
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          . ACM Press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Levy</surname>
          </string-name>
          and
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Kass</surname>
          </string-name>
          .
          <article-title>A three-population model for sequential screening for bacteriuria</article-title>
          .
          <source>American J. of Epidemiology</source>
          ,
          <volume>91</volume>
          (
          <issue>2</issue>
          ):
          <fpage>148</fpage>
          -
          <lpage>54</lpage>
          ,
          <year>1970</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Xiaoyong</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Cluster-based retrieval using language models</article-title>
          .
          In <source>SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval</source>
          , pages
          <fpage>186</fpage>
          -
          <lpage>193</lpage>
          , New York, NY, USA,
          <year>2004</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Craig</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , Iadh Ounis, and
          <string-name>
            <given-names>Ian</given-names>
            <surname>Soboroff</surname>
          </string-name>
          .
          <article-title>Overview of the TREC 2007 blog track</article-title>
          . In Ellen M. Voorhees and Lori P. Buckland, editors,
          <source>TREC</source>
          , volume Special Publication 500-274. National Institute of Standards and Technology (NIST),
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Alistair</given-names>
            <surname>Moffat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>William</given-names>
            <surname>Webber</surname>
          </string-name>
          , Justin Zobel, and
          <string-name>
            <given-names>Ricardo B.</given-names>
            <surname>Yates</surname>
          </string-name>
          .
          <article-title>A pipelined architecture for distributed text query evaluation</article-title>
          .
          <source>Inf. Retr.</source>
          ,
          <volume>10</volume>
          (
          <issue>3</issue>
          ):
          <fpage>205</fpage>
          -
          <lpage>231</lpage>
          ,
          June <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Iadh</given-names>
            <surname>Ounis</surname>
          </string-name>
          , Maarten de Rijke, Craig Macdonald, Gilad Mishne, and
          <string-name>
            <given-names>Ian</given-names>
            <surname>Soboroff</surname>
          </string-name>
          .
          <article-title>Overview of the TREC-2006 blog track</article-title>
          .
          In <source>Text Retrieval Conference</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Diego</given-names>
            <surname>Puppin</surname>
          </string-name>
          , Fabrizio Silvestri, and
          <string-name>
            <given-names>Domenico</given-names>
            <surname>Laforenza</surname>
          </string-name>
          .
          <article-title>Query-driven document partitioning and collection selection</article-title>
          .
          In <source>InfoScale '06: Proceedings of the 1st international conference on Scalable information systems</source>
          . ACM Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>