=Paper= {{Paper |id=None |storemode=property |title=A Study on Evaluation on Opinion Retrieval Systems |pdfUrl=https://ceur-ws.org/Vol-560/paper12.pdf |volume=Vol-560 |dblpUrl=https://dblp.org/rec/conf/iir/AmatiACGG10 }} ==A Study on Evaluation on Opinion Retrieval Systems== https://ceur-ws.org/Vol-560/paper12.pdf
         A study on evaluation on opinion retrieval systems

Giambattista Amati, Fondazione Ugo Bordoni, Rome, Italy (gba@fub.it)
Giuseppe Amodeo, Dept. of Computer Science, University of L'Aquila, L'Aquila, Italy (gamodeo@fub.it)
Valerio Capozio, Dept. of Mathematics, University of Rome "Tor Vergata", Rome, Italy (valeriocapozio@gmail.com)
Carlo Gaibisso, Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti" - CNR, Rome, Italy (carlo.gaibisso@iasi.cnr.it)
Giorgio Gambosi, Dept. of Mathematics, University of Rome "Tor Vergata", Rome, Italy (gambosi@mat.uniroma2.it)

ABSTRACT
We study the evaluation of opinion retrieval systems. Opinion retrieval is a relatively new research area; nevertheless, classical evaluation measures adopted for ad hoc retrieval, such as MAP, precision at 10, etc., have been used to assess the quality of rankings. In this paper we investigate the effectiveness of these standard evaluation measures for topical opinion retrieval. In doing this we separate the opinion dimension from the relevance one and use opinion classifiers, with varying accuracy, to analyse how opinion retrieval performance changes by perturbing the outcomes of the opinion classifiers. Classifiers can be studied in two modalities, that is, either to re-rank or to filter out directly the documents obtained through a first relevance retrieval. In this paper we formally outline both approaches, while for now focusing on the filtering process.
   The proposed approach aims to establish the correlation between the accuracy of the classifiers and the performance of the topical opinion retrieval. In this way it will be possible to assess the effectiveness of the opinion component by comparing the effectiveness of the relevance baseline with that of the topical opinion.

Categories and Subject Descriptors
H.3.0 [Information Storage and Retrieval]: General; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Theory, Experimentation

Keywords
Sentiment Analysis, Opinion Retrieval, Opinion Finding, Classification

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy.
http://ims.dei.unipd.it/websites/iir10/index.html
Copyright owned by the authors.

1.   INTRODUCTION
   Sentiment analysis aims at classifying documents according to the opinions, sentiments or, more generally, subjective features contained in their text. The study and evaluation of efficient solutions to detect sentiments in text is a popular research area, and different techniques have been applied, coming from natural language processing, computational linguistics, machine learning, information retrieval and text mining.
   The application of sentiment analysis to Information Retrieval goes back to the novelty track of TREC 2003 [13]. Topical opinion retrieval is also known as opinion retrieval or opinion finding [4, 9, 11]. In [5, 3, 2, ?] dictionary-based methodologies for topical opinion retrieval are proposed. An application of opinion finding to blogs was introduced in the Blog Track of TREC 2006 [8]. However, there is not yet a comprehensive study of the evaluation of topical opinion systems, and in particular of the interaction and correlation between relevance and sentiment assessments.
   At first glance, the evaluation of opinion retrieval systems seems not to deserve any further investigation or extra effort with respect to the evaluation of conventional retrieval systems. Traditional evaluation measures, such as the Mean Average Precision (MAP) or the precision at 10 [8, 6, 10, 11], can still be used to evaluate rankings of opinionated documents that are also assessed to be relevant to a given topic. However, if we take a deeper look at the performance of topical opinion systems we are struck by the diversity in the observed values of performance. For example, the best run for topic relevance in the blog track of TREC 2008 [10] achieves a MAP value equal to 0.4954, which drops to 0.4052, as concerns the MAP of opinion, in the opinion finding task. Performance degradation is expected, because any variable which is additional to relevance, i.e. the opinion one, must deteriorate the system performance. However, we do not yet have a way to set apart the effectiveness of the opinion detection component and evaluate how effective it is, or to determine whether, and to which extent, the relevance and opinion detection components are influenced by each other. It seems evident that an evaluation methodology, or at least some benchmarks, are needed to make it possible to assess how effective the opinion component is.
To exemplify: how effective is an opinion MAP of 0.4052 when we start from an initial relevance MAP of 0.4954? It is indeed a matter of fact that the opinion MAP in TREC [8, 6, 10] seems to be highly dependent on the relevance MAP of the first-pass retrieval [9].
   The general issue is thus the following: can we assume that absolute values of MAP can be used as they are to compare different tasks, in our case the topical opinion and the ad hoc relevance tasks? And, consequently, can evaluation measures be used without any MAP normalization to compare or to assess the state of the art of different techniques on opinion finding?
   To this aim, we introduce a completely novel methodological framework which:

   • provides a bound for the best achievable opinion MAP, for a given relevance document ranking;

   • predicts the performance of topical opinion retrieval given the performance of the topic retrieval and opinion detection;

   • vice versa, establishes whether a given opinion detection technique gives a significant or marginal contribution to the state of the art;

   • investigates the robustness of evaluation measures for opinion retrieval effectiveness;

   • indicates which re-ranking or filtering strategy is best suited to improve topical retrieval by opinion classifiers.

   This paper is organized as follows. The proposed evaluation method is presented in sections 2 and 4; section 3 introduces the collection used for tests. Results are presented in section 5, and conclusions follow in section 6.
2.   EVALUATION APPROACH
   An opinion retrieval system is based on a topic retrieval and an opinion detection subsystem [9]: different kinds of "information" are retrieved and weighted in order to generate a final ranking of documents that reflects their relevance with respect to both topic and opinion content. To analyse the effectiveness of the whole system, we should be able to quantify not only the performance of the final result, but also the contribution of each subsystem. As usual, the evaluation metric used in the literature for the final ranking is the MAP. But the MAP (of relevance and opinion) of the final ranking is not sufficient to fully assess the performance of the whole system: the contribution of each component, taken separately, needs to be identified.
   The input to the proposed topical opinion evaluation process is the relevance baseline, i.e. the ranking of documents generated by the topic retrieval system, here considered as a black box. The effectiveness of the topic retrieval component is measured by the MAP of opinion and relevance of this baseline.
   The evaluation of the effectiveness of the opinion detection component relies on artificially defined classifiers of opinion. The artificial classifier C_O^k classifies documents as opinionated, O, or not opinionated, Ō, with accuracy k, 0 ≤ k ≤ 1. The classification process is independent of the topic relevance of documents. To achieve accuracy k, C_O^k properly classifies each document with probability k. Therefore the number of misclassified documents is (1 − k) · n, where n is the number of classified documents. Assuming independence between opinion and relevance, the misclassified documents will be distributed randomly between relevant and not relevant documents.
   The outcomes of these artificial classifiers are then used to modify the baseline. This can be done following two different approaches:

   • a filtering process: when documents of the baseline are deemed as not opinionated by the classifier, they are removed from the ranking;

   • a re-ranking process: when documents of the baseline are considered as opinionated by the classifier, they receive a "reward" in their rank.

   The filtering process uses the classifier in its classical meaning. This process is particularly suitable for analysing the effectiveness of the opinion detection technique itself, as a classification task [12], and its effects on topical opinion performance. Opinion filtering also gives some interesting clues on the optimal performance achievable by an opinion retrieval technique based on filtering, and on whether the filtering strategy is in general superior or not to even very simple re-ranking strategies.
   In the re-ranking process a "reward" function for the documents has to be defined. In such a case we introduce a bias in assigning correct rewards, and we may thus observe the effectiveness of a re-ranking algorithm as the opinion detection performance changes.
   By "comparing" the results of an opinion retrieval system with the filtering process, or with the re-ranking process, at several levels of accuracy, we can obtain relevant clues about:

   • the overall contribution introduced by the opinion system only and its robustness;

   • the effectiveness of the opinion detection component.

   In the following we formally describe both approaches and focus on the experimentation concerning the filtering process only.
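   To make the behaviour of the artificial classifier concrete, the following Python sketch simulates C_O^k on a single document: the true opinion label is returned with probability k and flipped otherwise. It is only an illustration of the definition above; all function and variable names are ours and do not come from the paper.

      import random

      def artificial_classifier(is_opinionated, k, rng=random):
          """Simulate C_O^k: return the true opinion label with probability k,
          the wrong one with probability 1 - k (accuracy k, 0 <= k <= 1)."""
          if rng.random() < k:
              return is_opinionated        # correct classification
          return not is_opinionated        # misclassification

      # With k = 0.8, about 20% of truly opinionated documents are mislabelled.
      predictions = [artificial_classifier(True, 0.8) for _ in range(10000)]
      print(sum(predictions) / len(predictions))   # close to 0.8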
3.   EXPERIMENTATION ENVIRONMENT
   We used the BLOG06 [7] collection and the data sets of the Blog Track of TREC 2006, 2007 and 2008 [8, 6, 10] for our experimentation. Since 2006, the Blog Track has been an evaluation track on blogs whose main task is opinion retrieval, that is the task of selecting the opinionated blog posts relevant to a given topic [9]. The BLOG06 collection size is 148 GB, and the collection contains spam as well as possibly non-blog and non-English pages.
   The data set consists of 150 topics and a list, the Qrels, in which the relevance and opinion content of documents are assessed with respect to each topic. An item in the list identifies a topic t, a document d and a judgement of relevance/opinion assigned as follows:

   • 0 if d is not relevant with respect to t;

   • 1 if d is relevant to t, but does not contain comments on t;

   • 2 if d is relevant to t and contains positive comments on t;

   • 3 if d is relevant to t and contains neutral comments on t;

   • 4 if d is relevant to t and contains negative comments on t.

   Note that not relevant documents are not classified according to their opinion content.
   In the following, [x] denotes the set of documents labelled by x = 0, 1, 2, 3, 4; documents that are not labelled belong to [0] by default.
   TREC organizers also provide the best five baselines, produced by some participants, denoted by BL1, BL2, ..., BL5.
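   As an illustration of how these judgements can be turned into the labels used in the rest of the paper, the sketch below maps each Qrels entry to a (relevant, opinionated) pair. The assumed line layout (topic first, document and judgement as the last two fields) and all names are ours, not part of the original paper; consult the TREC Blog Track documentation for the exact file format.

      def parse_qrels(path):
          """Map each (topic, document) pair to (relevant, opinionated) booleans,
          following the 0-4 judgement scale described above."""
          labels = {}
          with open(path) as qrels:
              for line in qrels:
                  fields = line.split()
                  if not fields:
                      continue
                  topic, docno, judgement = fields[0], fields[-2], int(fields[-1])
                  relevant = judgement >= 1        # classes [1], [2], [3], [4]
                  opinionated = judgement >= 2     # classes [2], [3], [4]
                  labels[(topic, docno)] = (relevant, opinionated)
          return labels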
4.   EVALUATION FRAMEWORK
   The behaviour of the artificial classifier C_O^k is defined through the Qrels. C_O^k predicts the right opinion orientation of each document in the collection by searching for it in the Qrels. The accuracy k is simulated by the introduction of a bias in the classification. Documents not appearing in the Qrels, or assessed as not relevant, are classified according to the probability distribution of opinionated and not opinionated documents among the relevant ones. Taking into account both relevance and opinion in the test collection, we obtain the contingency Table 1.

              O                    Ō
   R          |[2] ∪ [3] ∪ [4]|    |[1]|
   R̄          NA                   NA

Table 1: the contingency table for an opinion-only classifier for documents in the BLOG06 collection. R denotes relevance, R̄ non-relevance; O denotes opinion, Ō non-opinion. With the notation [x] we refer to the class of documents labelled by x = 1, 2, 3, 4 in the Qrels.

   As shown in Table 1, the Qrels does not provide the opinion classes for not relevant documents. The missing data complicate, although not much, the construction of our classifiers. To overcome the problem, we assume that

   Pr(O|R) = Pr(O|R̄)                                        (1)

Equation 1 asserts that there is no sufficient reason to assume a different distribution of opinion among relevant and not relevant documents. The a priori probability Pr(O) of opinionated documents is still unknown. However, equation 1 implies that O and R are independent, thus

   Pr(O|R) = Pr(O)                                           (2)

From equations 1 and 2 it follows that

   Pr(Ō|R) = Pr(Ō|R̄) = Pr(Ō) = 1 − Pr(O)                     (3)

   Equations 2 and 3 are equivalent to assuming that the set {[2] ∪ [3] ∪ [4]}, as defined in Table 1, is a sample of the set of opinionated documents. Thus, without loss of generality, we can define Pr(O) using only the documents classified as relevant by the Qrels, as follows:

   Pr(O) = |[2] ∪ [3] ∪ [4]| / |[1] ∪ [2] ∪ [3] ∪ [4]|       (4)

and consequently

   Pr(Ō) = 1 − Pr(O) = |[1]| / |[1] ∪ [2] ∪ [3] ∪ [4]|       (5)
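   As a concrete reading of equations (4) and (5), the sketch below estimates Pr(O) and Pr(Ō) from the (relevant, opinionated) labels produced by the parse_qrels sketch in section 3. It is an illustration only; the helper name and data layout are assumptions of ours.

      def opinion_prior(labels):
          """Equation (4): Pr(O) is the fraction of relevant documents
          (classes [1]..[4]) that are opinionated (classes [2]..[4])."""
          flags = [opinionated for (relevant, opinionated) in labels.values() if relevant]
          p_o = sum(flags) / len(flags)
          return p_o, 1.0 - p_o      # Pr(O) and its complement, equation (5)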
   In the following we study whether and how the set of relevant and not relevant documents classified as opinionated affects the topical opinion ranking.
   We have to say that for both approaches, filtering and re-ranking, a misclassification may have contrasting effects on the effectiveness of the final ranking. If we filter documents by opinion with a classifier, for example, the misclassified and removed not relevant documents may bring a positive contribution to the precision measures, because all the opinionated and relevant documents that were below them will have a higher rank after their removal. Even with the re-ranking approach we have a similar situation, but this precision boosting phenomenon is attenuated by the fact that re-ranking is not based on a decision as drastic as a removal, and the repositioning of a document does not propagate to all documents that are below it in the original ranking.
   Together with C_O^k, we introduce a random classifier C_O^RC that classifies documents according to the a priori distribution of opinionated documents in the collection. It represents a good approximation of the random behaviour of a classifier. More precisely, this classifier assesses a document as opinionated with probability Pr(O) and as not opinionated with probability Pr(Ō) = 1 − Pr(O).

4.1   Filtering approach
   As already stated, in the filtering approach documents classified as not opinionated are removed from the baseline. Note that while relevant documents, if correctly classified, contribute to and improve the evaluation measure, the not relevant ones do not contribute directly to this measure.
   Consequently, if a not relevant document is classified as opinionated while not being actually opinionated, this misclassification will not affect the evaluation measure. On the contrary, the removal of not relevant documents, regardless of their real opinion orientation, always positively affects the ranking, even if they are misclassified.
   For relevant documents, instead, a misclassification always negatively affects the ranking.
   With this approach we can observe how hard it is to overcome the baseline, i.e. we can identify how effective the opinion detection technique must be to improve on the starting topic retrieval.
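   A minimal sketch of this filtering process, built on the previous snippets and, like them, only an illustration with names of our own choosing: judged documents are classified correctly with probability k, documents without an opinion label are labelled as opinionated with the prior probability Pr(O), and documents deemed not opinionated are dropped from the baseline ranking.

      import random

      def filter_baseline(ranked_docs, true_opinion, k, p_o, rng=None):
          """Filter a relevance baseline with a simulated classifier of accuracy k.
          ranked_docs: docnos in decreasing order of relevance score.
          true_opinion: docno -> bool for documents with an opinion label in the Qrels.
          p_o: prior probability Pr(O) used for documents without an opinion label."""
          rng = rng or random.Random(0)
          kept = []
          for doc in ranked_docs:
              if doc in true_opinion:
                  correct = rng.random() < k
                  predicted = true_opinion[doc] if correct else not true_opinion[doc]
              else:
                  predicted = rng.random() < p_o   # unjudged / not relevant documents
              if predicted:
                  kept.append(doc)                 # keep only documents deemed opinionated
          return kept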
4.2   Re-ranking approach
   Re-ranking techniques are essentially fusion models [9] that combine a relevance score s_R(d) and an opinion score s_O(d) (or two ranks derived from these scores) for a document d. The new score s_OR(d) is a function of the two non-negative scores s_R(d) and s_O(d):

   s_OR(d) = f(s_R(d), s_O(d))                               (6)

   Given a classifier C_O^k, we define a new score s_OR^C(d), based on the outcomes of C_O^k, according to which the baseline is re-ranked. s_OR^C(d) is defined as follows:

   s_OR^C(d) = f(s_R(d), s_O(d))   if d ∈_{C_O^k} O
   s_OR^C(d) = f(s_R(d), 0)        if d ∉_{C_O^k} O          (7)

where ∈_{C_O^k} denotes the classifier outcome, that is, the document is assigned by C_O^k to the given class. Note that when k = 100%, and assuming that f(·, ·) is a non-decreasing function of s_O(·), i.e. f(s_R(d), x) ≥ f(s_R(d), x′) for all x ≥ x′, the opinion MAP of any ranking based on s_OR(·) does not exceed that based on s_OR^C(·).
   All the above considerations can be further extended to the case in which s_OR(d) is based on the ranks of d instead of on its scores (of relevance and opinion).
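   To make equation (7) concrete, the sketch below instantiates the fusion function f with a simple weighted sum. The paper does not prescribe a specific f; the convex combination, the weight alpha and all names are assumptions of this illustration only.

      def fused_score(s_r, s_o, alpha=0.5):
          """One possible non-decreasing fusion function f(s_R, s_O):
          a convex combination of the relevance and opinion scores."""
          return (1.0 - alpha) * s_r + alpha * s_o

      def rerank_score(doc, s_r, s_o, classified_opinionated, alpha=0.5):
          """Equation (7): the opinion score contributes only when the
          classifier C_O^k assigns the document to the opinionated class O."""
          if classified_opinionated(doc):
              return fused_score(s_r(doc), s_o(doc), alpha)
          return fused_score(s_r(doc), 0.0, alpha)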
5.   EXPERIMENTATION RESULTS
   In this paper we report the experimentation results for the filtering approach. The filtering process has been repeated 20 times for each baseline and for accuracy k = 0.5, 0.6, 0.7, 0.8, 0.9, 1. Mean values of the MAPs are reported.
   Table 2 reports, in decreasing order, the relevance MAPs (MAP_R) and the opinion MAPs (MAP_O) of each baseline.

   Baseline    MAP_R     MAP_O
   BL4         0.4776    0.3542
   BL5         0.4247    0.2974
   BL3         0.4079    0.3007
   BL1         0.3540    0.2470
   BL2         0.3382    0.2657

Table 2: MAP of relevance (MAP_R) and opinion (MAP_O) of the five baselines.
   In Figure 1 the MAP values are reported for each baseline as the accuracy of the classifiers changes. The dotted lines represent the baseline opinion MAPs and the dot-dashed lines represent the baseline relevance MAPs. The MAP values of the random classifier are also reported, as dashed lines, in the graphs.

[Figure 1: MAPs of opinion of the baselines filtered by C_O^k for k = 0.5, 0.6, 0.7, 0.8, 0.9, 1. The opinion MAPs (dotted lines) and relevance MAPs (dot-dashed lines) of the baselines are also reported. Finally, dashed lines show the opinion MAPs of the baselines filtered by C_O^RC.]

   Analysing the MAP trend we can infer the following observations:

   1. the baseline MAP_R is an upper bound for the MAP_O obtained with a filtering approach;

   2. the random classifier always deteriorates the performance of the baseline MAP_O;

   3. the minimal accuracy needed to improve the baseline MAP_O by filtering is very high, at least 80%;

   4. there is a linear correlation between the MAP_O achievable by a classifier with accuracy k and the accuracy itself.

   The first three remarks say that the filtering strategy is very dangerous for MAP_O performance, that is, removing documents greatly affects the performance of the topical opinion retrieval.
   From the above considerations, we may conclude that the opinion retrieval task is not easy and that obtaining good results with a filtering approach requires a very high accuracy. The experimentation, instead, allows us to identify a plausible range for the MAP achievable by an opinion retrieval system: the classifier with accuracy 100% and the random classifier obtain performances that can be considered as thresholds for the best and the worst opinion detection systems. It is also evident that the higher the baseline MAP is, the higher the accuracy of the classifier must be to introduce some benefit with a filtering approach with respect to relevance-only retrieval.

6.   CONCLUSIONS AND FUTURE WORK
   The opinion retrieval problem seems to be a relatively hard task: the combination of two variables like topic relevance and opinion requires a deep analysis of their correlation. From the results of the TREC competitions [8, 6, 10, 9] emerges the lack of exhaustive evaluation measures: MAP, precision at 10 and R-precision alone are not sufficient to give a complete analysis of the systems' performances.
   Up to now we have studied only the filtering of documents by opinions. This strategy, however, requires a very high accuracy of the classification. We will complete the study with the re-ranking approach, starting from the approach used in [1, 2].
   Our approach is able to provide an indicative accuracy of the opinion component of the topical opinion retrieval system. It also allows us to propose an evaluation framework able to evaluate the effectiveness of opinion retrieval systems.

7.   REFERENCES
[1] G. Amati, E. Ambrosi, M. Bianchi, C. Gaibisso, and G. Gambosi. FUB, IASI-CNR and University of Tor Vergata at TREC 2007 Blog Track. In Proc. of the 16th Text Retrieval Conference (TREC), 2007.
[2] G. Amati, G. Amodeo, M. Bianchi, C. Gaibisso, and G. Gambosi. A uniform theoretic approach to opinion and information retrieval. In G. Armano, M. de Gemmis, G. Semeraro, and E. Vargiu (eds.), Intelligent Information Access, Studies in Computational Intelligence. Springer, to appear.
[3] J. Skomorowski and O. Vechtomova. Ad hoc retrieval of documents with topical opinion. In G. Amati, C. Carpineto, and G. Romano, editors, ECIR, volume 4425 of Lecture Notes in Computer Science, pages 405–417. Springer, 2007.
[4] K. Eguchi and V. Lavrenko. Sentiment retrieval using generative models. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 345–354, Sydney, Australia, July 2006. Association for Computational Linguistics.
[5] G. Mishne. Multiple ranking strategies for opinion retrieval in blogs. In The Fifteenth Text REtrieval Conference (TREC 2006) Proceedings, 2006.
[6] C. Macdonald, I. Ounis, and I. Soboroff. Overview of the TREC-2007 Blog Track. In Proc. of the 16th Text Retrieval Conference (TREC), 2007.
[7] C. Macdonald and I. Ounis. The TREC Blogs06 collection: Creating and analysing a blog test collection. Technical report, University of Glasgow, Scotland, UK, 2006.
[8] I. Ounis, M. de Rijke, C. Macdonald, G. A. Mishne, and I. Soboroff. Overview of the TREC-2006 Blog Track. In TREC 2006 Working Notes, 2006.
[9] I. Ounis, C. Macdonald, and I. Soboroff. On the TREC Blog Track. In Proc. of the 2nd International Conference on Weblogs and Social Media (ICWSM), 2008.


[10] I. Ounis, C. Macdonald, and I. Soboroff. Overview of the TREC-2008 Blog Track. In Proc. of the 17th Text Retrieval Conference (TREC), 2008.
[11] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.
[12] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proc. of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pages 79–86, 2002.
[13] I. Soboroff and D. Harman. Overview of the TREC 2003 Novelty Track. In TREC, pages 38–53, 2003.