=Paper=
{{Paper
|id=None
|storemode=property
|title=A Study on Evaluation on Opinion Retrieval Systems
|pdfUrl=https://ceur-ws.org/Vol-560/paper12.pdf
|volume=Vol-560
|dblpUrl=https://dblp.org/rec/conf/iir/AmatiACGG10
}}
==A Study on Evaluation on Opinion Retrieval Systems==
Giambattista Amati (Fondazione Ugo Bordoni, Rome, Italy, gba@fub.it), Giuseppe Amodeo (Dept. of Computer Science, University of L'Aquila, L'Aquila, Italy, gamodeo@fub.it), Valerio Capozio (Dept. of Mathematics, University of Rome "Tor Vergata", Rome, Italy, valeriocapozio@gmail.com), Carlo Gaibisso (Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti" - CNR, Rome, Italy, carlo.gaibisso@iasi.cnr.it), Giorgio Gambosi (Dept. of Mathematics, University of Rome "Tor Vergata", Rome, Italy, gambosi@mat.uniroma2.it)

ABSTRACT

We study the evaluation of opinion retrieval systems. Opinion retrieval is a relatively new research area; nevertheless, the classical evaluation measures adopted for ad hoc retrieval, such as MAP or precision at 10, have been used to assess the quality of rankings. In this paper we investigate the effectiveness of these standard evaluation measures for topical opinion retrieval. To do so, we separate the opinion dimension from the relevance dimension and use opinion classifiers of varying accuracy to analyse how opinion retrieval performance changes when the outcomes of the opinion classifiers are perturbed. Classifiers can be studied in two modalities: either to re-rank or to directly filter out documents obtained through a first relevance retrieval. In this paper we formally outline both approaches, while focussing for now on the filtering process. The proposed approach aims to establish the correlation between the accuracy of the classifiers and the performance of topical opinion retrieval. In this way it becomes possible to assess the effectiveness of the opinion component by comparing the effectiveness of the relevance baseline with that of the topical opinion ranking.

Categories and Subject Descriptors: H.3.0 [Information Storage and Retrieval]: General; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Theory, Experimentation

Keywords: Sentiment Analysis, Opinion Retrieval, Opinion Finding, Classification

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/index.html Copyright owned by the authors.

1. INTRODUCTION

Sentiment analysis aims at classifying documents according to the opinions, sentiments or, more generally, subjective features contained in their text. The study and evaluation of efficient solutions to detect sentiments in text is a popular research area, and different techniques have been applied, coming from natural language processing, computational linguistics, machine learning, information retrieval and text mining.

The application of sentiment analysis to Information Retrieval goes back to the novelty track of TREC 2003 [13]. Topical opinion retrieval is also known as opinion retrieval or opinion finding [4, 9, 11]. In [5, 3, 2, ?] dictionary-based methodologies for topical opinion retrieval are proposed. An application of opinion finding to blogs was introduced in the Blog Track of TREC 2006 [8]. However, there is not yet a comprehensive study of the evaluation of topical opinion systems, and in particular of the interaction and correlation between relevance and sentiment assessments.

At first glance, the evaluation of opinion retrieval systems seems not to deserve any further investigation or extra effort with respect to the evaluation of conventional retrieval systems. Traditional evaluation measures, such as the Mean Average Precision (MAP) or the precision at 10 [8, 6, 10, 11], can still be used to evaluate rankings of opinionated documents that are also assessed to be relevant to a given topic.
However, if we take a deeper look at the performance of topical opinion systems, we are struck by the diversity in the observed performance values. For example, the best run for topic relevance in the blog track of TREC 2008 [10] achieves a MAP value of 0.4954, which drops to 0.4052 for the opinion MAP in the opinion finding task. Some performance degradation is expected, because any variable additional to relevance, i.e. the opinion one, must deteriorate the system performance. However, we do not yet have a way to set apart the effectiveness of the opinion detection component and evaluate how effective it is, or to determine whether, and to what extent, the relevance and opinion detection components influence each other. It seems evident that an evaluation methodology, or at least some benchmarks, is needed to make it possible to assess how effective the opinion component is. To exemplify: how effective is an opinion MAP of 0.4052 when we start from an initial relevance MAP of 0.4954? It is indeed a matter of fact that the opinion MAP in TREC [8, 6, 10] seems to be highly dependent on the relevance MAP of the first-pass retrieval [9].

The general issue is thus the following: can we assume that absolute values of MAP can be used as they are to compare different tasks, in our case the topical opinion task and the ad hoc relevance task? And thus: can evaluation measures be used, without any MAP normalization, to compare or to assess the state of the art of different techniques on opinion finding?

To this aim, we introduce a completely novel methodological framework which:

• provides a bound for the best achievable opinion MAP, for a given relevance document ranking;

• predicts the performance of topical opinion retrieval given the performance of the topic retrieval and opinion detection;

• vice versa, establishes whether a given opinion detection technique gives a significant or marginal contribution to the state of the art;

• investigates the robustness of evaluation measures for opinion retrieval effectiveness;

• indicates what re-ranking or filtering strategy is best suited to improve topical retrieval by opinion classifiers.

By "comparing" the results of an opinion retrieval system with the filtering process, or with the re-ranking process, at several levels of accuracy, we can obtain relevant clues about:

• the overall contribution introduced by the opinion system only, and its robustness;

• the effectiveness of the opinion detection component.

This paper is organized as follows. The proposed evaluation method is presented in sections 2 and 4; section 3 introduces the collection used for the tests. Results are presented in section 5, and conclusions follow in section 6.
2. EVALUATION APPROACH

An opinion retrieval system is based on a topic retrieval and an opinion detection subsystem [9]: different kinds of "information" are retrieved and weighted in order to generate a final ranking of documents that reflects their relevance with respect to both topic and opinion content. To analyse the effectiveness of the whole system, we should be able to quantify not only the performance of the final result, but also the contribution of each subsystem. As usual, the evaluation metric used in the literature for the final ranking is the MAP. But the MAP (of relevance and opinion) of the final ranking is not sufficient to fully assess the performance of the whole system: the contribution of each component, taken separately, needs to be identified.

The input to the proposed topical opinion evaluation process is the relevance baseline, i.e. the ranking of documents generated by the topic retrieval system, here considered as a black box. The effectiveness of the topic retrieval component is measured by the MAP of opinion and relevance of this baseline.

The evaluation of the effectiveness of the opinion detection component relies on artificially defined classifiers of opinion. The artificial classifier C_O^k classifies documents as opinionated, O, or not opinionated, Ō, with accuracy k, 0 ≤ k ≤ 1. The classification process is independent from the topic relevance of documents. To achieve accuracy k, C_O^k properly classifies each document with probability k. Therefore the number of misclassified documents is (1 − k) · n, where n is the number of classified documents. Assuming independence between opinion and relevance, the misclassified documents are distributed randomly between relevant and not relevant documents.

The outcomes of these artificial classifiers are then used to modify the baseline. This can be done following two different approaches:

• a filtering process: documents of the baseline that are deemed not opinionated by the classifier are removed from the ranking;

• a re-ranking process: documents of the baseline that are considered opinionated by the classifier receive a "reward" in their rank.

The filtering process uses the classifier in its classical meaning. This process is particularly suitable to analyse the effectiveness of the technique itself for opinion detection, as a classification task [12], and its effects on topical opinion performance. Opinion filtering also gives some interesting clues on what is the optimal performance achievable by an opinion retrieval technique based on filtering, and on whether the filtering strategy is in general superior or not even to very simple re-ranking strategies.

In the re-ranking process a "reward" function for the documents has to be defined. In such a case we introduce a bias in assigning correct rewards, and we can thus observe the effectiveness of a re-ranking algorithm as the opinion detection performance changes.

In the following we formally describe both approaches, and focus on the experimentation concerning the filtering process only.

3. EXPERIMENTATION ENVIRONMENT

We used the BLOG06 [7] collection and the data sets of the Blog Track of TREC 2006, 2007 and 2008 [8, 6, 10] for our experimentation. Since 2006, the Blog Track has been an evaluation track on blogs whose main task is opinion retrieval, that is, the task of selecting the opinionated blog posts relevant to a given topic [9]. The BLOG06 collection size is 148 GB, and it contains spam as well as possibly non-blog and non-English pages.

The data set consists of 150 topics and a list, the Qrels, in which the relevance and opinion content of documents are assessed with respect to each topic. An item in the list identifies a topic t, a document d and a relevance/opinion judgement assigned as follows:

• 0 if d is not relevant with respect to t;

• 1 if d is relevant to t, but does not contain comments on t;

• 2 if d is relevant to t and contains positive comments on t;

• 3 if d is relevant to t and contains neutral comments on t;

• 4 if d is relevant to t and contains negative comments on t.

Note that not relevant documents are not classified according to their opinion content. In the following, [x] denotes the set of documents labelled by x = 0, 1, 2, 3, 4; documents that are not labelled belong to [0] by default. TREC organizers also provide the best five baselines, produced by some participants, denoted by BL1, BL2, ..., BL5.
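To make the structure of the Qrels concrete, the following Python sketch (not part of the original paper) loads a TREC-style judgement file and groups document identifiers into the label sets [x] described above; the file name and the exact "topic iteration docno judgement" column layout are assumptions.

<pre>
# Minimal sketch: group Qrels entries into the label sets [x] per topic.
# The file name and the 4-column layout are assumptions, not taken from the paper.
from collections import defaultdict

def load_qrels(path="qrels.opinion.blog06"):
    """Return {topic: {label: set(docnos)}} with labels 0..4."""
    qrels = defaultdict(lambda: defaultdict(set))
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 4:
                continue  # skip malformed lines
            topic, docno, label = fields[0], fields[2], int(fields[3])
            qrels[topic][label].add(docno)
    return qrels

if __name__ == "__main__":
    qrels = load_qrels()
    for topic, labels in qrels.items():
        relevant = labels[1] | labels[2] | labels[3] | labels[4]
        opinionated = labels[2] | labels[3] | labels[4]   # [2] U [3] U [4]
        print(topic, len(relevant), len(opinionated))
</pre>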
4. EVALUATION FRAMEWORK

The behaviour of the artificial classifier C_O^k is defined through the Qrels. C_O^k predicts the right opinion orientation of each document in the collection by searching it in the Qrels. The accuracy k is simulated by the introduction of a bias in the classification. Documents not appearing in the Qrels, or assessed as not relevant, are classified according to the probability distribution of opinionated and not opinionated documents among the relevant ones. Taking into account both relevance and opinion in the test collection, we obtain the contingency Table 1. As shown in Table 1, the Qrels do not provide the opinion classes for not relevant documents.

        O                      Ō
R       |{[2] ∪ [3] ∪ [4]}|    |[1]|
R̄       NA                     NA

Table 1: the contingency table for an opinion-only classifier for documents in the BLOG06 collection. R denotes relevance, R̄ non-relevance; O denotes opinion, Ō non-opinion. With the notation [x] we refer to the class of documents labelled by x = 1, 2, 3, 4 in the Qrels.

The missing data complicate a little, but not much, the construction of our classifiers. To overcome the problem, we assume that

Pr(O|R) = Pr(O|R̄)    (1)

Equation 1 asserts that there is no sufficient reason to assume a different distribution of opinion among relevant and not relevant documents. The a priori probability Pr(O) of opinionated documents is still unknown. However, equation 1 implies that O and R are independent, thus

Pr(O|R) = Pr(O)    (2)

From equations 1 and 2 it follows that

Pr(Ō|R) = Pr(Ō|R̄) = Pr(Ō) = 1 − Pr(O)    (3)

Equations 2 and 3 are equivalent to assuming that the set {[2] ∪ [3] ∪ [4]}, as defined in Table 1, is a sample of the set of opinionated documents. Thus, without loss of generality, we can define Pr(O) using only the documents assessed as relevant by the Qrels, as follows:

Pr(O) = |{[2] ∪ [3] ∪ [4]}| / |{[1] ∪ [2] ∪ [3] ∪ [4]}|    (4)

and consequently

Pr(Ō) = 1 − Pr(O) = |[1]| / |{[1] ∪ [2] ∪ [3] ∪ [4]}|    (5)

Together with C_O^k, we introduce a random classifier C_O^RC that classifies documents according to the a priori distribution of opinionated documents in the collection. It represents a good approximation of the random behaviour of a classifier. More precisely, this classifier assesses a document as opinionated with probability Pr(O) and as not opinionated with probability Pr(Ō) = 1 − Pr(O).

In the following we study whether and how the set of relevant and not relevant documents classified as opinionated affects the topical opinion ranking. We have to say that, for both approaches, filtering or re-ranking, a misclassification may have controversial effects on the effectiveness of the final ranking. If we filter documents by opinion with a classifier, for example, the misclassified and removed not relevant documents may bring a positive contribution to the precision measures, because all opinionated and relevant documents that were below them will have a higher rank after their removal. With the re-ranking approach we have a similar situation, but this precision-boosting phenomenon is attenuated by the fact that re-ranking is not based on as drastic a decision as a removal, and the repositioning of a document does not propagate to all documents that are below it in the original ranking.
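The following Python sketch illustrates one way to implement the artificial classifier C_O^k and the random classifier C_O^RC described above. It is a minimal illustration under the stated assumptions (label sets taken from the Qrels; unjudged and not relevant documents classified according to Pr(O)), not the authors' code.

<pre>
# Minimal sketch of C_O^k and C_O^RC; data structures follow the earlier Qrels sketch.
import random

def estimate_pr_o(labels):
    """labels: {label: set(docnos)} for one topic. Returns Pr(O) as in eq. (4)."""
    opinionated = len(labels[2] | labels[3] | labels[4])
    relevant = len(labels[1] | labels[2] | labels[3] | labels[4])
    return opinionated / relevant if relevant else 0.0

def artificial_classifier(docno, labels, k, pr_o, rng=random):
    """C_O^k: return True if the document is classified as opinionated."""
    truly_opinionated = docno in (labels[2] | labels[3] | labels[4])
    judged_relevant = truly_opinionated or docno in labels[1]
    if judged_relevant:
        # with probability k the true orientation is kept, otherwise it is flipped
        return truly_opinionated if rng.random() < k else not truly_opinionated
    # unjudged or not relevant documents: sample from the a priori distribution
    return rng.random() < pr_o

def random_classifier(docno, pr_o, rng=random):
    """C_O^RC: classify as opinionated with the a priori probability Pr(O)."""
    return rng.random() < pr_o
</pre>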
4.1 Filtering approach

As already stated, in the filtering approach documents classified as not opinionated are removed from the baseline. Note that while relevant documents, if correctly classified, contribute to and improve the evaluation measure, the not relevant ones do not contribute directly to this measure. As a consequence, if a not relevant document is classified as opinionated while not being actually opinionated, this misclassification will not affect the evaluation measure. Differently, the removal of not relevant documents, regardless of their real opinion orientation, always positively affects the ranking, even if they are misclassified. For relevant documents, instead, a misclassification always negatively affects the ranking.

With this approach we can observe how hard it is to overcome the baseline, i.e. we can identify how effective the opinion detection technique must be to improve on the starting topic retrieval.

4.2 Re-ranking approach

Re-ranking techniques are essentially fusion models [9] that combine a relevance score s_R(d) and an opinion score s_O(d) (or two ranks derived from these scores) for a document d. The new score s_OR(d) is a function of the two non-negative scores s_R(d) and s_O(d):

s_OR(d) = f(s_R(d), s_O(d))    (6)

Given a classifier C_O^k, we define a new score s_OR^C(d), based on the outcomes of C_O^k, according to which the baseline is re-ranked. s_OR^C(d) is defined as follows:

s_OR^C(d) = f(s_R(d), s_O(d))   if d ∈_C O
s_OR^C(d) = f(s_R(d), 0)        if d ∉_C O    (7)

where ∈_C denotes the classifier outcome, that is, the fact that the document is assigned to a given class. Note that when k = 100%, and assuming that f(·, ·) is a non-decreasing function of s_O(·), i.e. f(s_R(d), x) ≥ f(s_R(d), x′) for all x ≥ x′, the opinion MAP of any ranking based on s_OR(·) does not exceed that based on s_OR^C(·).

All the above considerations can be further extended to the case in which s_OR(d) is based on the ranks of d instead of on its scores (of relevance and opinion).
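As a concrete illustration of the two strategies, the sketch below (assumed data structures, not the authors' code) filters a baseline run using a classifier outcome and re-scores it according to equation (7); the fusion function f used here is only one possible non-decreasing choice.

<pre>
# Minimal sketch of the filtering and re-ranking modifications of a baseline run.
# A run is a list of (docno, relevance_score) pairs sorted by decreasing score.

def filter_baseline(run, is_opinionated):
    """Filtering: drop documents the classifier deems not opinionated."""
    return [(d, s) for d, s in run if is_opinionated(d)]

def rerank_baseline(run, is_opinionated, opinion_score,
                    f=lambda s_r, s_o: s_r * (1.0 + s_o)):
    """Re-ranking: score documents with equation (7); f is one illustrative
    non-decreasing fusion function, not the one used in the paper."""
    rescored = [(d, f(s, opinion_score(d)) if is_opinionated(d) else f(s, 0.0))
                for d, s in run]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
</pre>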
5. EXPERIMENTATION RESULTS

In this paper we report the experimentation results for the filtering approach. The filtering process has been repeated 20 times for each baseline and for each accuracy k = 0.5, 0.6, 0.7, 0.8, 0.9, 1. Mean values of the MAPs are reported.

Table 2 reports, in decreasing order, the relevance MAPs (MAP_R) and the opinion MAPs (MAP_O) of each baseline.

Baseline    MAP_R     MAP_O
BL4         0.4776    0.3542
BL5         0.4247    0.2974
BL3         0.4079    0.3007
BL1         0.3540    0.2470
BL2         0.3382    0.2657

Table 2: MAP of relevance (MAP_R) and opinion (MAP_O) of the five baselines.

In Figure 1, MAP values are reported for each baseline as the accuracy of the classifiers changes. The dotted lines represent the opinion MAPs of the baselines and the dot-dashed lines the relevance MAPs of the baselines. The MAP values of the random classifier are also reported as dashed lines in the graphs.

[Figure 1: MAPs of opinion of the baselines filtered by C_O^k for k = 0.5, 0.6, 0.7, 0.8, 0.9, 1. The opinion MAPs (dotted lines) and relevance MAPs (dot-dashed lines) of the baselines are also reported. Finally, dashed lines show the opinion MAPs of the baselines filtered by C_O^RC.]

Analysing the MAP trend, we can make the following observations:

1. the baseline MAP_R is an upper bound for the MAP_O obtained with a filtering approach;

2. the random classifier always deteriorates the performance of the baseline MAP_O;

3. the minimal accuracy needed to improve the baseline MAP_O by filtering is very high, at least 80%;

4. there is a linear correlation between the MAP_O achievable by a classifier with accuracy k and the accuracy itself.

The first three remarks say that the filtering strategy is very risky for MAP_O performance, that is, removing documents greatly affects the performance of topical opinion retrieval. From the above considerations, we may conclude that the opinion retrieval task is not easy and that obtaining good results with a filtering approach requires a very high accuracy. The experimentation, however, allows us to identify a plausible range for the MAP achievable by an opinion retrieval system: the classifier with accuracy 100% and the random classifier obtain performances that can be considered as thresholds for the best and the worst opinion detection system. It is also evident that the higher the baseline MAP is, the higher the accuracy of the classifier must be to introduce some benefit with a filtering approach with respect to relevance-only retrieval.
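A sketch of the experimental loop described above is shown below: it computes the average precision of a filtered ranking against the opinionated-and-relevant documents and averages the opinion MAP over repeated filterings. It reuses the artificial_classifier function from the earlier sketch and, for brevity, takes Pr(O) as a single parameter instead of estimating it per topic; both simplifications are assumptions, not the authors' evaluation scripts.

<pre>
# Minimal sketch of the repeated-filtering experiment for a given accuracy k.

def average_precision(ranking, targets):
    """ranking: list of docnos; targets: set of docnos judged positive."""
    hits, score = 0, 0.0
    for i, docno in enumerate(ranking, start=1):
        if docno in targets:
            hits += 1
            score += hits / i
    return score / len(targets) if targets else 0.0

def mean_opinion_map(runs, qrels, k, repetitions=20, pr_o=0.5):
    """Mean MAP_O over repeated filterings of every topic's baseline ranking.
    `runs` maps a topic to its baseline ranking (list of docnos),
    `qrels` maps a topic to its label sets, as in the earlier sketches."""
    maps = []
    for _ in range(repetitions):
        ap_values = []
        for topic, ranking in runs.items():
            labels = qrels[topic]
            opinionated = labels[2] | labels[3] | labels[4]
            kept = [d for d in ranking
                    if artificial_classifier(d, labels, k, pr_o)]  # earlier sketch
            ap_values.append(average_precision(kept, opinionated))
        maps.append(sum(ap_values) / len(ap_values))
    return sum(maps) / repetitions
</pre>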
6. CONCLUSIONS AND FUTURE WORKS

The opinion retrieval problem seems to be a relatively hard task: the combination of two variables, topic relevance and opinion, requires a deep analysis of their correlation. From the results of the TREC competitions [8, 6, 10, 9] emerges the lack of exhaustive evaluation measures: MAP, precision at 10 and R-precision alone are not sufficient to give a complete analysis of system performance.

Up to now we have studied only the filtering of documents by opinion. This strategy, however, requires a very high accuracy of the classification. We will complete the study with the re-ranking approach, starting from the approach used in [1, 2].

Our approach is able to provide an indicative accuracy of the opinion component of a topical opinion retrieval system. It also allows us to propose an evaluation framework able to assess the effectiveness of opinion retrieval systems.

7. REFERENCES

[1] G. Amati, E. Ambrosi, M. Bianchi, C. Gaibisso, and G. Gambosi. FUB, IASI-CNR and University of Tor Vergata at TREC 2007 Blog Track. In Proc. of the 16th Text REtrieval Conference (TREC), 2007.

[2] G. Amati, G. Amodeo, M. Bianchi, C. Gaibisso, and G. Gambosi. A uniform theoretic approach to opinion and information retrieval. In G. Armano, M. de Gemmis, G. Semeraro, and E. Vargiu, editors, Intelligent Information Access, Studies in Computational Intelligence. Springer, to appear.

[3] J. Skomorowski and O. Vechtomova. Ad hoc retrieval of documents with topical opinion. In G. Amati, C. Carpineto, and G. Romano, editors, ECIR, volume 4425 of Lecture Notes in Computer Science, pages 405–417. Springer, 2007.

[4] K. Eguchi and V. Lavrenko. Sentiment retrieval using generative models. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 345–354, Sydney, Australia, July 2006. Association for Computational Linguistics.

[5] G. Mishne. Multiple ranking strategies for opinion retrieval in blogs. In The Fifteenth Text REtrieval Conference (TREC 2006) Proceedings, 2006.

[6] C. Macdonald, I. Ounis, and I. Soboroff. Overview of the TREC 2007 Blog Track. In Proc. of the 16th Text REtrieval Conference (TREC), 2007.

[7] C. Macdonald and I. Ounis. The TREC Blogs06 collection: Creating and analysing a blog test collection. Technical report, University of Glasgow, Scotland, UK, 2006.

[8] I. Ounis, M. de Rijke, C. Macdonald, G. A. Mishne, and I. Soboroff. Overview of the TREC 2006 Blog Track. In TREC 2006 Working Notes, 2006.

[9] I. Ounis, C. Macdonald, and I. Soboroff. On the TREC Blog Track. In Proc. of the 2nd International Conference on Weblogs and Social Media (ICWSM), 2008.

[10] I. Ounis, C. Macdonald, and I. Soboroff. Overview of the TREC 2008 Blog Track. In Proc. of the 17th Text REtrieval Conference (TREC), 2008.

[11] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.

[12] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proc. of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pages 79–86, 2002.

[13] I. Soboroff and D. Harman. Overview of the TREC 2003 Novelty Track. In TREC, pages 38–53, 2003.