A study on evaluation on opinion retrieval systems

Giambattista Amati

gba@fub.it 2

Giuseppe Amodeo

gamodeo@fub.it 0

Valerio Capozio

valeriocapozio@gmail.com 1

Carlo Gaibisso

carlo.gaibisso@iasi.cnr.it 3

Giorgio Gambosi

gambosi@mat.uniroma2.it 1

0 Dept. of Computer Science, University of L'Aquila, L'Aquila, Italy

1 Dept. of Mathematics, University of Rome "Tor Vergata", Rome, Italy

2 Fondazione Ugo Bordoni, Rome, Italy

3 Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti" - CNR, Rome, Italy

2010

27 28

We study the evaluation of opinion retrieval systems. Opinion retrieval is a relatively new research area; nevertheless, classical evaluation measures adopted for ad hoc retrieval, such as MAP and precision at 10, have been used to assess the quality of rankings. In this paper we investigate the effectiveness of these standard evaluation measures for topical opinion retrieval. To do so, we separate the opinion dimension from the relevance dimension and use opinion classifiers of varying accuracy to analyse how opinion retrieval performance changes when the outcomes of the classifiers are perturbed. Classifiers can be used in two ways: either to re-rank or to filter out directly the documents obtained through a first relevance retrieval. In this paper we formally outline both approaches, while for now focusing on the filtering process. The proposed approach aims to establish the correlation between the accuracy of the classifiers and the performance of topical opinion retrieval. In this way it will be possible to assess the effectiveness of the opinion component by comparing the effectiveness of the relevance baseline with that of the topical opinion ranking.

Categories and Subject Descriptors

H.3.0 [Information Storage and Retrieval]: General; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

1. INTRODUCTION

Sentiment analysis aims to classify documents according to the opinions, sentiments or, more generally, subjective features contained in text. The study and evaluation of efficient solutions to detect sentiments in text is a popular research area, and different techniques have been applied, coming from natural language processing, computational linguistics, machine learning, information retrieval and text mining.

The application of sentiment analysis to Information Retrieval goes back to the novelty track of TREC 2003 [13]. Topical opinion retrieval is also known as opinion retrieval or opinion finding [4, 9, 11]. In [5, 3, 2, ?] dictionary-based methodologies for topical opinion retrieval are proposed. An application of opinion finding to blogs was introduced in the Blog Track of TREC 2006 [8]. However, there is not yet a comprehensive study of the evaluation of topical opinion systems, and in particular of the interaction and correlation between relevance and sentiment assessments.

At first glance, the evaluation of opinion retrieval systems does not seem to deserve any further investigation or extra effort with respect to the evaluation of conventional retrieval systems. Traditional evaluation measures, such as Mean Average Precision (MAP) or precision at 10 [8, 6, 10, 11], can still be used to evaluate rankings of opinionated documents that are also assessed to be relevant to a given topic. However, a closer look at the performance of topical opinion systems reveals a striking diversity in the observed values. For example, the best run for topic relevance in the Blog Track of TREC 2008 [10] achieves a MAP of 0.4954, which drops to 0.4052 for the MAP of opinion in the opinion finding task. Performance degradation is expected, because any variable additional to relevance, here the opinion one, must deteriorate system performance. However, we do not yet have a way to isolate the opinion detection component and evaluate how effective it is, or to determine whether and to what extent the relevance and opinion detection components influence each other. It seems evident that an evaluation methodology, or at least some benchmarks, are needed to assess how effective the opinion component is. To exemplify: how good is an opinion MAP of 0.4052 when we start from an initial relevance MAP of 0.4954? Indeed, opinion MAP in TREC [8, 6, 10] seems to be highly dependent on the relevance MAP of the first-pass retrieval [9].

The general issue is thus the following: can absolute values of MAP be used as they are to compare different tasks, in our case the topical opinion task and the ad hoc relevance task? And thus: can evaluation measures be used without any MAP normalization to compare, or to assess the state of the art of, different techniques for opinion finding?

To this aim, we introduce a completely novel methodological framework which:

- provides a bound for the best achievable opinion MAP, for a given relevance document ranking;
- predicts the performance of topical opinion retrieval given the performance of the topic retrieval and opinion detection;
- vice versa, indicates whether a given opinion detection technique gives a significant or marginal contribution to the state of the art;
- investigates the robustness of evaluation measures for opinion retrieval effectiveness;
- indicates what re-ranking or filtering strategy is best suited to improve topical retrieval by opinion classifiers.

This paper is organized as follows. The proposed evaluation method is presented in sections 2 and 4; section 3 introduces the collection used for tests. Results are presented in section 5, and conclusions follow in section 6.

2. EVALUATION APPROACH

An opinion retrieval system is based on a topic retrieval and an opinion detection subsystem [9]: different kinds of "information" are retrieved and weighted in order to generate a final ranking of documents that reflects their relevance with respect to both topic and opinion content. To analyse the effectiveness of the whole system, we should be able to quantify not only the performance of the final result, but also the contribution of each subsystem. As usual, the evaluation metric used in the literature for the final ranking is the MAP. But the MAP (of relevance and opinion) of the final ranking is not sufficient to fully assess the performance of the whole system: the contribution of each component, taken separately, needs to be identified.

The input to the proposed topical opinion evaluation process is the relevance baseline, i.e. the ranking of documents generated by the topic retrieval system, here considered as a black box. The effectiveness of the topic retrieval component is measured by the MAP of opinion and relevance of this baseline.
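The two MAP values of a baseline differ only in which documents count as positives. The following is a minimal sketch of the standard (non-interpolated) average precision, not the evaluation code used in the experiments; the function names and the toy topic are illustrative assumptions.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], positives: Set[str]) -> float:
    """Average precision of one ranked list against a set of positive documents."""
    if not positives:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in positives:
            hits += 1
            precision_sum += hits / rank
    # Positives that never appear in the ranking contribute zero precision.
    return precision_sum / len(positives)


def mean_average_precision(runs: Dict[str, List[str]],
                           positives: Dict[str, Set[str]]) -> float:
    """Mean of the per-topic average precisions."""
    if not runs:
        return 0.0
    total = sum(average_precision(ranking, positives.get(topic, set()))
                for topic, ranking in runs.items())
    return total / len(runs)


# Hypothetical toy topic: d1 and d3 are relevant, only d3 is also opinionated.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
relevant_and_opinionated = {"d3"}

ap_relevance = average_precision(ranking, relevant)                 # (1/1 + 2/3) / 2
ap_opinion = average_precision(ranking, relevant_and_opinionated)   # (1/3) / 1
```

On the same ranking the opinion value is lower than the relevance value simply because fewer documents qualify as positives, which is the kind of drop discussed in the introduction.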

The evaluation of the effectiveness of the opinion detection component relies on artificially defined opinion classifiers. The artificial classifier $C^O_k$ classifies documents as opinionated, $O$, or not opinionated, $\overline{O}$, with accuracy $k$, $0 \le k \le 1$. The classification process is independent of the topic relevance of documents. To achieve accuracy $k$, $C^O_k$ properly classifies each document with probability $k$.

Therefore the number of misclassified documents is $(1 - k) \cdot n$, where $n$ is the number of classified documents. Assuming independence between opinion and relevance, the misclassified documents will be distributed randomly between relevant and not relevant ones.
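One way to simulate such a classifier, for documents whose true opinion label is known, is sketched below. It assumes an independent draw per document, so the number of errors equals (1 - k) n only in expectation; the function name, the toy ground truth and the seed are illustrative assumptions.

```python
import random
from typing import Dict


def simulate_classifier(true_opinion: Dict[str, bool], k: float,
                        rng: random.Random) -> Dict[str, bool]:
    """Each document keeps its true label with probability k and
    receives the opposite label with probability 1 - k."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("accuracy k must lie in [0, 1]")
    predicted = {}
    for doc_id, is_opinionated in true_opinion.items():
        correct = rng.random() < k
        predicted[doc_id] = is_opinionated if correct else not is_opinionated
    return predicted


# Hypothetical ground truth for 10,000 documents, half of them opinionated.
truth = {f"d{i}": (i % 2 == 0) for i in range(10_000)}
rng = random.Random(0)  # fixed seed so the sketch is reproducible

predicted = simulate_classifier(truth, k=0.8, rng=rng)
errors = sum(predicted[d] != truth[d] for d in truth)
observed_error_rate = errors / len(truth)  # close to 1 - k = 0.2 in expectation
```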

The outcomes of these artificial classifiers are then used to modify the baseline. This can be done following two different approaches:

- a filtering process: when documents of the baseline are deemed not opinionated by the classifier, they are removed from the ranking;
- a re-ranking process: when documents of the baseline are considered opinionated by the classifier, they receive a "reward" in their rank.

The filtering process uses the classifier in its classical meaning. This process is particularly suitable for analysing the effectiveness of an opinion detection technique as a classification task [12], and its effects on topical opinion performance. Opinion filtering also gives some interesting clues about the optimal performance achievable by an opinion retrieval technique based on filtering, and about whether a filtering strategy is in general superior to even very simple re-ranking strategies.
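As a minimal sketch, filtering amounts to dropping from the ranked list every document the classifier deems not opinionated, while preserving the original order of the survivors. The function name is hypothetical, and treating documents without a prediction as not opinionated is an assumption of this sketch.

```python
from typing import Dict, List


def filter_baseline(baseline: List[str],
                    predicted_opinionated: Dict[str, bool]) -> List[str]:
    """Keep only documents predicted as opinionated, preserving rank order.
    Documents with no prediction are dropped (an assumption of this sketch)."""
    return [d for d in baseline if predicted_opinionated.get(d, False)]


baseline = ["d1", "d2", "d3", "d4", "d5"]
predicted = {"d1": True, "d2": False, "d3": True, "d4": False, "d5": True}
filtered = filter_baseline(baseline, predicted)
```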

In the re-ranking process a "reward" function for the documents has to be defined. In this case we introduce a bias in assigning the correct rewards, and we can thus observe how the effectiveness of a re-ranking algorithm varies as the opinion detection performance changes.

By "comparing" the results of an opinion retrieval system with those of the filtering process, or the re-ranking process, at several levels of accuracy, we can obtain relevant clues about:

- the overall contribution introduced by the opinion system only, and its robustness;
- the effectiveness of the opinion detection component.

In the following we formally describe both approaches and focus on the experimentation concerning the filtering process only.

3. EXPERIMENTATION ENVIRONMENT

We used the BLOG06 [7] collection and the data sets of the Blog Track of TREC 2006, 2007 and 2008 [8, 6, 10] for our experimentation. Since 2006, TREC has included an evaluation track on blogs, the Blog Track, whose main task is opinion retrieval, that is the task of selecting the opinionated blog posts relevant to a given topic [9]. The BLOG06 collection is 148 GB in size and contains spam as well as possibly non-blog and non-English pages.

The data set consists of 150 topics and a list, the Qrels, in which the relevance and opinion content of documents are assessed with respect to each topic. An item in the list identifies a topic t, a document d and a judgement of relevance/opinion assigned as follows:

- 0 if d is not relevant with respect to t;
- 1 if d is relevant to t, but does not contain comments on t;
- 2 if d is relevant to t and contains positive comments on t;
- 3 if d is relevant to t and contains neutral comments on t;
- 4 if d is relevant to t and contains negative comments on t.

Note that not relevant documents are not classified according to their opinion content.

In the following, [x] denotes the set of documents labelled by $x \in \{0, 1, 2, 3, 4\}$; unlabelled documents belong to [0] by default.
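The label scheme can be sketched as a lookup with a default of 0 for unjudged documents. The in-memory dictionary below is a hypothetical stand-in for the Qrels, not the TREC file format, and the helper names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# Hypothetical in-memory Qrels: (topic, document) -> label in {0, 1, 2, 3, 4}.
qrels: Dict[Tuple[str, str], int] = {
    ("t1", "d1"): 0, ("t1", "d2"): 1, ("t1", "d3"): 2,
    ("t1", "d4"): 3, ("t1", "d5"): 4,
}


def label_of(topic: str, doc_id: str) -> int:
    """Unlabelled documents belong to [0] by default."""
    return qrels.get((topic, doc_id), 0)


def is_relevant(label: int) -> bool:
    return label >= 1          # [1], [2], [3], [4]


def is_opinionated(label: int) -> bool:
    return label >= 2          # [2], [3], [4]; undefined for [0] in the Qrels


def label_sets(topic: str) -> Dict[int, Set[str]]:
    """Build the sets [x] of judged documents for one topic."""
    sets: Dict[int, Set[str]] = defaultdict(set)
    for (t, doc_id), label in qrels.items():
        if t == topic:
            sets[label].add(doc_id)
    return sets


sets_t1 = label_sets("t1")
```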

TREC organizers also provide the best five baselines, produced by some participants, denoted by BL1, BL2, ..., BL5.

4. EVALUATION FRAMEWORK

The behaviour of the artificial classifier $C^O_k$ is defined through the Qrels. $C^O_k$ predicts the right opinion orientation of each document in the collection by looking it up in the Qrels. The accuracy $k$ is simulated by introducing a bias in the classification. Documents not appearing in the Qrels, or assessed as not relevant, are classified according to the probability distribution of opinionated and not opinionated documents among the relevant ones. Taking into account both relevance and opinion in the test collection, we obtain the contingency Table 1. As shown in Table 1, the Qrels do not provide the opinion classes for not relevant documents. The missing data slightly complicate the construction of our classifiers. To overcome the problem, we assume that

$$Pr(O \mid R) = Pr(O \mid \overline{R}) \qquad (1)$$

Equation 1 asserts that there is no sufficient reason to have a different distribution of opinion among relevant and not relevant documents. An a priori probability, $Pr(O)$, for opinionated documents is still unknown. However, equation 1 implies that $O$ and $R$ are independent, thus

$$Pr(O \mid R) = Pr(O) \qquad (2)$$

From equations 1 and 2 it follows that

$$Pr(\overline{O} \mid R) = Pr(\overline{O} \mid \overline{R}) = Pr(\overline{O}) = 1 - Pr(O) \qquad (3)$$

Equations 2 and 3 are equivalent to assuming that the set $[2] \cup [3] \cup [4]$, as defined in Table 1, is a sample of the set of opinionated documents. Thus, without loss of generality, we can define $Pr(O)$ using only the documents classified as relevant by the Qrels as follows:

$$Pr(O) = \frac{|[2] \cup [3] \cup [4]|}{|[1] \cup [2] \cup [3] \cup [4]|} \qquad (4)$$

and consequently

$$Pr(\overline{O}) = 1 - Pr(O) = \frac{|[1]|}{|[1] \cup [2] \cup [3] \cup [4]|} \qquad (5)$$
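The estimate of the prior from the relevant documents only can be sketched as a simple count over Qrels labels. The label list below is hypothetical and the function name is an illustrative assumption.

```python
from collections import Counter
from typing import Iterable


def opinion_prior(labels: Iterable[int]) -> float:
    """Estimate Pr(O) as |[2] u [3] u [4]| / |[1] u [2] u [3] u [4]|."""
    counts = Counter(labels)
    relevant = sum(counts[x] for x in (1, 2, 3, 4))
    if relevant == 0:
        raise ValueError("no relevant documents to estimate the prior from")
    opinionated = sum(counts[x] for x in (2, 3, 4))
    return opinionated / relevant


# Hypothetical labels: three documents in [0], three in [1], one in [2],
# one in [3] and two in [4].
labels = [0, 0, 1, 1, 2, 3, 4, 4, 0, 1]
p_o = opinion_prior(labels)   # 4 opinionated out of 7 relevant
p_not_o = 1 - p_o             # 3 out of 7, i.e. |[1]| over the relevant set
```

Documents labelled 0 do not enter the estimate at all, which reflects the assumption that opinion is distributed in the same way among relevant and not relevant documents.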

In the following we study whether and how the set of relevant and not relevant documents classified as opinionated affects the topical opinion ranking.

For both approaches, filtering or re-ranking, a misclassification may have mixed effects on the effectiveness of the final ranking. If we filter documents by opinion with a classifier, for example, the not relevant documents that are misclassified and removed may bring a positive contribution to the precision measures, because all opinionated and relevant documents that were below them will have a higher rank after their removal. With the re-ranking approach we have a similar situation, but this precision-boosting phenomenon is attenuated by the fact that re-ranking is not based on as drastic a decision as a removal, and the repositioning of a document does not propagate to all documents that are below it in the original ranking.

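[Table 1: contingency table of opinion ($O$, $\overline{O}$) versus relevance ($R$, $\overline{R}$) in the test collection.]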

Together with $C^O_k$, we introduce a random classifier $C^O_{RC}$ that classifies documents according to the a priori distribution of opinionated documents in the collection. It represents a good approximation of the random behaviour of a classifier. More precisely, this classifier assesses a document as opinionated with probability $Pr(O)$ and as not opinionated with probability $Pr(\overline{O}) = 1 - Pr(O)$.
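A minimal sketch of such a random classifier draws each outcome from the prior, ignoring the document content and its true label. The prior value 0.6, the function name and the seed are hypothetical choices made only for illustration.

```python
import random
from typing import Dict, Iterable


def random_classifier(doc_ids: Iterable[str], p_opinionated: float,
                      rng: random.Random) -> Dict[str, bool]:
    """Label each document as opinionated with probability p_opinionated."""
    return {d: rng.random() < p_opinionated for d in doc_ids}


docs = [f"d{i}" for i in range(10_000)]
predicted = random_classifier(docs, p_opinionated=0.6, rng=random.Random(0))
share_opinionated = sum(predicted.values()) / len(docs)  # close to 0.6
```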

As already stated, in the filtering approach documents classified as not opinionated are removed from the baseline. Note that while relevant documents, if correctly classified, contribute to and improve the evaluation measure, the not relevant ones do not contribute directly to this measure.

In conclusion, if a not relevant document is classified as opinionated without actually being opinionated, this misclassification will not affect the evaluation measure. By contrast, the removal of not relevant documents, regardless of their real opinion orientation, always positively affects the ranking, even if they are misclassified.

For relevant documents, instead, misclassification always negatively affects the ranking.

With this approach we can observe how hard it is to improve on the baseline, i.e. we can identify how effective the opinion detection technique must be to improve the starting topic retrieval.
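A small worked example makes the two effects concrete. It is a sketch on a hypothetical four-document ranking in which r1 and r2 are relevant and opinionated while n1 and n2 are not relevant; the average precision function is the standard one, restated here so that the block is self-contained.

```python
from typing import List, Set


def average_precision(ranking: List[str], positives: Set[str]) -> float:
    """Standard average precision; positives missing from the ranking count as zero."""
    if not positives:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in positives:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(positives)


positives = {"r1", "r2"}                  # relevant and opinionated
baseline = ["n1", "r1", "n2", "r2"]

# Baseline: (1/2 + 2/4) / 2
ap_baseline = average_precision(baseline, positives)
# Filtering out the not relevant n1 lifts r1 and r2: (1/1 + 2/3) / 2
ap_drop_not_relevant = average_precision(["r1", "n2", "r2"], positives)
# Wrongly filtering out r1 loses a positive: (1/3) / 2
ap_drop_relevant = average_precision(["n1", "n2", "r2"], positives)
```

Removing the not relevant document raises the opinion average precision above the baseline, whereas removing a relevant opinionated document lowers it, since the lost document still counts in the denominator.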

Re-ranking techniques are essentially fusion models [9] that combine a relevance score $s_R(d)$ and an opinion score $s_O(d)$ (or two ranks derived from these scores) for a document $d$. The new score $s_{OR}(d)$ is a function of the two non-negative scores $s_R(d)$ and $s_O(d)$:

$$s_{OR}(d) = f(s_R(d), s_O(d)) \qquad (6)$$

Given a classifier $C^O_k$, we define a new score $s_{COR}(d)$, based on the outcomes of $C^O_k$, according to which the baseline is re-ranked. $s_{COR}(d)$ is defined as follows:

$$s_{COR}(d) = \begin{cases} f(s_R(d), s_O(d)) & \text{if } d \in_{C^O_k} O \\ f(s_R(d), 0) & \text{if } d \notin_{C^O_k} O \end{cases} \qquad (7)$$

where $\in_{C^O_k}$ denotes the classifier outcome, that is, the class to which the document is assigned. Note that when $k = 100\%$, and assuming that $f(\cdot, \cdot)$ is a non-decreasing function of $s_O(\cdot)$, i.e. $f(s_R(d), x) \le f(s_R(d), x')$ for all $x \le x'$, the opinion MAP of any ranking based on $s_{OR}(\cdot)$ does not exceed that based on $s_{COR}(\cdot)$.

All the above considerations can be further extended to the case in which $s_{OR}(d)$ is based on the ranks of $d$ instead of on its scores (of relevance and opinion).
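The classifier-gated score can be sketched as follows. The paper leaves the fusion function f unspecified; the linear combination used here, its weight alpha and all scores are assumptions chosen only because a linear f with a non-negative weight is non-decreasing in the opinion score.

```python
from typing import Callable, Dict, List


def linear_fusion(s_r: float, s_o: float, alpha: float = 0.5) -> float:
    """Hypothetical fusion function f, non-decreasing in s_o for alpha >= 0."""
    return s_r + alpha * s_o


def rerank(relevance: Dict[str, float], opinion: Dict[str, float],
           predicted_opinionated: Dict[str, bool],
           f: Callable[[float, float], float]) -> List[str]:
    """Score each document with f(s_R, s_O) if the classifier assigns it to O,
    and with f(s_R, 0) otherwise; then sort by decreasing score."""
    scores = {}
    for doc_id, s_r in relevance.items():
        gated = predicted_opinionated.get(doc_id, False)
        s_o = opinion.get(doc_id, 0.0) if gated else 0.0
        scores[doc_id] = f(s_r, s_o)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


relevance = {"d1": 3.0, "d2": 2.5, "d3": 2.0}
opinion = {"d1": 1.0, "d2": 2.0, "d3": 4.0}
predicted = {"d1": False, "d2": True, "d3": True}

# Scores: d1 -> 3.0 + 0, d2 -> 2.5 + 1.0, d3 -> 2.0 + 2.0
reranked = rerank(relevance, opinion, predicted, linear_fusion)
```

In this toy case d1 loses its opinion reward because the classifier assigns it to the not opinionated class, so it falls from the first to the last position.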

5. EXPERIMENTATION RESULTS

In this paper we report the experimentation results for the filtering approach. The filtering process has been repeated 20 times for each baseline and for each accuracy $k$ = 0.5, 0.6, 0.7, 0.8, 0.9, 1. Mean values of the MAPs are reported.
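The experimental protocol can be sketched as a loop over accuracy levels with 20 repetitions each. Everything below is a toy illustration under stated assumptions: a single hypothetical topic (so average precision stands in for MAP), invented labels, not relevant documents classified by drawing from the prior, and a fixed seed. It does not reproduce the reported results.

```python
import random
from statistics import mean
from typing import Dict, List, Set


def average_precision(ranking: List[str], positives: Set[str]) -> float:
    if not positives:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in positives:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(positives)


def classify(label: int, k: float, p_o: float, rng: random.Random) -> bool:
    """Simulated classifier outcome for one document (True = opinionated)."""
    if label == 0:
        # No opinion judgement for not relevant documents: draw from the prior.
        return rng.random() < p_o
    truth = label >= 2
    return truth if rng.random() < k else not truth


def filtered_opinion_ap(baseline: List[str], labels: Dict[str, int],
                        k: float, p_o: float, rng: random.Random) -> float:
    kept = [d for d in baseline if classify(labels.get(d, 0), k, p_o, rng)]
    positives = {d for d, x in labels.items() if x >= 2}
    return average_precision(kept, positives)


# Hypothetical baseline and Qrels labels for one topic.
baseline = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
labels = {"d1": 2, "d2": 0, "d3": 1, "d4": 4, "d5": 0, "d6": 3, "d7": 1, "d8": 0}
relevant = [x for x in labels.values() if x >= 1]
p_o = sum(x >= 2 for x in relevant) / len(relevant)   # 3 of 5 relevant documents

rng = random.Random(0)
results = {}
for k in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    runs = [filtered_opinion_ap(baseline, labels, k, p_o, rng) for _ in range(20)]
    results[k] = mean(runs)   # mean over the 20 repetitions
```

Averaging over repetitions is needed because each run of the simulated classifier misclassifies a different random subset of documents.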

Table 2 reports, in decreasing order, the relevance MAPs ($MAP_R$) and the opinion MAPs ($MAP_O$) for each baseline.

[Table 2: $MAP_R$ and $MAP_O$ of the baselines BL4, BL5, BL3, BL1, BL2.]

In Figure 1, MAP values are reported for each baseline as the accuracy of the classifiers changes. The dotted lines represent the baseline opinion MAPs and the dot-dashed lines represent the baseline relevance MAPs. The MAP values of the random classifier are also reported, as the dashed lines in the graphs.

Analysing the MAP trend, we can make the following observations:

1. the baseline $MAP_R$ is an upper bound for the $MAP_O$ obtained with a filtering approach;
2. the random classifier always deteriorates the baseline $MAP_O$;
3. the minimal accuracy needed to improve the baseline $MAP_O$ by filtering is very high, at least 80%;
4. there is a linear correlation between the $MAP_O$ achievable by a classifier with accuracy $k$ and the accuracy itself.

The first three remarks say that a filtering strategy is very dangerous for $MAP_O$ performance, that is, removing documents greatly affects the performance of topical opinion retrieval.

From the above considerations, we may conclude that the opinion retrieval task is not easy and that obtaining good results with a filtering approach requires too high an accuracy. The experimentation, however, allows us to identify a plausible range for the MAP achievable by an opinion retrieval system: the classifier with accuracy 100% and the random classifier obtain performances that can be considered as thresholds for the best and the worst opinion detection system. It is also evident that the higher the baseline MAP is, the higher the accuracy of the classifier must be to introduce some benefit with a filtering approach with respect to relevance-only retrieval.

6. CONCLUSIONS AND FUTURE WORKS

The opinion retrieval problem seems to be a relatively hard task: the combination of two variables such as topic relevance and opinion requires a deep analysis of their correlation. The results of the TREC competitions [8, 6, 10, 9] reveal the lack of exhaustive evaluation measures: MAP, precision at 10 and R-Precision are not sufficient alone to give a complete analysis of system performance.

Up to now we have studied only the filtering of documents by opinion. This strategy, however, requires a very high classification accuracy. We will complete the study with the re-ranking approach, starting from the approach used in [1, 2].

Our approach is able to provide an indicative accuracy of the opinion component of a topical opinion retrieval system. It also allows us to propose an evaluation framework able to evaluate the effectiveness of opinion retrieval systems.

7. REFERENCES

[1] G. Amati, E. Ambrosi, M. Bianchi, C. Gaibisso, and G. Gambosi. FUB, IASI-CNR and University of Tor Vergata at TREC 2007 Blog Track. In Proc. of the 16th Text Retrieval Conference (TREC), 2007.

[2] G. Amati, G. Amodeo, M. Bianchi, C. Gaibisso, and G. Gambosi. A uniform theoretic approach to opinion and information retrieval. In G. Armano, M. de Gemmis, G. Semeraro, and E. Vargiu, editors, Intelligent Information Access, Studies in Computational Intelligence. Springer, to appear.

[3] J. Skomorowski and O. Vechtomova. Ad hoc retrieval of documents with topical opinion. In G. Amati, C. Carpineto, and G. Romano, editors, ECIR, volume 4425 of Lecture Notes in Computer Science, pages 405-417. Springer, 2007.

[4] K. Eguchi and V. Lavrenko. Sentiment retrieval using generative models. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 345-354, Sydney, Australia, July 2006. Association for Computational Linguistics.

[5] G. Mishne. Multiple ranking strategies for opinion retrieval in blogs. In The Fifteenth Text REtrieval Conference (TREC 2006) Proceedings, 2006.

[6] C. Macdonald, I. Ounis, and I. Soboroff. Overview of the TREC-2007 blog track. In Proc. of the 16th Text Retrieval Conference (TREC), 2007.

[7] C. Macdonald and I. Ounis. The TREC Blogs06 collection: Creating and analysing a blog test collection. Technical report, University of Glasgow, Scotland, UK, 2006.

[8] I. Ounis, M. de Rijke, C. Macdonald, G. A. Mishne, and I. Soboroff. Overview of the TREC-2006 blog track. In TREC 2006 Working Notes, 2006.

[9] I. Ounis, C. Macdonald, and I. Soboroff. On the TREC blog track. In Proc. of the 2nd International Conference on Weblogs and Social Media (ICWSM), 2008.

[10] I. Ounis, C. Macdonald, and I. Soboroff. Overview of the TREC-2008 blog track. In Proc. of the 17th Text Retrieval Conference (TREC), 2008.

[11] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135, 2008.

[12] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proc. of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pages 79-86, 2002.

[13] I. Soboroff and D. Harman. Overview of the TREC 2003 novelty track. In TREC, pages 38-53, 2003.