<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Unanimity-Aware Gain for Highly Subjective Assessments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tetsuya Sakai</string-name>
          <email>tetsuyasakai@acm.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Waseda University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>39</fpage>
      <lpage>42</lpage>
      <abstract>
        <p>IR tasks have diversied: human assessments of items such as social media posts can be highly subjective, in which case it becomes necessary to hire many assessors per item to reect their diverse views. For example, the value of a tweet for a given purpose may be judged by (say) ten assessors, and their ratings could be summed up to dene its gain value for computing a graded-relevance evaluation measure. In the present study, we propose a simple variant of this approach, which takes into account the fact that some items receive unanimous ratings while others are more controversial. We generate simulated ratings based on a real social-media-based IR task data to examine the eect of our unanimity-aware approach on the system ranking and on statistical signicance. Our results show that incorporating unanimity can aect statistical signicance test results even when its impact on the gain value is kept to a minimum. Moreover, since our simulated ratings do not consider the correlation present in the assessors' actual ratings, our experiments probably underestimate the eect of introducing unanimity into evaluation. Hence, if researchers accept that unanimous votes should be valued more highly than controversial ones, then our proposed approach may be worth incorporating.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Retrieval effectiveness;</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        In traditional test-collection-based IR experiments, we often rely on
our experience which says that system rankings would remain
stable even if the set of document relevance assessments is replaced
by another [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. However, IR tasks have diversied: human
assessments of items such as social media posts can be highly subjective,
in which case it becomes necessary to hire many assessors per item
to reect their diverse views. For example, the value of a tweet for
a given purpose may be judged by (say) ten assessors, and their
ratings could be summed up to dene its gain value for computing
a graded-relevance evaluation measure (e.g. [
        <xref ref-type="bibr" rid="ref11 ref8">8, 11</xref>
        ]). In the present
study, we propose a simple variant of this approach, which takes
into account the fact that some items receive unanimous ratings
while others are more controversial. We generate simulated
ratings based on a real social-media-based IR task data to examine
the eect of our unanimity-aware approach on the system ranking
and on statistical signicance. Our results show that incorporating
Copying permied for private and academic purposes.
      </p>
      <p>
        EVIA 2017, co-located with NTCIR-13, Tokyo, Japan.
© 2017 Copyright held by the author.
unanimity can aect statistical signicance test results even when
its impact on the gain value is kept to a minimum. Moreover, since
our simulated ratings do not consider the correlation present in
the assessors' actual ratings, our experiments probably
underestimate the effect of introducing unanimity into evaluation. Hence,
if researchers accept that unanimous votes should be valued more
highly than controversial ones, then our proposed approach may
be worth incorporating.
Due to lack of space, we refer the reader to Sakai [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for a short
overview of studies related to inter-assessor agreement. Below, we
briefly discuss two studies that help us to explain the novelty of
our approach to utilising multiple relevance assessments.
      </p>
      <p>
        Megorskaya et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] studied the benet of communication
between multiple assessors in the context of gamied relevance
assessment for web search evaluation. e premise in their work is
that every document needs to nally receive a single relevance level,
as a result of a consensus between the assessors or an overruling
by a “referee” etc. is is in contrast to our work, where we are
interested in assessment tasks where there may be no such thing
as the correct assessment, and therefore it is important to preserve
dierent subjective views in the data and in evaluation.
      </p>
      <p>
        Turpin et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] propose to use magnitude estimation in
document relevance assessments in order to obtain ratio-scale judgments
instead of the traditional ordinal- or interval-scale ones, and to
interpret the ratio-scale judgments directly as the gain values for
computing normalised discounted cumulative gain (nDCG) and
expected reciprocal rank (ERR). This is achieved by instructing
the assessor to give an arbitrary score to his first document and
subsequently to give a “relative” score to each of the remaining
documents, where “relative” means “in comparison to the preceding
document.” While their approach and ours both produce
continuous relevance assessments, unanimity across judges was not within
the scope of their study.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Social Media Assessments</title>
      <p>
        Wang et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] examined the eect of assesser dierences in the
context of the TREC Tweet Timeline Generation task, by devising
two sets of tweet equivalence classes constructed by dierent
assessors. eir conclusion is similar to that of Voorhees [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] who
examined the effect of document relevance assessor differences:
despite the substantial differences in the two sets of clusters, system
rankings and the absolute evaluation measure scores based on these
two sets were very similar. Sakai et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used graded-relevance
measures to evaluate a community QA answer ranking task; each
answer was assessed by four assessors, and its gain value for
computing the measures was determined as the sum of the assessors' grades.
More recently, Shang et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] reported on the NTCIR-12 Short
Text Conversation task, which is basically a tweet retrieval task: in
their Japanese subtask, the sum of scores from ten assessors was
used to define the gain value of each tweet. Note that these studies
do not take into account whether the assessors are unanimous or
not; the sum is all that matters.
The recent work of Li and Yoshikawa [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is similar in spirit to ours,
and deserves a detailed explanation. They consider the problem
of assessing the similarity between two documents using many
assessors, and propose to incorporate what they call
“confusability” into measures such as Pearson's correlation. Specifically, when
computing a correlation value, they propose to weight each labelled
item i by 1 − ci, where ci is a normalised measure of confusability
based on the difference (Di) between the highest and the lowest
ratings for item i, and so on1. Li and Yoshikawa remark that the
same idea can be applied to other measures such as nDCG, although
they do not provide any details: here, let us try to faithfully apply
their idea to ranked retrieval evaluation based on a group of
assessors. Let N be the number of assessors per item, and suppose
that each assessor assigns to each item a rating on a scale of
0, 1, …, Dmax. A straightforward way to define the final relevance
level or the actual gain value for each item would be to just sum
up the ratings [
        <xref ref-type="bibr" rid="ref11 ref8">8, 11</xref>
        ]: then we would have relevance levels from
0 to N·Dmax. For any item i with N independent assessments, let
RawGi denote the gain value thus obtained. The above approach
of Li and Yoshikawa suggests that we modify each gain value as
follows: WGi = (1 − ci)·RawGi    (1).
      </p>
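As an illustration, the confusability weighting described above could be implemented as follows; this is a minimal sketch, and the function name and the normalisation ci = Di/Dmax are our assumptions (Li and Yoshikawa also consider standard deviation and entropy for ci):

```python
# Confusability-weighted gain for one item, following the adaptation of
# Li and Yoshikawa's idea described above. ASSUMPTION: the normalised
# confusability is taken to be c_i = D_i / Dmax, the simplest choice.

def confusability_weighted_gain(ratings, d_max):
    """ratings: the N per-assessor ratings of one item, each in 0..d_max."""
    raw_g = sum(ratings)                # RawG_i: sum of the N ratings
    d_i = max(ratings) - min(ratings)   # D_i: spread of the ratings
    c_i = d_i / d_max                   # normalised confusability (assumed form)
    return (1 - c_i) * raw_g            # weight the gain by 1 - c_i

# Unanimous ratings keep their full gain; maximally controversial
# ratings lose it entirely.
print(confusability_weighted_gain([2, 2, 2, 2, 2], 2))  # 10.0
print(confusability_weighted_gain([0, 2, 2, 2, 2], 2))  # 0.0
```

Note how a single dissenting 0 among four top ratings wipes out the gain under this weighting; this behaviour is what the additive formulation proposed later avoids.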
      <p>
        More recently, at ICTIR 2017, Maddalena et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed an
evaluation approach whose motivation is almost identical as ours:
they also claim that the distribution of the scores from dierent
assessors should be utilised for IR evaluation. More specically,
they propose to replace a gain value of a document with an interval
of gain values or even with a distribution of gain values, so that the
1 Li and Yoshikawa [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] also considered using standard deviation and entropy to quantify
ci , but the present study focusses on the simplest case that relies on Di as we believe
that evaluation methods should be as simple as possible.
nal evaluation measures are also intervals or distributions. ey
call their measures agreement-aware measures.
      </p>
      <p>In contrast to their novel approaches, our proposal simply utilises
the original score distribution across assessors to adjust the gain
value of each document so that a traditional evaluation measure
can be computed. It remains to be seen how the interval and
distribution measures of Maddalena et al. can eectively be utilised in
IR evaluation venues such as CLEF, NTCIR and TREC.
3</p>
    </sec>
    <sec id="sec-4">
      <title>PROPOSED METHOD</title>
      <p>Our proposal is very simple and highly intuitive. Given a constant
p (0 ≤ p ≤ 1), let us define the unanimity-aware gain as follows:
UGi = RawGi + pN (Dmax − Di)    (2)</p>
      <p>if RawGi &gt; 0; otherwise UGi = RawGi = 0. Here, Dmax − Di is a
simple measure of unanimity, where Di is, as before, the difference
between the maximum and the minimum among the N ratings2.
The constant p controls the impact of unanimity on the gain. Thus,
while we mainly want to reflect RawGi in our evaluation, we apply
an “upgrade” according to the degree of unanimity. When the
ratings of an item are perfectly unanimous (i.e., Di = 0), we are
giving it an extra pN·Dmax; that is, we shall pretend that pN extra
assessors gave it the highest possible rating.</p>
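Eq. 2 translates directly into code; the following is a minimal sketch (function and variable names are ours):

```python
# Unanimity-aware gain (Eq. 2): UG_i = RawG_i + p * N * (Dmax - D_i)
# when RawG_i > 0, and UG_i = RawG_i = 0 otherwise.

def unanimity_aware_gain(ratings, d_max, p=0.2):
    """ratings: the N per-assessor ratings of one item, each in 0..d_max."""
    raw_g = sum(ratings)                  # RawG_i
    if raw_g == 0:
        return 0                          # rated zero by every assessor
    n = len(ratings)
    d_i = max(ratings) - min(ratings)     # D_i: max minus min rating
    return raw_g + p * n * (d_max - d_i)  # apply the unanimity "upgrade"

# Perfect unanimity (D_i = 0) earns the full upgrade p*N*Dmax, as if
# p*N extra assessors gave the highest rating:
print(unanimity_aware_gain([2, 2, 2, 2, 2], d_max=2, p=0.2))  # 10 + 0.2*5*2 = 12.0
print(unanimity_aware_gain([0, 2, 2, 2, 2], d_max=2, p=0.2))  # 8 + 0.2*5*0 = 8.0
```

Unlike the multiplicative weighting sketched earlier, the upgrade is additive, so a controversial item keeps its raw gain and only forgoes the bonus.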
      <p>For Items 1–3 shown in Table 1, p = 0.2 implies that UG1 =
13, UG2 = 11, UG3 = 10; this is clearly more intuitive than the WGi
values. On the other hand, consider Items 5–7 in Table 1: note that
if p = 0.2, UG5 = UG6 = UG7 = 3. If this is not desirable, p = 0.1
may be used instead, as shown in the same table; however, we shall
discuss how to set an appropriate p elsewhere with real assessments
in our future work. Hereafter, we only consider a modest impact by
letting p = 0.2, as the focus of the present study is to demonstrate
that our approach has a practical impact on experimental results
even with a small p.</p>
      <p>Our approach suggests a slight departure from traditional IR
evaluation at the implementation level as well. In traditional IR,
we usually prepare discrete relevance levels (e.g. relevant, highly
relevant, etc.) to define the gold standard: we know the number
of relevance levels in advance, and we map each relevance level
to a gain value at the time of measure calculation. In contrast, our
approach suggests that we retain the individual ratings in the test
collection, from which gain values can be computed on the fly; there
is no longer the notion of a predefined set of relevance levels. It is
easy to see that the highest possible value of UGi is (1 + p)N·Dmax.
Fortunately, there is a readily available IR evaluation tool that
accommodates not only relevance-level-based computation but also
direct gain-value-based computation, as we shall discuss below.</p>
    </sec>
    <sec id="sec-5">
      <title>EXPERIMENTS</title>
      <p>
        Let us demonstrate the eect of introducing unanimity-aware gain
to an IR task where the ratings of the items can be highly
subjective. To this end, we chose to use the recent NTCIR-12 Short Text
Conversation (STC) Chinese subtask data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for the following
reasons: (1) STC requires the system to return a “reasonable” tweet
as a response to a human tweet, and the assessments are expected
to be highly subjective; (2) STC was the largest task of NTCIR-12,
with 44 runs from 16 teams for the Chinese subtask3. The STC
Chinese test collection contains 100 topics (i.e., input Weibo tweets)
with relevance assessments (“qrels”) containing the following
relevance levels: L0 (judged nonrelevant); L1 (relevant); and L2 (highly
relevant).
2 Variants are of course possible: for example, we could obtain the maximum and
minimum values after removing outlier ratings.
      </p>
      <p>From the ocial qrels, we created 15 simulated variants with
Dmax 2 f2; 4; 8g and N 2 f5; 10; 20; 40; 80g, as follows. For each
judged tweet of each topic, L0 is replaced with ¹0; 0; : : :º; whereas,
both L1 and L2 are replaced with N simulated ratings obtained by
random sampling from a uniform distribution over »0; Dmax ¼. We
then compute the unanimity-aware gains using Eq. 2, and evaluate
up to top 10 Weibo tweets from each run. Note that randomly
sampling N times implies that as N gets large, we are more likely
to obtain both 0 and Dmax among the N observations and therefore
Di is more likely to be Dmax , i.e., UGi is more likely to reduce
to RawGi (Eq. 2). In contrast, real ratings of dierent assessors
are probably correlated with one another, which should generally
make Di smaller than our simulated ratings. Hence, this experiment
probably underestimates the impact of introducing unanimity-aware
gain into evaluation.</p>
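The simulation procedure above can be sketched as follows (a toy illustration; qrels parsing and the evaluation-tool interface are omitted, and the function name is ours):

```python
import random

# Simulate N per-assessor ratings for one judged tweet, as described above:
# L0 tweets get all-zero ratings; L1/L2 tweets get N ratings drawn
# uniformly at random from the integers 0..Dmax.
def simulate_ratings(relevance_level, n, d_max, rng=random):
    if relevance_level == 0:                          # L0: judged nonrelevant
        return [0] * n
    return [rng.randint(0, d_max) for _ in range(n)]  # L1/L2: uniform draws

# As N grows, both 0 and Dmax are increasingly likely to appear among the
# N draws, so D_i tends to Dmax and the unanimity upgrade tends to vanish:
rng = random.Random(0)
for n in (5, 10, 20, 40, 80):
    ratings = simulate_ratings(1, n, 2, rng)
    print(n, max(ratings) - min(ratings))   # D_i for one simulated tweet
```

Real assessors' ratings would be correlated, keeping the spread Di smaller than these independent draws, which is why the experiment is described as an underestimate.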
      <p>
        We use the three ocial measures from STC: nG@1 (normalised
gain at rank 1)4, P+ (see below), and nERR (normalised expected
reciprocal rank) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. While the ocial STC Chinese subtask used the
NTCIREVAL5 toolkit by giving a gain value of 1 to each L1-relevant
tweet and 3 to each L2-relevant one, we utilised an alternative
functionality of the same tool, which enables us to feed gain values of
relevant items directly to it without considering the number of
relevance levels. is feature was already available in NTCIREVAL for
the purpose of accommodating the global gain proposed by Sakai
and Song [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which is an idea for obtaining a real-valued gain
for each relevant web page for search result diversification
evaluation. In their work, global gain was computed from intent-aware
probabilities and per-intent graded relevance assessments.
      </p>
      <p>
        P+, an ocial measure from STC but nevertheless less
wellknown than nDCG and nERR, deserves a brief explanation here.
Just like nERR, it is a measure suitable for navigational intents. Just
as Average Precision (AP) employs (binary) precision to measure
the utility of the top r documents for a user group who abandon
the ranked list at r , P+ employs the blended ratio, which combines
precision and cumulative gain, for the same purpose. Furthermore,
just as AP assumes that the distribution of users abandoning the
ranked list is uniform across all relevant documents (even if some
of them are not retrieved) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], P+ assumes that the distribution
is uniform over all relevant documents ranked at or above rp, the
preferred rank [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Given a ranked list, the preferred rank is the
rank of the most relevant document that is closest to the top. In our
case, the preferred rank is the rank of the document in the res file
that has the highest UGi value and is closest to the top. Both AP
and P+ represent the expected utility over their user abandonment
distributions.
3 The Japanese subtask actually collected N = 10 individual ratings for each tweet,
but had only 25 runs from seven teams. We plan to use this data set as well in a
follow-up study.
4 Note that neither the Discounting nor the Cumulation of “nDCG” applies at rank 1.
5 http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html
      </p>
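The preferred rank used by P+ is straightforward to compute from the gain values of a ranked list; a minimal sketch (1-based ranks, function name ours):

```python
# Preferred rank r_p: the rank of the most relevant document (here, the
# highest gain value) that is closest to the top of the ranked list.
def preferred_rank(gains):
    """gains: gain values (e.g. UG_i) of a ranked list, top first."""
    best = max(gains)
    return gains.index(best) + 1  # earliest position achieving the maximum

# The maximum gain 9 first appears at rank 3, so r_p = 3:
print(preferred_rank([5, 2, 9, 9, 0]))  # 3
```

P+ then averages the blended ratio over ranks 1..r_p under its uniform abandonment assumption; only the preferred-rank step is shown here.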
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND DISCUSSIONS</title>
      <p>Table 2 compares the system rankings based on RawGi vs. UGi
in terms of Kendall's τ with 95% confidence intervals, for nG@1,
P+ and nERR averaged over the 100 official Chinese STC topics. It
can be observed that all of the upper confidence limits are above
one, meaning that the system rankings based on RawGi and UGi
are statistically equivalent. However, except where the 95% CIs are
“[1, 1],” the two rankings are not identical, even with p = 0.2. Recall
also that we should expect to see lower rank correlations if we use
real assessors' ratings with correlations among them.</p>
      <p>
        Probably a more practical concern than the change in the
overall system ranking is: does the proposed method affect statistical
significance test results? If a researcher is interested in
comparing every system pair, then conducting a pairwise test such as the
paired t-test repeatedly (without correcting α) is not the correct
approach: one elegant solution would be to use the randomised
Tukey HSD (Honestly Significant Difference) test [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which is
free from distributional assumptions and ensures that the
family-wise error rate (i.e., the probability of incorrectly obtaining a
statistically significant difference for at least one system pair) is α.
We use the Discpower6 toolkit to conduct the randomised Tukey
HSD test on each topic-by-run score matrix, with B = 5,000 trials
for each test. The STC Chinese subtask had 16 participating teams,
and one run from each team (specifically, the best run in terms of
the official Mean nG@1 score) is considered in this analysis, giving us
16 × 15 / 2 = 120 comparisons. Do RawGi and UGi give us similar
p-values and similar research conclusions?
      </p>
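The randomisation idea behind the test can be sketched as follows; this is a simplified illustration of the principle, not the Discpower implementation (names and parameters are ours):

```python
import random

# Simplified randomised Tukey HSD sketch: under the null hypothesis the
# system labels are exchangeable within each topic, so each trial permutes
# every topic's scores across systems and records the largest pairwise
# difference in mean scores. Comparing every observed pairwise difference
# against one shared null distribution controls the family-wise error rate.
def randomised_tukey_hsd(matrix, b=1000, alpha=0.05, rng=random):
    n_sys = len(matrix[0])  # matrix: one row per topic, one column per system

    def mean_diffs(m):
        means = [sum(col) / len(m) for col in zip(*m)]
        return [abs(means[i] - means[j])
                for i in range(n_sys) for j in range(i + 1, n_sys)]

    max_diffs = []
    for _ in range(b):
        shuffled = []
        for row in matrix:
            row = list(row)
            rng.shuffle(row)             # permute scores within each topic
            shuffled.append(row)
        max_diffs.append(max(mean_diffs(shuffled)))
    max_diffs.sort()
    threshold = max_diffs[int((1 - alpha) * b)]  # (1 - alpha) null quantile
    # a pair is significant if its observed mean difference beats the
    # threshold; pairs are ordered (0,1), (0,2), ..., (n-2, n-1)
    return [d > threshold for d in mean_diffs(matrix)]

# Example: system 0 beats two equal rivals on every one of 20 topics.
print(randomised_tukey_hsd([[1, 0, 0]] * 20, b=200, rng=random.Random(0)))
```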
      <p>
        Table 3 summarises the discrepancies between the significance
test results with RawGi and those with UGi: these are the
comparisons where the difference is statistically significant at α = 0.05
according to one while not significant according to the other.
p-values, absolute score differences (|dXY|), and effect sizes (ESHSD)
are also shown; ESHSD is computed by dividing |dXY| by the
residual standard deviation of each experimental condition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]7. It can
be observed that the effect of introducing unanimity-aware gain
cannot be overlooked, even with p = 0.2.
6 http://research.nii.ac.jp/ntcir/tools/discpower-en.html
7 This form of effect size measures the difference between two systems in standard
deviation units; unlike the p-value, it is not a function of the sample size.
      </p>
      <p>For example, when
Dmax = 2 and N = 5, there are three discrepancies between nG@1
based on RawGi and that based on UGi among the 120 comparisons.
However, as anticipated in Section 4, the impact of introducing
unanimity is not observed for N = 40 or 80.
Again, with real ratings that tend to resemble one another and
make Di smaller than these random ratings do, we will probably
observe a more substantial impact of introducing unanimity-aware
gain into evaluation.</p>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>We proposed a simple and intuitive approach to incorporating the
assessors' subjective yet unanimous decisions into gain-value-based
retrieval evaluation, and demonstrated that this will affect
experimental outcomes. Our results show that incorporating unanimity
can affect statistical significance test results even when its impact
on the gain value is kept to a minimum. Moreover, since our
simulated ratings do not consider the correlation present in the assessors'
actual ratings, our experiments probably underestimate the effect
of introducing unanimity-aware gain into evaluation. Hence, if
researchers accept that unanimous votes should be valued more
highly than controversial ones, then our proposed approach may
be worth incorporating. We also demonstrated how the proposed
approach of directly feeding gain values to an existing evaluation
tool can be accomplished, while bypassing the notion of discrete
relevance levels.</p>
      <p>
        Following the present study, the proposed unanimity-aware gain
approach was applied to the recent NTCIR-13 Short Text
Conversation (STC-2) Chinese subtask [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], with p = 0.2. There, according
to the randomised Tukey HSD test, three extra statistically
significantly different system pairs were obtained by using
unanimity-aware nG@1 instead of the traditional nG@1; one extra statistically
significantly different system pair was obtained by using
unanimity-aware P+ instead of the traditional P+. Thus the sets of statistically
significantly different system pairs according to the
unanimity-aware approach were supersets of the corresponding sets based
on the traditional gain values. However, there were only N = 3
      </p>
      <p>In future work, we would like to apply our approach to diverse</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ben</given-names>
            <surname>Carterette</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Multiple testing in statistical analysis of systems-based information retrieval experiments</article-title>
          .
          <source>ACM TOIS 30</source>
          ,
          <issue>1</issue>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jiyi</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Masatoshi</given-names>
            <surname>Yoshikawa</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Evaluation with Confusable Ground Truth</article-title>
          .
          <source>In Proceedings of AIRS 2016 (LNCS 9994)</source>
          .
          <fpage>363</fpage>
          -
          <lpage>369</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Eddy</given-names>
            <surname>Maddalena</surname>
          </string-name>
          , Kevin Roitero, Gianluca Demartini, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Considering Assessor Agreement in IR Evaluation</article-title>
          .
          <source>In Proceedings of ACM ICTIR</source>
          <year>2017</year>
          .
          <volume>75</volume>
          -
          <fpage>82</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Olga</given-names>
            <surname>Megorskaya</surname>
          </string-name>
          , Vladimir Kukushkin, and
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>On the Relation between Assessor's Agreement and Accuracy in Gamified Relevance Assessment</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2015</year>
          .
          <volume>605</volume>
          -
          <fpage>614</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Stephen E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>A New Interpretation of Average Precision</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2008</year>
          .
          <volume>689</volume>
          -
          <fpage>690</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2014</year>
          . Statistical Reform in Information Retrieval?
          <source>SIGIR Forum 48</source>
          ,
          <issue>1</issue>
          (
          <year>2014</year>
          ),
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and Students</article-title>
          .
          <source>In Proceedings of EVIA</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          , Daisuke Ishikawa, Noriko Kando, Yohei Seki, Kazuko Kuriyama, and
          <string-name>
            <surname>Chin-Yew Lin</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Using Graded-Relevance Metrics for Evaluating Community QA Answer Selection</article-title>
          .
          <source>In Proceedings of ACM WSDM</source>
          <year>2011</year>
          .
          <volume>187</volume>
          -
          <fpage>196</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ruihua</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Evaluating Diversied Search Results Using Per-Intent Graded Relevance</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2011</year>
          .
          <volume>1043</volume>
          -
          <fpage>1052</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Lifeng</given-names>
            <surname>Shang</surname>
          </string-name>
          , Tetsuya Sakai,
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ryuichiro</given-names>
            <surname>Higashinaka</surname>
          </string-name>
          , Yusuke Miyao, Yuki Arase, and
          <string-name>
            <given-names>Masako</given-names>
            <surname>Nomoto</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the NTCIR-13 Short Text Conversation Task</article-title>
          .
          <source>In Proceedings of NTCIR-13.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Lifeng</given-names>
            <surname>Shang</surname>
          </string-name>
          , Tetsuya Sakai, Zhengdong Lu,
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ryuichiro</given-names>
            <surname>Higashinaka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yusuke</given-names>
            <surname>Miyao</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Overview of the NTCIR-12 Short Text Conversation Task</article-title>
          .
          <source>In Proceedings of NTCIR-12</source>
          .
          <fpage>473</fpage>
          -
          <lpage>484</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Turpin</surname>
          </string-name>
          , Falk Scholer, Stefano Mizzaro, and
          <string-name>
            <given-names>Eddy</given-names>
            <surname>Maddalena</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>e Benets of Magnitude Estimation Relevance Assessments for Information Retrieval Evaluation</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2015</year>
          .
          <volume>565</volume>
          -
          <fpage>574</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Ellen M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>1998</year>
          .
          <volume>315</volume>
          -
          <fpage>323</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Yulu</given-names>
            <surname>Wang</surname>
          </string-name>
          , Garrick Sherman,
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Miles</given-names>
            <surname>Efron</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Assessor Dierences and User Preferences in Tweet Timeline Generation</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2015</year>
          .
          <volume>615</volume>
          -
          <fpage>624</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>