=Paper=
{{Paper
|id=Vol-2008/paper_5
|storemode=property
|title=The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and Students
|pdfUrl=https://ceur-ws.org/Vol-2008/paper_5.pdf
|volume=Vol-2008
|authors=Tetsuya Sakai
|dblpUrl=https://dblp.org/rec/conf/ntcir/Sakai17b
}}
==The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and Students==
Tetsuya Sakai, Waseda University, tetsuyasakai@acm.org

EVIA 2017, co-located with NTCIR-13, Tokyo, Japan. © 2017 Copyright held by the author. Copying permitted for private and academic purposes.

ABSTRACT
This paper reports on a case study on the inter-assessor disagreements in the English NTCIR-13 We Want Web (WWW) collection. For each of our 50 topics, pooled documents were independently judged by three assessors: two "lancers" and one Waseda University student. A lancer is a worker hired through a Japanese part time job matching website, where the hirer is required to rate the quality of the lancer's work upon task completion and therefore the lancer has a reputation to maintain. Nine lancers and five students were hired in total; the hourly pay was the same for all assessors. On the whole, the inter-assessor agreement between two lancers is statistically significantly higher than that between a lancer and a student. We then compared the system rankings and statistical significance test results according to different qrels versions created by changing which assessors to rely on: overall, the outcomes do differ according to the qrels versions, and those that rely on multiple assessors have a higher discriminative power than those that rely on a single assessor. Furthermore, we consider removing topics with relatively low inter-assessor agreements from the original topic set: we thus rank systems using 27 high-agreement topics, after removing 23 low-agreement topics. While the system ranking with the full topic set and that with the high-agreement set are statistically equivalent, the ranking with the high-agreement set and that with the low-agreement set are not. Moreover, the low-agreement set substantially underperforms the full and the high-agreement sets in terms of discriminative power. Hence, from a statistical point of view, our results suggest that a high-agreement topic set is more useful for finding concrete research conclusions than a low-agreement one.

CCS CONCEPTS
• Information systems → Retrieval effectiveness

KEYWORDS
inter-assessor agreement; p-values; relevance assessments; statistical significance

1 INTRODUCTION
While IR researchers often view laboratory IR evaluation results as something objective, at the core of any laboratory IR experiment lie the relevance assessments, which are the result of subjective judgements of documents by a person, or multiple persons, based on a particular (interpretation of an) information need. Hence it is of utmost importance for IR researchers to understand the effects of the subjective nature of the relevance assessment process on the final IR evaluation results.

This paper reports on a case study on the inter-assessor disagreements in a recently-constructed ad hoc web search test collection, namely, the English NTCIR-13 We Want Web (WWW) collection [10]. For each of our 50 topics, pooled documents were independently judged by three assessors: two "lancers" and one Waseda University student. A lancer is a worker hired through a Japanese part time job matching website (http://www.lancers.jp/, in Japanese; see also https://www.techinasia.com/lancers-produces-200-million-freelancing-gigs-growing, in English), where the hirer is required to rate the quality of the lancer's work upon task completion and therefore the lancer has a reputation to maintain (the lancer in turn rates the hirer, so the hirer also has a reputation to maintain on the website). Nine lancers and five students were hired in total; the hourly pay was the same for all assessors. On the whole, the inter-assessor agreement between two lancers is statistically significantly higher than that between a lancer and a student (Section 3). We then compared the system rankings and statistical significance test results according to different qrels versions created by changing which assessors to rely on: overall, the outcomes do differ according to the qrels versions, and those that rely on multiple assessors have a higher discriminative power (i.e., the ability to obtain many statistically significant system pairs [14, 15]) than those that rely on a single assessor (Section 4.1). Furthermore, we consider removing topics with relatively low inter-assessor agreements from the original topic set: we thus rank systems using 27 high-agreement topics, after removing 23 low-agreement topics. While the system ranking with the full topic set and that with the high-agreement set are statistically equivalent, the ranking with the high-agreement set and that with the low-agreement set are not. Moreover, the low-agreement set substantially underperforms the full and the high-agreement sets in terms of discriminative power (Section 4.2). Hence, from a statistical point of view, our results suggest that a high-agreement topic set is more useful for finding concrete research conclusions than a low-agreement one.

2 RELATED WORK/NOVELTY OF OUR WORK
Studies on the effect of inter-assessor (dis)agreement on IR system evaluation have a long history; Bailey et al. [2] provides a concise survey on this topic covering the period 1969-2008. More recent work in the literature includes Carterette and Soboroff [4], Webber, Chandar, and Carterette [21], Demeester et al. [5], Megorskaya, Kukushkin, and Serdyukov [12], Wang et al. [20], Ferrante, Ferro, and Maistro [6], and Maddalena et al. [11]. Among these studies, the work of Voorhees [19] from 2000 (or the earlier version reported at SIGIR 1998) is probably one of the most well-known; below, we first highlight the differences between her work and the present study, since the primary research question of the present study is whether her well-known findings generalise to our new test collection with experimental settings that are quite different from hers in several ways. After that, we also briefly compare the present study with the recent, closely-related work of Maddalena et al. [11] from ICTIR 2017.

Voorhees [19] examined the effect of using different qrels versions on ad hoc IR system evaluation. Her experiments used the TREC-4 and TREC-6 data (the document collections are disks 2 and 3 for TREC-4, and disks 4 and 5 for TREC-6 [8]). In particular, in her experiments with the 50 TREC-4 topics, she hired two additional assessors in addition to the primary assessor who created the topic, and discussed the pairwise inter-assessor agreement in terms of overlap as well as recall and precision: overlap is defined as the size of the intersection of two relevant sets divided by the size of the union; recall and precision are defined by treating one of the relevant sets as the gold data. However, it was not quite the case that the three assessors judged the same document pool independently: the document sets provided to the additional assessors were created after the primary assessment, by mixing both relevant and nonrelevant documents from the primary assessor's judgements. Moreover, all documents judged relevant by the primary assessor but not included in the document set for the additional assessors were counted towards the set intersection when computing the inter-assessor agreement. Her TREC-6 experiments relied on a different setting, where the University of Waterloo created their own pools and relevance assessments independent of the original pools and assessments. She considered binary relevance only (the original Waterloo assessments were on a tertiary scale, but were collapsed into binary for her analysis), and therefore she considered Average Precision and Recall at 1000 as effectiveness evaluation measures. Her main conclusion was: "The actual value of the effectiveness measure was affected by the different conditions, but in each case the relative performance of the retrieved runs was almost always the same. These results validate the use of the TREC test collections for comparative retrieval experiments."

The present study differs from that of Voorhees in the following aspects at least:
• We use a new English web search test collection constructed for the NTCIR-13 WWW task, with depth-30 pools.
• For each of our 50 topics, the same pool was completely independently judged by three assessors. Nine assessors were hired through the lancers website, and an additional five assessors were hired at Waseda University, so that each topic was judged by two lancers and one student.
• We collected graded relevance assessments from each assessor: highly relevant (2 points), relevant (1 point), nonrelevant (0), and error (0) for cases where the web pages to judge could not be displayed. When consolidating the multiple assessments, the raw scores were added to form more fine-grained graded relevance data.
• We use graded relevance measures at cutoff 10 (representing the quality of the first search engine result page), namely nDCG@10, Q@10, and nERR@10 [17], which are the official measures of the WWW task.
• As our topics were sampled from a query log, none of our assessors are the topic originators (or "primary" [19] or "gold" assessors [2]). The assessors were not provided with any information other than the query (e.g., a narrative field [1, 8]): the definition for a highly relevant document was "it is likely that the user who entered this search query will find this page relevant"; that for a relevant document was "it is possible that the user who entered this search query will find this page relevant" [10].
• We discuss inter-assessor agreement and system ranking agreement using statistical tools, namely, linear weighted κ with 95%CIs (which, unlike raw overlap measures, takes chance agreement into account [2]) and Kendall's τ with 95%CIs. Moreover, we employ the randomised Tukey HSD test [3, 16] to discuss the discrepancies in statistical significance test results. Furthermore, we consider removing topics that appear to be unreliable in terms of inter-assessor agreement.

While the recent work of Maddalena et al. [11] addressed several research questions related to inter-assessor agreement, one aspect of their study is closely related to our analysis with high-agreement and low-agreement topic sets. Maddalena et al. utilised the TREC 2010 Relevance Feedback track data and exactly five different relevance assessments for each ClueWeb document, and used Krippendorff's α [9] to quantify the inter-assessor agreement. They defined high-agreement and low-agreement topics based on Krippendorff's α, and reported that high-agreement topics can predict the system ranking with the full topic set more accurately than low-agreement topics. The analysis in the present study differs from the above as discussed below:
• Krippendorff's α disregards which assessments came from which assessors, as it is a measure of the overall reliability of the data. In the present study, where we only have three assessors, we are more interested in the agreement between every pair of assessors and hence utilise Cohen's linear weighted κ. Hence our definition of a high/low-agreement topic differs from that of Maddalena et al.: according to our definition, a topic is in high agreement if the κ is statistically significantly positive for every pair of assessors.
• While Maddalena et al. discussed the absolute effectiveness scores and system rankings only, the present study discusses statistical significance testing after replacing the full topic set with the high/low-agreement set.
• Maddalena et al. focussed on nDCG; we discuss the three aforementioned official measures of the NTCIR-13 WWW task [10].
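For concreteness, the set-based agreement statistics used in the Voorhees study discussed above can be sketched as follows: overlap is the intersection of two assessors' relevant sets divided by their union, while recall and precision treat one assessor's relevant set as the gold data. This is a minimal illustrative sketch; the document IDs are toy values.

<pre>
def overlap(rel_a: set, rel_b: set) -> float:
    """Size of the intersection of two relevant sets divided by the size of the union."""
    union = rel_a | rel_b
    return len(rel_a & rel_b) / len(union) if union else 1.0

def recall_precision(rel_other: set, rel_gold: set):
    """Recall and precision of one assessor's relevant set against the other as gold."""
    inter = len(rel_other & rel_gold)
    recall = inter / len(rel_gold) if rel_gold else 0.0
    precision = inter / len(rel_other) if rel_other else 0.0
    return recall, precision

# Two assessors' relevant sets for the same topic (toy document IDs).
assessor1 = {"d1", "d2", "d3", "d5"}
assessor2 = {"d2", "d3", "d4"}
print(overlap(assessor1, assessor2))           # 2/5 = 0.4
print(recall_precision(assessor2, assessor1))  # (0.5, 0.667)
</pre>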
3 DATA
The NTCIR-13 WWW English subtask created 100 test topics; 13 runs were submitted from three teams. We acknowledge that this is a clear limitation of the present study: we would have liked a larger number of runs from a larger number of teams. However, we claim that this limitation invalidates neither our approach to analysing inter-assessor disagreement nor the actual results on the system ranking and statistical significance. We hope to repeat the same analysis on a larger set of runs in the next round of the WWW task.

For evaluating the 13 runs submitted to the WWW task, we created a depth-30 pool for each of the 100 topics, and this resulted in a total of 22,912 documents to judge. We hired nine lancers who speak English through the lancers website: the job call and the relevance assessment instructions were published on the website in English. None of them had any prior experience in relevance assessments. Topics were assigned at random to the nine assessors so that each topic had two independent judgments from two lancers. The official relevance assessments of the WWW task were formed by consolidating the two lancer scores: since each lancer gave 0, 1, or 2, the final relevance levels were L0-L4.

For the present study, we focus on a subset of the above test set, which contains 50 topics whose topic IDs are odd numbers. The number of pooled documents for this topic set is 11,214. We then hired five students from the Department of Computer Science and Engineering, Waseda University, to provide a third set of judgments for each of the 50 topics. The instructions given to them were identical to those given to the lancers. The students also did not have any prior experience in relevance assessments. Moreover, lancers and students all received an hourly pay of 1,200 Japanese Yen. However, hiring lancers is more expensive, because we have to pay about 20% to Lancers the company on top of what we pay to the individual lancers. The purpose of collecting the third set of assessments was to compare the lancer-lancer inter-assessor agreement with the lancer-student agreement, which should shed some light on the reliability of the different assessor types. All of the assessors completed the work in about one month.

It should be noted that all of our assessors are "bronze" according to the definition by Bailey et al. [2]: they are neither topic originators nor topic experts.

To quantify inter-assessor agreement, we compute Cohen's linear weighted κ for every assessor pair, where, for example, the weight for a (0, 2) disagreement is 2 and that for a (0, 1) or (1, 2) disagreement is 1. (Fleiss' κ [7], designed for more than two assessors, is applicable to nominal categories only; the same goes for Randolph's κ free [13]; see also our discussion on Krippendorff's α [9] in Section 2.) It should be noted that κ represents how much agreement there is beyond chance.
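As a concrete illustration of this weighting scheme, here is a minimal sketch that computes a linear weighted Cohen's κ from a 3×3 confusion matrix of graded labels (the function name and code layout are ours); applied to the lancer1 vs. lancer2 counts in Table 1 below, it reproduces the reported value of 0.336. No interval estimate is attempted here, since the construction of the 95%CIs is not restated in this sketch.

<pre>
import numpy as np

def linear_weighted_kappa(conf):
    """Linear weighted Cohen's kappa for a square confusion matrix.

    Disagreement weights are |i - j|: 1 for a (0,1) or (1,2) disagreement
    and 2 for a (0,2) disagreement, as described above.
    """
    conf = np.asarray(conf, dtype=float)
    obs = conf / conf.sum()                                     # observed proportions
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))            # chance-expected proportions
    k = conf.shape[0]
    w = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))   # linear disagreement weights
    return 1.0 - (w * obs).sum() / (w * exp).sum()

# Table 1(a): lancer1 (rows) vs. lancer2 (columns), labels 0/1/2.
table1a = [[3991, 1354, 487],
           [947, 1260, 882],
           [447, 1047, 799]]
print(round(linear_weighted_kappa(table1a), 3))  # 0.336
</pre>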
Table 1: Pairwise inter-assessor agreement: lancer1, lancer2, and student.
(a) lancer1 vs. lancer2: raw scores (rows = lancer1 labels 0/1/2, columns = lancer2 labels 0/1/2): (3991, 1354, 487), (947, 1260, 882), (447, 1047, 799); linear weighted Cohen's κ with 95%CI: 0.336 [0.322, 0.351]; mean (min/max) per-topic linear weighted Cohen's κ: 0.293 (−0.007/0.683); #topics where the per-topic κ is not statistically significantly positive: 8; binary Cohen's κ with 95%CI: 0.424 [0.407, 0.441]; binary raw agreement: 0.712.
(b) lancer1 vs. student: raw scores: (3406, 1540, 886), (1051, 1100, 938), (416, 787, 1090); linear weighted Cohen's κ with 95%CI: 0.283 [0.268, 0.298]; mean (min/max) per-topic κ: 0.211 (−0.128/0.583); #topics where the per-topic κ is not statistically significantly positive: 15; binary Cohen's κ with 95%CI: 0.309 [0.292, 0.327]; binary raw agreement: 0.653.
(c) lancer2 vs. student: raw scores: (3215, 1232, 938), (1203, 1415, 1043), (455, 780, 933); linear weighted Cohen's κ with 95%CI: 0.261 [0.246, 0.276]; mean (min/max) per-topic κ: 0.225 (−0.054/0.918); #topics where the per-topic κ is not statistically significantly positive: 19; binary Cohen's κ with 95%CI: 0.314 [0.296, 0.331]; binary raw agreement: 0.659.

Table 1 summarises the inter-assessor agreement results. The "raw scores" entries show the 3×3 confusion matrices for each pair of assessors; the counts were summed across topics, although lancer1, lancer2, and student are actually not single persons. The linear weighted κ's were computed based on these matrices. It can be observed that the lancer-lancer κ is statistically significantly higher than the lancer-student κ's, which means that the lancers agree with each other more than they do with students. While the lack of gold data prevents us from concluding that lancers are more reliable than students, it does suggest that lancers are worth hiring if we are looking for high inter-assessor agreement. As we shall see in Section 4.1, the discriminative power results also support this observation.

We also computed per-topic linear weighted κ's so that the assessments of exactly three individuals are compared against one another: the mean, minimum and maximum values are also provided in Table 1. It can be observed that the lowest per-topic κ observed is −0.128, for "lancer1 vs. student"; the 95%CI for this instance was [−0.262, 0.0006], suggesting the lack of agreement beyond chance. (Negative κ's are not unusual in the context of inter-assessor agreement: for example, according to a figure from Bailey et al. [2], when a "gold" assessor (i.e., topic originator) was compared with a bronze assessor, a version of κ was in the [−0.6, −0.4] range for one topic, despite the fact that the assessors must have read the narrative fields of the TREC Enterprise 2007 test collection [1].)

Table 1 also shows the number of topics for which the per-topic κ's were not statistically significantly positive, that is, the 95%CI lower limits were not positive, as exemplified by the above instance. These numbers indicate that the lancer-lancer agreements were statistically significantly positive for 50 − 8 = 42 topics, while the lancer-student agreements were statistically significantly positive for only 35 (31) topics. Again, the lancers agree with each other more than they do with students.

There were 27 topics where all three per-topic κ's were statistically significantly positive; for the remaining 23 topics, at least one per-topic κ was not. We shall refer to the set of the former 27 topics as the high-agreement set and the latter as the low-agreement set. We shall utilise these subsets in Section 4.2.
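The selection rule for these two subsets can be stated compactly in code. The sketch below uses hypothetical per-topic 95%CI lower limits (the per-topic values themselves are not listed in Table 1): a topic enters the high-agreement set only if every assessor pair shows statistically significantly positive agreement.

<pre>
# Hypothetical per-topic 95%CI lower limits of the linear weighted kappa for
# the three assessor pairs (illustrative values only).
ci_lower = {
    "0001": {"lancer1-lancer2": 0.31, "lancer1-student": 0.12, "lancer2-student": 0.08},
    "0003": {"lancer1-lancer2": 0.26, "lancer1-student": -0.05, "lancer2-student": 0.14},
}

def is_high_agreement(pair_ci_lower):
    # High agreement: every assessor pair's per-topic kappa is statistically
    # significantly positive, i.e. its 95%CI lower limit is above zero.
    return all(lower > 0 for lower in pair_ci_lower.values())

high_agreement = sorted(t for t, cis in ci_lower.items() if is_high_agreement(cis))
low_agreement = sorted(t for t, cis in ci_lower.items() if not is_high_agreement(cis))
# With the real per-topic data this yields 27 high-agreement and 23 low-agreement topics.
</pre>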
Also in Table 1, the binary Cohen's κ row shows the κ values after collapsing the 3×3 matrices into 2×2 matrices by treating highly relevant and relevant as just relevant. Again, the lancer-lancer κ is statistically significantly higher than the lancer-student κ's. Finally, the table shows the raw agreement based on the 2×2 confusion matrices: the counts of (0, 0) and (1, 1) are divided by those of (0, 0), (0, 1), (1, 0), and (1, 1). It can be observed that only the lancer-lancer agreement exceeds 70%.

We summed the raw scores of lancer1, lancer2, and student to form a qrels set which we call all3; we also summed the raw scores of lancer1 and lancer2 to form a qrels set which we call 2lancers. Table 2 shows the distribution of documents across the relevance levels. Note that all3 and 2lancers are on 7-point and 5-point scales, respectively, while the others are on a 3-point scale. In this way, we preserve the views of individual assessors instead of collapsing the assessments into binary or forcing them to reach a consensus. Note that nDCG@10, Q@10, and nERR@10 can fully utilise the rich relevance assessments. As we shall see in the next section, this approach to combining multiple relevance assessments is beneficial. For alternatives to simply summing up the raw assessor scores, we refer the reader to Maddalena et al. [11] and Sakai [18]: these approaches are beyond the scope of the present study.

Table 2: Number of Lx-relevant documents in each qrels.
all3: L0 2,603; L1 1,897; L2 2,135; L3 1,535; L4 1,537; L5 1,035; L6 472; total judged 11,214
2lancers: L0 3,991; L1 2,301; L2 2,194; L3 1,929; L4 799; total judged 11,214
lancer1: L0 5,832; L1 3,089; L2 2,293; total judged 11,214
lancer2: L0 5,385; L1 3,661; L2 2,168; total judged 11,214
student: L0 4,873; L1 3,427; L2 2,914; total judged 11,214
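To make the consolidation step concrete, here is a minimal sketch of summing per-assessor labels into the all3 and 2lancers qrels; the data structures and toy values are ours, and the sketch assumes (as in our setting) that all assessors judged the same pooled documents.

<pre>
from collections import defaultdict

def merge_qrels(*assessor_qrels):
    """Sum per-assessor graded labels into one fine-grained qrels.

    Each input maps (topic_id, doc_id) to a label in {0, 1, 2}; summing the
    three assessors yields the 7-point L0-L6 scale of all3, and summing the
    two lancers yields the 5-point L0-L4 scale of 2lancers.
    """
    merged = defaultdict(int)
    for qrels in assessor_qrels:
        for key, label in qrels.items():
            merged[key] += label
    return dict(merged)

# Toy example: one topic, two pooled documents per assessor.
lancer1 = {("0001", "docA"): 2, ("0001", "docB"): 0}
lancer2 = {("0001", "docA"): 1, ("0001", "docB"): 1}
student = {("0001", "docA"): 2, ("0001", "docB"): 0}

all3 = merge_qrels(lancer1, lancer2, student)   # docA -> L5, docB -> L1
two_lancers = merge_qrels(lancer1, lancer2)     # docA -> L3, docB -> L1
</pre>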
4 RESULTS AND DISCUSSIONS

4.1 Different Qrels Versions
The previous section showed that assessors do disagree, and that the lancers agree with each other more than they do with students. This section investigates the effect of inter-assessor disagreement on system ranking and statistical significance through comparisons across the five qrels versions: all3, 2lancers, lancer1, lancer2, and student.

4.1.1 System Ranking. Figure 1 visualises the system rankings and the actual mean effectiveness scores according to the five different qrels, for nDCG@10, Q@10, and nERR@10. In each graph, the runs have been sorted by the all3 scores, and therefore if every curve is monotonically decreasing, that means all the qrels versions produce system rankings that are identical to the one based on all3. First, it can be observed that the absolute effectiveness scores differ depending on the qrels version used, just as Voorhees [19] observed with Average Precision and Recall@1000. Second, and more importantly, the five system rankings are not identical: for example, in Figure 1(c), the top performing run according to nERR@10 with all3 is only the fifth best run according to the same measure with lancer1. The nDCG@10 and Q@10 curves are relatively consistent across the qrels versions. Table 3 quantifies the above observation in terms of Kendall's τ, with 95%CIs: while the CI upper limits show that all the rankings are statistically equivalent, the widths of the CIs due to the small sample size (13 runs) suggest that the results should be viewed with caution. For example, the τ for the aforementioned case of all3 vs. lancer1 with nERR@10 is 0.821 (95%CI [0.616, 1.077]). The actual system swaps that Figure 1 shows are probably more important than these summary statistics.

[Figure 1: Mean effectiveness scores according to different qrels, with panels (a) Mean nDCG@10, (b) Mean Q@10, and (c) Mean nERR@10. The x-axis represents the run ranking according to the all3 qrels.]

Table 3: System ranking consistency in terms of Kendall's τ, with 95%CIs, for every pair of qrels (13 runs).
(a) Mean nDCG@10:
all3 vs. 2lancers: 0.974 [0.696, 1.048]; vs. lancer1: 0.923 [0.660, 1.032]; vs. lancer2: 0.949 [0.660, 1.032]; vs. student: 0.923 [0.694, 1.035]
2lancers vs. lancer1: 0.949 [0.894, 1.054]; vs. lancer2: 0.974 [0.754, 1.092]; vs. student: 0.897 [0.882, 1.053]
lancer1 vs. lancer2: 0.923 [0.822, 1.075]; vs. student: 0.846 [0.953, 1.034]
lancer2 vs. student: 0.872 [0.815, 1.069]
(b) Mean Q@10:
all3 vs. 2lancers: 0.923 [0.660, 1.018]; vs. lancer1: 0.846 [0.701, 1.028]; vs. lancer2: 0.872 [0.835, 1.049]; vs. student: 0.897 [0.606, 1.019]
2lancers vs. lancer1: 0.872 [0.784, 1.062]; vs. lancer2: 0.949 [0.660, 1.032]; vs. student: 0.872 [0.683, 1.061]
lancer1 vs. lancer2: 0.821 [0.683, 1.061]; vs. student: 0.897 [0.842, 1.055]
lancer2 vs. student: 0.821 [0.585, 1.056]
(c) Mean nERR@10:
all3 vs. 2lancers: 0.923 [0.498, 1.040]; vs. lancer1: 0.821 [0.616, 1.077]; vs. lancer2: 0.923 [0.637, 1.056]; vs. student: 0.872 [0.540, 1.050]
2lancers vs. lancer1: 0.846 [0.784, 1.062]; vs. lancer2: 0.949 [0.585, 1.056]; vs. student: 0.846 [0.784, 1.062]
lancer1 vs. lancer2: 0.795 [0.673, 1.019]; vs. student: 0.692 [0.822, 1.075]
lancer2 vs. student: 0.846 [0.590, 1.000]
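The Kendall's τ values in Table 3 each compare two rankings of the same 13 runs. Below is a minimal sketch of such a comparison using scipy.stats.kendalltau (our choice of tooling) and illustrative per-run mean scores; the construction of the 95%CIs is not covered by this sketch.

<pre>
import numpy as np
from scipy.stats import kendalltau

# Mean scores of the same 13 runs under two qrels versions (run order fixed).
# Illustrative values only; the real per-run means are those plotted in Figure 1.
mean_all3 = np.array([0.71, 0.69, 0.66, 0.64, 0.61, 0.58, 0.55,
                      0.52, 0.50, 0.47, 0.43, 0.38, 0.31])
mean_lancer1 = np.array([0.66, 0.70, 0.65, 0.61, 0.62, 0.57, 0.56,
                         0.50, 0.51, 0.46, 0.44, 0.36, 0.32])

tau, _ = kendalltau(mean_all3, mean_lancer1)
print(f"Kendall's tau over 13 runs: {tau:.3f}")
</pre>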
4.1.2 Statistically Significant Differences across Systems. The next and perhaps more important question is: how do the different qrels versions affect pairwise statistical significance test results? If the researcher is interested in the difference between every system pair, a proper multiple comparison procedure should be employed to ensure that the familywise error rate is bounded above by the significance criterion α [3]. In this study we use the distribution-free randomised Tukey HSD test using the Discpower tool (http://research.nii.ac.jp/ntcir/tools/discpower-en.html), with B = 10,000 trials [16]. The input to the tool is a topic-by-run score matrix: in our case, for every combination of evaluation measure and qrels version, we have a 50×13 matrix.
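The following is a minimal sketch of the randomised Tukey HSD procedure just described (a simple illustrative reimplementation of the idea in [3, 16], not the Discpower tool itself): within each topic, the run labels are permuted, the largest difference in mean scores across runs is recorded over B trials, and each run pair's p-value is the fraction of trials in which this null maximum reaches the pair's observed difference.

<pre>
import numpy as np

def randomised_tukey_hsd(scores, b=10000, seed=0):
    """Randomised Tukey HSD p-values for a topic-by-run score matrix."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    observed = scores.mean(axis=0)
    obs_diff = np.abs(observed[:, None] - observed[None, :])

    # Null distribution of the largest mean difference: permute run labels
    # independently within each topic and record max(mean) - min(mean).
    max_range = np.empty(b)
    for t in range(b):
        permuted = np.array([rng.permutation(row) for row in scores])
        means = permuted.mean(axis=0)
        max_range[t] = means.max() - means.min()

    # p-value for each run pair: fraction of trials whose maximum range is
    # at least as large as the pair's observed mean difference.
    return (max_range[None, None, :] >= obs_diff[:, :, None]).mean(axis=2)

# Example: a 50-topic x 13-run matrix (random placeholder scores).
scores = np.random.default_rng(1).random((50, 13))
p = randomised_tukey_hsd(scores, b=1000)
significant = {(i, j) for i in range(13) for j in range(i + 1, 13) if p[i, j] < 0.05}
</pre>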
Table 4: Statistical Significance Overlap between two qrels versions (α = 0.05). Each cell shows the number of run pairs that are statistically significantly different only under the row qrels / under both / only under the column qrels, followed by the SSO in parentheses.
(a) Mean nDCG@10:
all3 vs. 2lancers: 2/29/2 (87.9%); vs. lancer1: 4/27/3 (79.4%); vs. lancer2: 5/26/0 (83.9%); vs. student: 11/20/0 (64.5%)
2lancers vs. lancer1: 3/28/2 (84.8%); vs. lancer2: 6/25/1 (78.1%); vs. student: 12/19/1 (59.4%)
lancer1 vs. lancer2: 5/25/1 (80.6%); vs. student: 12/18/2 (56.2%)
lancer2 vs. student: 7/19/1 (70.4%)
(b) Mean Q@10:
all3 vs. 2lancers: 3/26/3 (81.2%); vs. lancer1: 6/23/7 (71.9%); vs. lancer2: 3/26/2 (83.9%); vs. student: 11/18/2 (58.1%)
2lancers vs. lancer1: 5/24/2 (77.4%); vs. lancer2: 4/25/3 (78.1%); vs. student: 14/15/5 (44.1%)
lancer1 vs. lancer2: 4/22/6 (68.8%); vs. student: 11/15/5 (48.4%)
lancer2 vs. student: 11/17/3 (54.8%)
(c) Mean nERR@10:
all3 vs. 2lancers: 3/16/1 (80.0%); vs. lancer1: 5/14/1 (70.0%); vs. lancer2: 5/14/2 (66.7%); vs. student: 7/12/0 (63.2%)
2lancers vs. lancer1: 3/14/1 (77.8%); vs. lancer2: 4/13/3 (65.0%); vs. student: 8/9/3 (45.0%)
lancer1 vs. lancer2: 3/12/4 (63.2%); vs. student: 6/9/3 (50.0%)
lancer2 vs. student: 6/10/2 (55.6%)

Table 4 shows the results of comparing the outcomes of significance test results at α = 0.05: for example, Table 4(a) shows that, in terms of nDCG@10, all3 obtained 2 + 29 = 31 statistically significantly different run pairs, while 2lancers obtained 29 + 2 = 31, and that the two qrels versions had 29 pairs in common. Thus the Statistical Significance Overlap (SSO) is 29/(2 + 29 + 2) = 87.9%. The table shows that the student results disagree relatively often with the others: for example, in Table 4(b), Q@10 with lancer1 has 11 statistically significantly different run pairs that are not statistically significantly different according to the same measure with student, while the opposite is true for five pairs. The two qrels versions have 15 pairs in common and the SSO is only 48.4%. Thus, different qrels versions can lead to different research conclusions.

Table 5: Number of significantly different run pairs deduced from Table 4 (α = 0.05).
all3: nDCG@10 31; Q@10 29; nERR@10 19
2lancers: nDCG@10 31; Q@10 29; nERR@10 17
lancer1: nDCG@10 30; Q@10 26; nERR@10 15
lancer2: nDCG@10 26; Q@10 28; nERR@10 16
student: nDCG@10 20; Q@10 20; nERR@10 12

Table 5 shows the number of statistically significantly different run pairs (i.e., discriminative power [14, 15]) deduced from Table 4. It can be observed that combining multiple assessors' labels and thereby having fine-grained relevance levels can result in high discriminative power, and also that student underperforms the others in terms of discriminative power. Thus, it appears that student is not only different from the two lancers: it also fails to provide many significantly different pairs.
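As a small worked example of the SSO computation, the sketch below reproduces the 87.9% figure for all3 vs. 2lancers under nDCG@10; the run-pair identifiers are invented, and only the 2/29/2 split matters.

<pre>
def significance_overlap(pairs_a, pairs_b):
    """Statistical Significance Overlap: shared significant run pairs / union."""
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0

# Toy pair sets with the Table 4(a) all3 vs. 2lancers split:
# 2 pairs unique to all3, 29 shared, 2 unique to 2lancers.
pairs_all3 = {("run01", f"run{i:02d}") for i in range(2, 33)}      # 31 significant pairs
pairs_2lancers = {("run01", f"run{i:02d}") for i in range(4, 35)}  # 31 pairs, 29 shared
print(f"SSO = {significance_overlap(pairs_all3, pairs_2lancers):.1%}")  # 87.9%
</pre>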
4.2 Using Reliable Topics Only
In Section 3, we defined a high-agreement set containing 27 topics and a low-agreement set containing 23 topics. A high-agreement topic is one for which every assessor pair "statistically agreed," in the sense that the 95%CI lower limit of the per-topic κ was positive. A low-agreement topic is one for which at least one assessor pair did not show any agreement beyond chance, and it is therefore deemed unreliable. While, to the best of our knowledge, this kind of close analysis of inter-assessor agreement is rarely done prior to evaluating the submitted runs, removing such topics at an early stage may be a useful practice for ensuring test collection reliability. Hence, in this section, we focus on the all3 qrels, and compare the evaluation outcomes when the full topic set (50 topics) is replaced with just the high-agreement set or even just the low-agreement set. The fact that the high-agreement and low-agreement sets are similar in sample size is highly convenient for comparing them in terms of discriminative power.

4.2.1 System Ranking. Figure 2 visualises the system rankings and the actual mean effectiveness scores according to the three topic sets, for nDCG@10, Q@10, and nERR@10. Again, in each graph, the runs have been sorted by the all3 scores (mean over 50 topics). Table 6 compares the system rankings in terms of Kendall's τ with 95%CIs. The values in bold indicate the cases where the two rankings are statistically not equivalent. It can be observed that while the system rankings by the full set and the high-agreement set are statistically equivalent, those by the full set and the low-agreement set are not. Thus, the properties of the high-agreement topics appear to be dominant in the full topic set.

[Figure 2: Mean effectiveness scores according to different topic sets, with panels (a) Mean nDCG@10, (b) Mean Q@10, and (c) Mean nERR@10. The x-axis represents the run ranking according to all3 (50 topics).]

Table 6: System ranking consistency in terms of Kendall's τ, with 95%CIs, for every pair of topic sets (13 runs).
(a) Mean nDCG@10: all3 (50 topics) vs. all3 (27 high-agreement topics): 0.872 [0.696, 1.048]; all3 (50 topics) vs. all3 (23 low-agreement topics): 0.846 [0.660, 1.032]; all3 (27 high-agreement topics) vs. all3 (23 low-agreement topics): 0.718 [0.450, 0.986]
(b) Mean Q@10: all3 (50 topics) vs. all3 (27 high-agreement topics): 0.923 [0.784, 1.062]; all3 (50 topics) vs. all3 (23 low-agreement topics): 0.846 [0.673, 1.019]; all3 (27 high-agreement topics) vs. all3 (23 low-agreement topics): 0.769 [0.566, 0.972]
(c) Mean nERR@10: all3 (50 topics) vs. all3 (27 high-agreement topics): 0.846 [0.660, 1.032]; all3 (50 topics) vs. all3 (23 low-agreement topics): 0.769 [0.545, 0.994]; all3 (27 high-agreement topics) vs. all3 (23 low-agreement topics): 0.615 [0.319, 0.912]

4.2.2 Statistically Significant Differences across Systems. Table 7 compares the outcomes of statistical significance test results (randomised Tukey HSD with B = 10,000 trials) across the three topic sets in a way similar to Table 4. Note that the two subsets are inherently less discriminative than the full set, as the sample sizes are about half that of the full set. It can be observed that the set of statistically significantly different pairs according to the high-agreement (low-agreement) set is always a subset of the set of statistically significantly different pairs according to the full topic set. More interestingly, the set of statistically significantly different pairs according to the low-agreement set is almost a subset of the set of statistically significantly different pairs according to the high-agreement set: for example, Table 7(b) shows that there is only one system pair for which the low-agreement set obtained a statistically significant difference while the high-agreement set did not in terms of Q@10.

Table 7: Statistical Significance Overlap between two topic sets (α = 0.05). Cells follow the same format as Table 4 (only the first topic set / both / only the second topic set), with the SSO in parentheses.
(a) Mean nDCG@10: all3 (50 topics) vs. all3 (27 high-agreement topics): 4/27/0 (87.1%); all3 (50 topics) vs. all3 (23 low-agreement topics): 22/9/0 (29.0%); all3 (27 high-agreement topics) vs. all3 (23 low-agreement topics): 18/9/0 (33.3%)
(b) Mean Q@10: all3 (50 topics) vs. all3 (27 high-agreement topics): 4/25/0 (86.2%); all3 (50 topics) vs. all3 (23 low-agreement topics): 19/10/0 (34.5%); all3 (27 high-agreement topics) vs. all3 (23 low-agreement topics): 16/9/1 (34.6%)
(c) Mean nERR@10: all3 (50 topics) vs. all3 (27 high-agreement topics): 4/15/0 (78.9%); all3 (50 topics) vs. all3 (23 low-agreement topics): 13/6/0 (31.6%); all3 (27 high-agreement topics) vs. all3 (23 low-agreement topics): 10/5/1 (31.2%)

Table 8: Number of significantly different run pairs deduced from Table 7 (α = 0.05).
all3 (50 topics): nDCG@10 31; Q@10 29; nERR@10 19
high-agreement (27 topics): nDCG@10 27; Q@10 25; nERR@10 15
low-agreement (23 topics): nDCG@10 9; Q@10 10; nERR@10 6

Table 8 shows the number of statistically significantly different pairs for each condition based on Table 7. Again, it can be observed that the high-agreement set is substantially more discriminative than the low-agreement set, despite the fact that the sample sizes are similar. Thus, the results suggest that, from a statistical point of view, a high-agreement topic set is more useful for finding concrete research conclusions than a low-agreement one.
5 CONCLUSIONS AND FUTURE WORK
This paper reported on a case study involving only 13 runs contributed from only three teams. Hence we do not claim that our findings will generalise; we merely hope to apply the same methodology to test collections that will be created for the future rounds of the WWW task and possibly even other tasks. Our main findings using the English NTCIR-13 WWW test collection are as follows:
• Lancer-lancer inter-assessor agreements are statistically significantly higher than lancer-student agreements. The student qrels is less discriminative than the lancers qrels. While the lack of gold data prevents us from concluding which type of assessor is more reliable, these results suggest that hiring lancers has some merit despite the extra cost.
• Different qrels versions based on different (combinations of) assessors can lead to somewhat different system rankings and statistical significance test results. Combining multiple assessors' labels to form fine-grained relevance levels is beneficial in terms of discriminative power.
• Removing 23 low-agreement topics (in terms of inter-assessor agreement) from the full set of 50 topics prior to evaluating runs did not have a major impact on the evaluation results, as the properties of the 27 high-agreement topics are dominant in the full set. However, replacing the high-agreement set with the low-agreement set resulted in a statistically significantly different system ranking, and substantially lower discriminative power. Hence, from a statistical point of view, our results suggest that a high-agreement topic set is more useful for finding concrete research conclusions than a low-agreement one.

ACKNOWLEDGEMENTS
I thank the PLY team (Peng Xiao, Lingtao Li, Yimeng Fan) of my laboratory for developing the PLY relevance assessment tool and collecting the assessments. I also thank the NTCIR-13 WWW task organisers and participants for making this study possible.

REFERENCES
[1] Peter Bailey, Nick Craswell, Arjen P. de Vries, and Ian Soboroff. 2008. Overview of the TREC 2007 Enterprise Track. In Proceedings of TREC 2007.
[2] Peter Bailey, Nick Craswell, Ian Soboroff, Paul Thomas, Arjen P. de Vries, and Emine Yilmaz. 2008. Relevance Assessment: Are Judges Exchangeable and Does It Matter? In Proceedings of ACM SIGIR 2008. 667–674.
[3] Ben Carterette. 2012. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. ACM TOIS 30, 1 (2012).
[4] Ben Carterette and Ian Soboroff. 2010. The Effect of Assessor Errors on IR System Evaluation. In Proceedings of ACM SIGIR 2010. 539–546.
[5] Thomas Demeester, Robin Aly, Djoerd Hiemstra, Dong Nguyen, Dolf Trieschnigg, and Chris Develder. 2014. Exploiting User Disagreement for Web Search Evaluation: An Experimental Approach. In Proceedings of ACM WSDM 2014. 33–42.
[6] Marco Ferrante, Nicola Ferro, and Maria Maistro. 2017. AWARE: Exploiting Evaluation Measures to Combine Multiple Assessors. ACM TOIS 36, 2 (2017).
[7] Joseph L. Fleiss. 1971. Measuring Nominal Scale Agreement among Many Raters. Psychological Bulletin 76, 5 (1971), 378–382.
[8] Donna K. Harman. 2005. The TREC Ad Hoc Experiments. In TREC: Experiment and Evaluation in Information Retrieval, Ellen M. Voorhees and Donna K. Harman (Eds.). The MIT Press, Chapter 4.
[9] Klaus Krippendorff. 2013. Content Analysis: An Introduction to Its Methodology (Third Edition). Sage Publications.
[10] Cheng Luo, Tetsuya Sakai, Yiqun Liu, Zhicheng Dou, Chenyan Xiong, and Jingfang Xu. 2017. Overview of the NTCIR-13 WWW Task. In Proceedings of NTCIR-13.
[11] Eddy Maddalena, Kevin Roitero, Gianluca Demartini, and Stefano Mizzaro. 2017. Considering Assessor Agreement in IR Evaluation. In Proceedings of ACM ICTIR 2017. 75–82.
[12] Olga Megorskaya, Vladimir Kukushkin, and Pavel Serdyukov. 2015. On the Relation between Assessor's Agreement and Accuracy in Gamified Relevance Assessment. In Proceedings of ACM SIGIR 2015. 605–614.
[13] Justus J. Randolph. 2005. Free-Marginal Multirater Kappa (Multirater κ free): An Alternative to Fleiss' Fixed Marginal Multirater Kappa. In Joensuu Learning and Instruction Symposium 2005.
[14] Tetsuya Sakai. 2006. Evaluating Evaluation Metrics Based on the Bootstrap. In Proceedings of ACM SIGIR 2006. 525–532.
[15] Tetsuya Sakai. 2007. Alternatives to Bpref. In Proceedings of ACM SIGIR 2007. 71–78.
[16] Tetsuya Sakai. 2012. Evaluation with Informational and Navigational Intents. In Proceedings of WWW 2012. 499–508.
[17] Tetsuya Sakai. 2014. Metrics, Statistics, Tests. In PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173). 116–163.
[18] Tetsuya Sakai. 2017. Unanimity-Aware Gain for Highly Subjective Assessments. In Proceedings of EVIA 2017.
[19] Ellen Voorhees. 2000. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. Information Processing and Management (2000), 697–716.
[20] Yulu Wang, Garrick Sherman, Jimmy Lin, and Miles Efron. 2015. Assessor Differences and User Preferences in Tweet Timeline Generation. In Proceedings of ACM SIGIR 2015. 615–624.
[21] William Webber, Praveen Chandar, and Ben Carterette. 2012. Alternative Assessor Disagreement and Retrieval Depth. In Proceedings of ACM CIKM 2012. 125–134.