=Paper=
{{Paper
|id=Vol-2008/paper_5
|storemode=property
|title=The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and Students
|pdfUrl=https://ceur-ws.org/Vol-2008/paper_5.pdf
|volume=Vol-2008
|authors=Tetsuya Sakai
|dblpUrl=https://dblp.org/rec/conf/ntcir/Sakai17b
}}
==The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and Students==
Tetsuya Sakai, Waseda University, tetsuyasakai@acm.org

EVIA 2017, co-located with NTCIR-13, Tokyo, Japan. © 2017 Copyright held by the author. Copying permitted for private and academic purposes.

ABSTRACT
This paper reports on a case study on the inter-assessor disagreements in the English NTCIR-13 We Want Web (WWW) collection. For each of our 50 topics, pooled documents were independently judged by three assessors: two "lancers" and one Waseda University student. A lancer is a worker hired through a Japanese part time job matching website, where the hirer is required to rate the quality of the lancer's work upon task completion and therefore the lancer has a reputation to maintain. Nine lancers and five students were hired in total; the hourly pay was the same for all assessors. On the whole, the inter-assessor agreement between two lancers is statistically significantly higher than that between a lancer and a student. We then compared the system rankings and statistical significance test results according to different qrels versions created by changing which assessors to rely on: overall, the outcomes do differ according to the qrels versions, and those that rely on multiple assessors have a higher discriminative power than those that rely on a single assessor. Furthermore, we consider removing topics with relatively low inter-assessor agreements from the original topic set: we thus rank systems using 27 high-agreement topics, after removing 23 low-agreement topics. While the system ranking with the full topic set and that with the high-agreement set are statistically equivalent, the ranking with the high-agreement set and that with the low-agreement set are not. Moreover, the low-agreement set substantially underperforms the full and the high-agreement sets in terms of discriminative power. Hence, from a statistical point of view, our results suggest that a high-agreement topic set is more useful for finding concrete research conclusions than a low-agreement one.

CCS CONCEPTS
• Information systems → Retrieval effectiveness

KEYWORDS
inter-assessor agreement; p-values; relevance assessments; statistical significance

1 INTRODUCTION
While IR researchers often view laboratory IR evaluation results as something objective, at the core of any laboratory IR experiment lie the relevance assessments, which are the result of subjective judgements of documents by a person, or multiple persons, based on a particular (interpretation of an) information need. Hence it is of utmost importance for IR researchers to understand the effects of the subjective nature of the relevance assessment process on the final IR evaluation results.

This paper reports on a case study on the inter-assessor disagreements in a recently-constructed ad hoc web search test collection, namely, the English NTCIR-13 We Want Web (WWW) collection [10]. For each of our 50 topics, pooled documents were independently judged by three assessors: two "lancers" and one Waseda University student. A lancer is a worker hired through a Japanese part time job matching website (http://www.lancers.jp/, in Japanese; see also https://www.techinasia.com/lancers-produces-200-million-freelancing-gigs-growing, in English), where the hirer is required to rate the quality of the lancer's work upon task completion and therefore the lancer has a reputation to maintain (the lancer in turn rates the hirer, so the hirer also has a reputation to maintain on the website). Nine lancers and five students were hired in total; the hourly pay was the same for all assessors. On the whole, the inter-assessor agreement between two lancers is statistically significantly higher than that between a lancer and a student (Section 3). We then compared the system rankings and statistical significance test results according to different qrels versions created by changing which assessors to rely on: overall, the outcomes do differ according to the qrels versions, and those that rely on multiple assessors have a higher discriminative power (i.e., the ability to obtain many statistically significant system pairs [14, 15]) than those that rely on a single assessor (Section 4.1). Furthermore, we consider removing topics with relatively low inter-assessor agreements from the original topic set: we thus rank systems using 27 high-agreement topics, after removing 23 low-agreement topics. While the system ranking with the full topic set and that with the high-agreement set are statistically equivalent, the ranking with the high-agreement set and that with the low-agreement set are not. Moreover, the low-agreement set substantially underperforms the full and the high-agreement sets in terms of discriminative power (Section 4.2). Hence, from a statistical point of view, our results suggest that a high-agreement topic set is more useful for finding concrete research conclusions than a low-agreement one.

2 RELATED WORK/NOVELTY OF OUR WORK
Studies on the effect of inter-assessor (dis)agreement on IR system evaluation have a long history; Bailey et al. [2] provides a concise survey on this topic covering the period 1969-2008. More recent work in the literature includes Carterette and Soboroff [4], Webber, Chandar, and Carterette [21], Demeester et al. [5], Megorskaya, Kukushkin, and Serdyukov [12], Wang et al. [20], Ferrante, Ferro, and Maistro [6], and Maddalena et al. [11]. Among these studies, the work of Voorhees [19] from 2000 (or the earlier version reported at SIGIR 1998) is probably one of the most well-known; below, we first highlight the differences between her work and the present study, since the primary research question of the present study is whether her well-known findings generalise to our new test collection with experimental settings that are quite different from hers in several ways. After that, we also briefly compare the present study with the recent, closely-related work of Maddalena et al. [11] from ICTIR 2017.

Voorhees [19] examined the effect of using different qrels versions on ad hoc IR system evaluation. Her experiments used the TREC-4 and TREC-6 data (the document collections are disks 2 and 3 for TREC-4, and disks 4 and 5 for TREC-6 [8]). In particular, in her experiments with the 50 TREC-4 topics, she hired two additional assessors in addition to the primary assessor who created the topic, and discussed the pairwise inter-assessor agreement in terms of overlap as well as recall and precision: overlap is defined as the size of the intersection of two relevant sets divided by the size of the union; recall and precision are defined by treating one of the relevant sets as the gold data. However, it was not quite the case that the three assessors judged the same document pool independently: the document sets provided to the additional assessors were created after the primary assessment, by mixing both relevant and nonrelevant documents from the primary assessor's judgements. Moreover, all documents judged relevant by the primary assessor but not included in the document set for the additional assessors were counted towards the set intersection when computing the inter-assessor agreement. Her TREC-6 experiments relied on a different setting, where the University of Waterloo created their own pools and relevance assessments independent of the original pools and assessments. She considered binary relevance only (the original Waterloo assessments were on a tertiary scale, but were collapsed into binary for her analysis), and therefore she considered Average Precision and Recall at 1000 as effectiveness evaluation measures. Her main conclusion was: "The actual value of the effectiveness measure was affected by the different conditions, but in each case the relative performance of the retrieved runs was almost always the same. These results validate the use of the TREC test collections for comparative retrieval experiments."

The present study differs from that of Voorhees in the following aspects at least:
• We use a new English web search test collection constructed for the NTCIR-13 WWW task, with depth-30 pools.
• For each of our 50 topics, the same pool was completely independently judged by three assessors. Nine assessors were hired through the lancers website, and an additional five assessors were hired at Waseda University, so that each topic was judged by two lancers and one student.
• We collected graded relevance assessments from each assessor: highly relevant (2 points), relevant (1 point), nonrelevant (0), and error (0) for cases where the web pages to judge could not be displayed. When consolidating the multiple assessments, the raw scores were added to form more fine-grained graded relevance data.
• We use graded relevance measures at cutoff 10 (representing the quality of the first search engine result page), namely nDCG@10, Q@10, and nERR@10 [17], which are the official measures of the WWW task.
• As our topics were sampled from a query log, none of our assessors are the topic originators (or "primary" [19] or "gold" assessors [2]). The assessors were not provided with any information other than the query (e.g., a narrative field [1, 8]): the definition for a highly relevant document was "it is likely that the user who entered this search query will find this page relevant"; that for a relevant document was "it is possible that the user who entered this search query will find this page relevant" [10].
• We discuss inter-assessor agreement and system ranking agreement using statistical tools, namely, linear weighted κ with 95%CIs (which, unlike raw overlap measures, takes chance agreement into account [2]) and Kendall's τ with 95%CIs. Moreover, we employ the randomised Tukey HSD test [3, 16] to discuss the discrepancies in statistical significance test results. Furthermore, we consider removing topics that appear to be unreliable in terms of inter-assessor agreement.

While the recent work of Maddalena et al. [11] addressed several research questions related to inter-assessor agreement, one aspect of their study is closely related to our analysis with high-agreement and low-agreement topic sets. Maddalena et al. utilised the TREC 2010 Relevance Feedback track data and exactly five different relevance assessments for each ClueWeb document, and used Krippendorff's α [9] to quantify the inter-assessor agreement. They defined high-agreement and low-agreement topics based on Krippendorff's α, and reported that high-agreement topics can predict the system ranking with the full topic set more accurately than low-agreement topics. The analysis in the present study differs from the above as discussed below:
• Krippendorff's α disregards which assessments came from which assessors, as it is a measure of the overall reliability of the data. In the present study, where we only have three assessors, we are more interested in the agreement between every pair of assessors and hence utilise Cohen's linear weighted κ. Hence our definition of a high/low-agreement topic differs from that of Maddalena et al.: according to our definition, a topic is in high agreement if the κ is statistically significantly positive for every pair of assessors.
• While Maddalena et al. discussed the absolute effectiveness scores and system rankings only, the present study discusses statistical significance testing after replacing the full topic set with the high/low-agreement set.
• Maddalena et al. focussed on nDCG; we discuss the three aforementioned official measures of the NTCIR-13 WWW task [10].
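For concreteness, the set-based agreement statistics used in the Voorhees study discussed above can be sketched as follows: overlap is the intersection of two assessors' relevant sets divided by their union, while recall and precision treat one assessor's relevant set as the gold data. This is a minimal illustrative sketch; the document IDs are toy values.

<pre>
def overlap(rel_a: set, rel_b: set) -> float:
    """Size of the intersection of two relevant sets divided by the size of the union."""
    union = rel_a | rel_b
    return len(rel_a & rel_b) / len(union) if union else 1.0

def recall_precision(rel_other: set, rel_gold: set):
    """Recall and precision of one assessor's relevant set against the other as gold."""
    inter = len(rel_other & rel_gold)
    recall = inter / len(rel_gold) if rel_gold else 0.0
    precision = inter / len(rel_other) if rel_other else 0.0
    return recall, precision

# Two assessors' relevant sets for the same topic (toy document IDs).
assessor1 = {"d1", "d2", "d3", "d5"}
assessor2 = {"d2", "d3", "d4"}
print(overlap(assessor1, assessor2))           # 2/5 = 0.4
print(recall_precision(assessor2, assessor1))  # (0.5, 0.667)
</pre>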
3 DATA
The NTCIR-13 WWW English subtask created 100 test topics; 13 runs were submitted from three teams. We acknowledge that this is a clear limitation of the present study: we would have liked a larger number of runs from a larger number of teams. However, we claim that this limitation invalidates neither our approach to analysing inter-assessor disagreement nor the actual results on the system ranking and statistical significance. We hope to repeat the same analysis on a larger set of runs in the next round of the WWW task.

For evaluating the 13 runs submitted to the WWW task, we created a depth-30 pool for each of the 100 topics, and this resulted in a total of 22,912 documents to judge. We hired nine lancers who speak English through the lancers website: the job call and the relevance assessment instructions were published on the website in English. None of them had any prior experience in relevance assessments. Topics were assigned at random to the nine assessors so that each topic had two independent judgments from two lancers. The official relevance assessments of the WWW task were formed by consolidating the two lancer scores: since each lancer gave 0, 1, or 2, the final relevance levels were L0-L4.

For the present study, we focus on a subset of the above test set, which contains 50 topics whose topic IDs are odd numbers. The number of pooled documents for this topic set is 11,214. We then hired five students from the Department of Computer Science and Engineering, Waseda University, to provide a third set of judgments for each of the 50 topics. The instructions given to them were identical to those given to the lancers. The students also did not have any prior experience in relevance assessments. Moreover, lancers and students all received an hourly pay of 1,200 Japanese Yen. However, hiring lancers is more expensive, because we have to pay about 20% to Lancers the company on top of what we pay to the individual lancers. The purpose of collecting the third set of assessments was to compare the lancer-lancer inter-assessor agreement with the lancer-student agreement, which should shed some light on the reliability of the different assessor types. All of the assessors completed the work in about one month.

It should be noted that all of our assessors are "bronze" according to the definition by Bailey et al. [2]: they are neither topic originators nor topic experts.

To quantify inter-assessor agreement, we compute Cohen's linear weighted κ for every assessor pair, where, for example, the weight for a (0, 2) disagreement is 2 and that for a (0, 1) or (1, 2) disagreement is 1. (Fleiss' κ [7], designed for more than two assessors, is applicable to nominal categories only; the same goes for Randolph's κ free [13]; see also our discussion on Krippendorff's α [9] in Section 2.) It should be noted that κ represents how much agreement there is beyond chance.
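As a concrete illustration of this weighting scheme, here is a minimal sketch that computes a linear weighted Cohen's κ from a 3×3 confusion matrix of graded labels (the function name and code layout are ours); applied to the lancer1 vs. lancer2 counts in Table 1 below, it reproduces the reported value of 0.336. No interval estimate is attempted here, since the construction of the 95%CIs is not restated in this sketch.

<pre>
import numpy as np

def linear_weighted_kappa(conf):
    """Linear weighted Cohen's kappa for a square confusion matrix.

    Disagreement weights are |i - j|: 1 for a (0,1) or (1,2) disagreement
    and 2 for a (0,2) disagreement, as described above.
    """
    conf = np.asarray(conf, dtype=float)
    obs = conf / conf.sum()                                     # observed proportions
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))            # chance-expected proportions
    k = conf.shape[0]
    w = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))   # linear disagreement weights
    return 1.0 - (w * obs).sum() / (w * exp).sum()

# Table 1(a): lancer1 (rows) vs. lancer2 (columns), labels 0/1/2.
table1a = [[3991, 1354, 487],
           [947, 1260, 882],
           [447, 1047, 799]]
print(round(linear_weighted_kappa(table1a), 3))  # 0.336
</pre>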
Table 1: Pairwise inter-assessor agreement: lancer1, lancer2, and student.
(a) lancer1 vs. lancer2: raw scores (rows = lancer1 labels 0/1/2, columns = lancer2 labels 0/1/2): (3991, 1354, 487), (947, 1260, 882), (447, 1047, 799); linear weighted Cohen's κ with 95%CI: 0.336 [0.322, 0.351]; mean (min/max) per-topic linear weighted Cohen's κ: 0.293 (−0.007/0.683); #topics where the per-topic κ is not statistically significantly positive: 8; binary Cohen's κ with 95%CI: 0.424 [0.407, 0.441]; binary raw agreement: 0.712.
(b) lancer1 vs. student: raw scores: (3406, 1540, 886), (1051, 1100, 938), (416, 787, 1090); linear weighted Cohen's κ with 95%CI: 0.283 [0.268, 0.298]; mean (min/max) per-topic κ: 0.211 (−0.128/0.583); #topics where the per-topic κ is not statistically significantly positive: 15; binary Cohen's κ with 95%CI: 0.309 [0.292, 0.327]; binary raw agreement: 0.653.
(c) lancer2 vs. student: raw scores: (3215, 1232, 938), (1203, 1415, 1043), (455, 780, 933); linear weighted Cohen's κ with 95%CI: 0.261 [0.246, 0.276]; mean (min/max) per-topic κ: 0.225 (−0.054/0.918); #topics where the per-topic κ is not statistically significantly positive: 19; binary Cohen's κ with 95%CI: 0.314 [0.296, 0.331]; binary raw agreement: 0.659.

Table 1 summarises the inter-assessor agreement results. The "raw scores" entries show the 3×3 confusion matrices for each pair of assessors; the counts were summed across topics, although lancer1, lancer2, and student are actually not single persons. The linear weighted κ's were computed based on these matrices. It can be observed that the lancer-lancer κ is statistically significantly higher than the lancer-student κ's, which means that the lancers agree with each other more than they do with students. While the lack of gold data prevents us from concluding that lancers are more reliable than students, it does suggest that lancers are worth hiring if we are looking for high inter-assessor agreement. As we shall see in Section 4.1, the discriminative power results also support this observation.

We also computed per-topic linear weighted κ's so that the assessments of exactly three individuals are compared against one another: the mean, minimum and maximum values are also provided in Table 1. It can be observed that the lowest per-topic κ observed is −0.128, for "lancer1 vs. student"; the 95%CI for this instance was [−0.262, 0.0006], suggesting the lack of agreement beyond chance. (Negative κ's are not unusual in the context of inter-assessor agreement: for example, according to a figure from Bailey et al. [2], when a "gold" assessor (i.e., topic originator) was compared with a bronze assessor, a version of κ was in the [−0.6, −0.4] range for one topic, despite the fact that the assessors must have read the narrative fields of the TREC Enterprise 2007 test collection [1].)

Table 1 also shows the number of topics for which the per-topic κ's were not statistically significantly positive, that is, the 95%CI lower limits were not positive, as exemplified by the above instance. These numbers indicate that the lancer-lancer agreements were statistically significantly positive for 50 − 8 = 42 topics, while the lancer-student agreements were statistically significantly positive for only 35 (31) topics. Again, the lancers agree with each other more than they do with students.

There were 27 topics where all three per-topic κ's were statistically significantly positive; for the remaining 23 topics, at least one per-topic κ was not. We shall refer to the set of the former 27 topics as the high-agreement set and the latter as the low-agreement set. We shall utilise these subsets in Section 4.2.
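The selection rule for these two subsets can be stated compactly in code. The sketch below uses hypothetical per-topic 95%CI lower limits (the per-topic values themselves are not listed in Table 1): a topic enters the high-agreement set only if every assessor pair shows statistically significantly positive agreement.

<pre>
# Hypothetical per-topic 95%CI lower limits of the linear weighted kappa for
# the three assessor pairs (illustrative values only).
ci_lower = {
    "0001": {"lancer1-lancer2": 0.31, "lancer1-student": 0.12, "lancer2-student": 0.08},
    "0003": {"lancer1-lancer2": 0.26, "lancer1-student": -0.05, "lancer2-student": 0.14},
}

def is_high_agreement(pair_ci_lower):
    # High agreement: every assessor pair's per-topic kappa is statistically
    # significantly positive, i.e. its 95%CI lower limit is above zero.
    return all(lower > 0 for lower in pair_ci_lower.values())

high_agreement = sorted(t for t, cis in ci_lower.items() if is_high_agreement(cis))
low_agreement = sorted(t for t, cis in ci_lower.items() if not is_high_agreement(cis))
# With the real per-topic data this yields 27 high-agreement and 23 low-agreement topics.
</pre>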
Also in Table 1, the binary Cohen's κ row shows the κ values after collapsing the 3×3 matrices into 2×2 matrices by treating highly relevant and relevant as just relevant. Again, the lancer-lancer κ is statistically significantly higher than the lancer-student κ's. Finally, the table shows the raw agreement based on the 2×2 confusion matrices: the counts of (0, 0) and (1, 1) are divided by those of (0, 0), (0, 1), (1, 0), and (1, 1). It can be observed that only the lancer-lancer agreement exceeds 70%.

We summed the raw scores of lancer1, lancer2, and student to form a qrels set which we call all3; we also summed the raw scores of lancer1 and lancer2 to form a qrels set which we call 2lancers. Table 2 shows the distribution of documents across the relevance levels. Note that all3 and 2lancers are on 7-point and 5-point scales, respectively, while the others are on a 3-point scale. In this way, we preserve the views of individual assessors instead of collapsing the assessments into binary or forcing them to reach a consensus. Note that nDCG@10, Q@10, and nERR@10 can fully utilise the rich relevance assessments. As we shall see in the next section, this approach to combining multiple relevance assessments is beneficial. For alternatives to simply summing up the raw assessor scores, we refer the reader to Maddalena et al. [11] and Sakai [18]: these approaches are beyond the scope of the present study.

Table 2: Number of Lx-relevant documents in each qrels.
all3: L0 2,603; L1 1,897; L2 2,135; L3 1,535; L4 1,537; L5 1,035; L6 472; total judged 11,214
2lancers: L0 3,991; L1 2,301; L2 2,194; L3 1,929; L4 799; total judged 11,214
lancer1: L0 5,832; L1 3,089; L2 2,293; total judged 11,214
lancer2: L0 5,385; L1 3,661; L2 2,168; total judged 11,214
student: L0 4,873; L1 3,427; L2 2,914; total judged 11,214
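To make the consolidation step concrete, here is a minimal sketch of summing per-assessor labels into the all3 and 2lancers qrels; the data structures and toy values are ours, and the sketch assumes (as in our setting) that all assessors judged the same pooled documents.

<pre>
from collections import defaultdict

def merge_qrels(*assessor_qrels):
    """Sum per-assessor graded labels into one fine-grained qrels.

    Each input maps (topic_id, doc_id) to a label in {0, 1, 2}; summing the
    three assessors yields the 7-point L0-L6 scale of all3, and summing the
    two lancers yields the 5-point L0-L4 scale of 2lancers.
    """
    merged = defaultdict(int)
    for qrels in assessor_qrels:
        for key, label in qrels.items():
            merged[key] += label
    return dict(merged)

# Toy example: one topic, two pooled documents per assessor.
lancer1 = {("0001", "docA"): 2, ("0001", "docB"): 0}
lancer2 = {("0001", "docA"): 1, ("0001", "docB"): 1}
student = {("0001", "docA"): 2, ("0001", "docB"): 0}

all3 = merge_qrels(lancer1, lancer2, student)   # docA -> L5, docB -> L1
two_lancers = merge_qrels(lancer1, lancer2)     # docA -> L3, docB -> L1
</pre>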
4 RESULTS AND DISCUSSIONS

4.1 Different Qrels Versions
The previous section showed that assessors do disagree, and that the lancers agree with each other more than they do with students. This section investigates the effect of inter-assessor disagreement on system ranking and statistical significance through comparisons across the five qrels versions: all3, 2lancers, lancer1, lancer2, and student.

4.1.1 System Ranking. Figure 1 visualises the system rankings and the actual mean effectiveness scores according to the five different qrels, for nDCG@10, Q@10, and nERR@10. In each graph, the runs have been sorted by the all3 scores, and therefore if every curve is monotonically decreasing, that means all the qrels versions produce system rankings that are identical to the one based on all3. First, it can be observed that the absolute effectiveness scores differ depending on the qrels version used, just as Voorhees [19] observed with Average Precision and Recall@1000. Second, and more importantly, the five system rankings are not identical: for example, in Figure 1(c), the top performing run according to nERR@10 with all3 is only the fifth best run according to the same measure with lancer1. The nDCG@10 and Q@10 curves are relatively consistent across the qrels versions. Table 3 quantifies the above observation in terms of Kendall's τ, with 95%CIs: while the CI upper limits show that all the rankings are statistically equivalent, the widths of the CIs due to the small sample size (13 runs) suggest that the results should be viewed with caution. For example, the τ for the aforementioned case of all3 vs. lancer1 with nERR@10 is 0.821 (95%CI [0.616, 1.077]). The actual system swaps that Figure 1 shows are probably more important than these summary statistics.

[Figure 1: Mean effectiveness scores according to different qrels, with panels (a) Mean nDCG@10, (b) Mean Q@10, and (c) Mean nERR@10. The x-axis represents the run ranking according to the all3 qrels.]

Table 3: System ranking consistency in terms of Kendall's τ, with 95%CIs, for every pair of qrels (13 runs).
(a) Mean nDCG@10:
all3 vs. 2lancers: 0.974 [0.696, 1.048]; vs. lancer1: 0.923 [0.660, 1.032]; vs. lancer2: 0.949 [0.660, 1.032]; vs. student: 0.923 [0.694, 1.035]
2lancers vs. lancer1: 0.949 [0.894, 1.054]; vs. lancer2: 0.974 [0.754, 1.092]; vs. student: 0.897 [0.882, 1.053]
lancer1 vs. lancer2: 0.923 [0.822, 1.075]; vs. student: 0.846 [0.953, 1.034]
lancer2 vs. student: 0.872 [0.815, 1.069]
(b) Mean Q@10:
all3 vs. 2lancers: 0.923 [0.660, 1.018]; vs. lancer1: 0.846 [0.701, 1.028]; vs. lancer2: 0.872 [0.835, 1.049]; vs. student: 0.897 [0.606, 1.019]
2lancers vs. lancer1: 0.872 [0.784, 1.062]; vs. lancer2: 0.949 [0.660, 1.032]; vs. student: 0.872 [0.683, 1.061]
lancer1 vs. lancer2: 0.821 [0.683, 1.061]; vs. student: 0.897 [0.842, 1.055]
lancer2 vs. student: 0.821 [0.585, 1.056]
(c) Mean nERR@10:
all3 vs. 2lancers: 0.923 [0.498, 1.040]; vs. lancer1: 0.821 [0.616, 1.077]; vs. lancer2: 0.923 [0.637, 1.056]; vs. student: 0.872 [0.540, 1.050]
2lancers vs. lancer1: 0.846 [0.784, 1.062]; vs. lancer2: 0.949 [0.585, 1.056]; vs. student: 0.846 [0.784, 1.062]
lancer1 vs. lancer2: 0.795 [0.673, 1.019]; vs. student: 0.692 [0.822, 1.075]
lancer2 vs. student: 0.846 [0.590, 1.000]
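The Kendall's τ values in Table 3 each compare two rankings of the same 13 runs. Below is a minimal sketch of such a comparison using scipy.stats.kendalltau (our choice of tooling) and illustrative per-run mean scores; the construction of the 95%CIs is not covered by this sketch.

<pre>
import numpy as np
from scipy.stats import kendalltau

# Mean scores of the same 13 runs under two qrels versions (run order fixed).
# Illustrative values only; the real per-run means are those plotted in Figure 1.
mean_all3 = np.array([0.71, 0.69, 0.66, 0.64, 0.61, 0.58, 0.55,
                      0.52, 0.50, 0.47, 0.43, 0.38, 0.31])
mean_lancer1 = np.array([0.66, 0.70, 0.65, 0.61, 0.62, 0.57, 0.56,
                         0.50, 0.51, 0.46, 0.44, 0.36, 0.32])

tau, _ = kendalltau(mean_all3, mean_lancer1)
print(f"Kendall's tau over 13 runs: {tau:.3f}")
</pre>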
4.1.2 Statistically Significant Differences across Systems. The next and perhaps more important question is: how do the different qrels versions affect pairwise statistical significance test results? If the researcher is interested in the difference between every system pair, a proper multiple comparison procedure should be employed to ensure that the familywise error rate is bounded above by the significance criterion α [3]. In this study we use the distribution-free randomised Tukey HSD test using the Discpower tool (http://research.nii.ac.jp/ntcir/tools/discpower-en.html), with B = 10,000 trials [16]. The input to the tool is a topic-by-run score matrix: in our case, for every combination of evaluation measure and qrels version, we have a 50×13 matrix.
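The following is a minimal sketch of the randomised Tukey HSD procedure just described (a simple illustrative reimplementation of the idea in [3, 16], not the Discpower tool itself): within each topic, the run labels are permuted, the largest difference in mean scores across runs is recorded over B trials, and each run pair's p-value is the fraction of trials in which this null maximum reaches the pair's observed difference.

<pre>
import numpy as np

def randomised_tukey_hsd(scores, b=10000, seed=0):
    """Randomised Tukey HSD p-values for a topic-by-run score matrix."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    observed = scores.mean(axis=0)
    obs_diff = np.abs(observed[:, None] - observed[None, :])

    # Null distribution of the largest mean difference: permute run labels
    # independently within each topic and record max(mean) - min(mean).
    max_range = np.empty(b)
    for t in range(b):
        permuted = np.array([rng.permutation(row) for row in scores])
        means = permuted.mean(axis=0)
        max_range[t] = means.max() - means.min()

    # p-value for each run pair: fraction of trials whose maximum range is
    # at least as large as the pair's observed mean difference.
    return (max_range[None, None, :] >= obs_diff[:, :, None]).mean(axis=2)

# Example: a 50-topic x 13-run matrix (random placeholder scores).
scores = np.random.default_rng(1).random((50, 13))
p = randomised_tukey_hsd(scores, b=1000)
significant = {(i, j) for i in range(13) for j in range(i + 1, 13) if p[i, j] < 0.05}
</pre>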
Table 4: Statistical Significance Overlap between two qrels versions (α = 0.05). Each cell shows the number of run pairs that are statistically significantly different only under the row qrels / under both / only under the column qrels, followed by the SSO in parentheses.
(a) Mean nDCG@10:
all3 vs. 2lancers: 2/29/2 (87.9%); vs. lancer1: 4/27/3 (79.4%); vs. lancer2: 5/26/0 (83.9%); vs. student: 11/20/0 (64.5%)
2lancers vs. lancer1: 3/28/2 (84.8%); vs. lancer2: 6/25/1 (78.1%); vs. student: 12/19/1 (59.4%)
lancer1 vs. lancer2: 5/25/1 (80.6%); vs. student: 12/18/2 (56.2%)
lancer2 vs. student: 7/19/1 (70.4%)
(b) Mean Q@10:
all3 vs. 2lancers: 3/26/3 (81.2%); vs. lancer1: 6/23/7 (71.9%); vs. lancer2: 3/26/2 (83.9%); vs. student: 11/18/2 (58.1%)
2lancers vs. lancer1: 5/24/2 (77.4%); vs. lancer2: 4/25/3 (78.1%); vs. student: 14/15/5 (44.1%)
lancer1 vs. lancer2: 4/22/6 (68.8%); vs. student: 11/15/5 (48.4%)
lancer2 vs. student: 11/17/3 (54.8%)
(c) Mean nERR@10:
all3 vs. 2lancers: 3/16/1 (80.0%); vs. lancer1: 5/14/1 (70.0%); vs. lancer2: 5/14/2 (66.7%); vs. student: 7/12/0 (63.2%)
2lancers vs. lancer1: 3/14/1 (77.8%); vs. lancer2: 4/13/3 (65.0%); vs. student: 8/9/3 (45.0%)
lancer1 vs. lancer2: 3/12/4 (63.2%); vs. student: 6/9/3 (50.0%)
lancer2 vs. student: 6/10/2 (55.6%)

Table 4 shows the results of comparing the outcomes of significance test results at α = 0.05: for example, Table 4(a) shows that, in terms of nDCG@10, all3 obtained 2 + 29 = 31 statistically significantly different run pairs, while 2lancers obtained 29 + 2 = 31, and that the two qrels versions had 29 pairs in common. Thus the Statistical Significance Overlap (SSO) is 29/(2 + 29 + 2) = 87.9%. The table shows that the student results disagree relatively often with the others: for example, in Table 4(b), Q@10 with lancer1 has 11 statistically significantly different run pairs that are not statistically significantly different according to the same measure with student, while the opposite is true for five pairs. The two qrels versions have 15 pairs in common and the SSO is only 48.4%. Thus, different qrels versions can lead to different research conclusions.

Table 5: Number of significantly different run pairs deduced from Table 4 (α = 0.05).
all3: nDCG@10 31; Q@10 29; nERR@10 19
2lancers: nDCG@10 31; Q@10 29; nERR@10 17
lancer1: nDCG@10 30; Q@10 26; nERR@10 15
lancer2: nDCG@10 26; Q@10 28; nERR@10 16
student: nDCG@10 20; Q@10 20; nERR@10 12

Table 5 shows the number of statistically significantly different run pairs (i.e., discriminative power [14, 15]) deduced from Table 4. It can be observed that combining multiple assessors' labels and thereby having fine-grained relevance levels can result in high discriminative power, and also that student underperforms the others in terms of discriminative power. Thus, it appears that student is not only different from the two lancers: it also fails to provide many significantly different pairs.
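As a small worked example of the SSO computation, the sketch below reproduces the 87.9% figure for all3 vs. 2lancers under nDCG@10; the run-pair identifiers are invented, and only the 2/29/2 split matters.

<pre>
def significance_overlap(pairs_a, pairs_b):
    """Statistical Significance Overlap: shared significant run pairs / union."""
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0

# Toy pair sets with the Table 4(a) all3 vs. 2lancers split:
# 2 pairs unique to all3, 29 shared, 2 unique to 2lancers.
pairs_all3 = {("run01", f"run{i:02d}") for i in range(2, 33)}      # 31 significant pairs
pairs_2lancers = {("run01", f"run{i:02d}") for i in range(4, 35)}  # 31 pairs, 29 shared
print(f"SSO = {significance_overlap(pairs_all3, pairs_2lancers):.1%}")  # 87.9%
</pre>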
4.2 Using Reliable Topics Only
In Section 3, we defined a high-agreement set containing 27 topics and a low-agreement set containing 23 topics. A high-agreement topic is one for which every assessor pair "statistically agreed," in the sense that the 95%CI lower limit of the per-topic κ was positive. A low-agreement topic is one for which at least one assessor pair did not show any agreement beyond chance, and it is therefore deemed unreliable. While, to the best of our knowledge, this kind of close analysis of inter-assessor agreement is rarely done prior to evaluating the submitted runs, removing such topics at an early stage may be a useful practice for ensuring test collection reliability. Hence, in this section, we focus on the all3 qrels, and compare the evaluation outcomes when the full topic set (50 topics) is replaced with just the high-agreement set or even just the low-agreement set. The fact that the high-agreement and low-agreement sets are similar in sample size is highly convenient for comparing them in terms of discriminative power.

4.2.1 System Ranking. Figure 2 visualises the system rankings and the actual mean effectiveness scores according to the three topic sets, for nDCG@10, Q@10, and nERR@10. Again, in each graph, the runs have been sorted by the all3 scores (mean over 50 topics). Table 6 compares the system rankings in terms of Kendall's τ with 95%CIs. The values in bold indicate the cases where the two rankings are statistically not equivalent. It can be observed that while the system rankings by the full set and the high-agreement set are statistically equivalent, those by the full set and the low-agreement set are not. Thus, the properties of the high-agreement topics appear to be dominant in the full topic set.

[Figure 2: Mean effectiveness scores according to different topic sets, with panels (a) Mean nDCG@10, (b) Mean Q@10, and (c) Mean nERR@10. The x-axis represents the run ranking according to all3 (50 topics).]

Table 6: System ranking consistency in terms of Kendall's τ, with 95%CIs, for every pair of topic sets (13 runs).
(a) Mean nDCG@10: all3 (50 topics) vs. all3 (27 high-agreement topics): 0.872 [0.696, 1.048]; all3 (50 topics) vs. all3 (23 low-agreement topics): 0.846 [0.660, 1.032]; all3 (27 high-agreement topics) vs. all3 (23 low-agreement topics): 0.718 [0.450, 0.986]
(b) Mean Q@10: all3 (50 topics) vs. all3 (27 high-agreement topics): 0.923 [0.784, 1.062]; all3 (50 topics) vs. all3 (23 low-agreement topics): 0.846 [0.673, 1.019]; all3 (27 high-agreement topics) vs. all3 (23 low-agreement topics): 0.769 [0.566, 0.972]
(c) Mean nERR@10: all3 (50 topics) vs. all3 (27 high-agreement topics): 0.846 [0.660, 1.032]; all3 (50 topics) vs. all3 (23 low-agreement topics): 0.769 [0.545, 0.994]; all3 (27 high-agreement topics) vs. all3 (23 low-agreement topics): 0.615 [0.319, 0.912]

4.2.2 Statistically Significant Differences across Systems. Table 7 compares the outcomes of statistical significance test results (randomised Tukey HSD with B = 10,000 trials) across the three topic sets in a way similar to Table 4. Note that the two subsets are inherently less discriminative than the full set, as the sample sizes are about half that of the full set. It can be observed that the set of statistically significantly different pairs according to the high-agreement (low-agreement) set is always a subset of the set of statistically significantly different pairs according to the full topic set. More interestingly, the set of statistically significantly different pairs according to the low-agreement set is almost a subset of the set of statistically significantly different pairs according to the high-agreement set: for example, Table 7(b) shows that there is only one system pair for which the low-agreement set obtained a statistically significant difference while the high-agreement set did not in terms of Q@10.

Table 7: Statistical Significance Overlap between two topic sets (α = 0.05). Cells follow the same format as Table 4 (only the first topic set / both / only the second topic set), with the SSO in parentheses.
(a) Mean nDCG@10: all3 (50 topics) vs. all3 (27 high-agreement topics): 4/27/0 (87.1%); all3 (50 topics) vs. all3 (23 low-agreement topics): 22/9/0 (29.0%); all3 (27 high-agreement topics) vs. all3 (23 low-agreement topics): 18/9/0 (33.3%)
(b) Mean Q@10: all3 (50 topics) vs. all3 (27 high-agreement topics): 4/25/0 (86.2%); all3 (50 topics) vs. all3 (23 low-agreement topics): 19/10/0 (34.5%); all3 (27 high-agreement topics) vs. all3 (23 low-agreement topics): 16/9/1 (34.6%)
(c) Mean nERR@10: all3 (50 topics) vs. all3 (27 high-agreement topics): 4/15/0 (78.9%); all3 (50 topics) vs. all3 (23 low-agreement topics): 13/6/0 (31.6%); all3 (27 high-agreement topics) vs. all3 (23 low-agreement topics): 10/5/1 (31.2%)

Table 8: Number of significantly different run pairs deduced from Table 7 (α = 0.05).
all3 (50 topics): nDCG@10 31; Q@10 29; nERR@10 19
high-agreement (27 topics): nDCG@10 27; Q@10 25; nERR@10 15
low-agreement (23 topics): nDCG@10 9; Q@10 10; nERR@10 6

Table 8 shows the number of statistically significantly different pairs for each condition based on Table 7. Again, it can be observed that the high-agreement set is substantially more discriminative than the low-agreement set, despite the fact that the sample sizes are similar. Thus, the results suggest that, from a statistical point of view, a high-agreement topic set is more useful for finding concrete research conclusions than a low-agreement one.
5 CONCLUSIONS AND FUTURE WORK
This paper reported on a case study involving only 13 runs contributed from only three teams. Hence we do not claim that our findings will generalise; we merely hope to apply the same methodology to test collections that will be created for the future rounds of the WWW task and possibly even other tasks. Our main findings using the English NTCIR-13 WWW test collection are as follows:
• Lancer-lancer inter-assessor agreements are statistically significantly higher than lancer-student agreements. The student qrels is less discriminative than the lancers qrels. While the lack of gold data prevents us from concluding which type of assessor is more reliable, these results suggest that hiring lancers has some merit despite the extra cost.
• Different qrels versions based on different (combinations of) assessors can lead to somewhat different system rankings and statistical significance test results. Combining multiple assessors' labels to form fine-grained relevance levels is beneficial in terms of discriminative power.
• Removing 23 low-agreement topics (in terms of inter-assessor agreement) from the full set of 50 topics prior to evaluating runs did not have a major impact on the evaluation results, as the properties of the 27 high-agreement topics are dominant in the full set. However, replacing the high-agreement set with the low-agreement set resulted in a statistically significantly different system ranking, and substantially lower discriminative power. Hence, from a statistical point of view, our results suggest that a high-agreement topic set is more useful for finding concrete research conclusions than a low-agreement one.

ACKNOWLEDGEMENTS
I thank the PLY team (Peng Xiao, Lingtao Li, Yimeng Fan) of my laboratory for developing the PLY relevance assessment tool and collecting the assessments. I also thank the NTCIR-13 WWW task organisers and participants for making this study possible.

REFERENCES
[1] Peter Bailey, Nick Craswell, Arjen P. de Vries, and Ian Soboroff. 2008. Overview of the TREC 2007 Enterprise Track. In Proceedings of TREC 2007.
[2] Peter Bailey, Nick Craswell, Ian Soboroff, Paul Thomas, Arjen P. de Vries, and Emine Yilmaz. 2008. Relevance Assessment: Are Judges Exchangeable and Does It Matter? In Proceedings of ACM SIGIR 2008. 667–674.
[3] Ben Carterette. 2012. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. ACM TOIS 30, 1 (2012).
[4] Ben Carterette and Ian Soboroff. 2010. The Effect of Assessor Errors on IR System Evaluation. In Proceedings of ACM SIGIR 2010. 539–546.
[5] Thomas Demeester, Robin Aly, Djoerd Hiemstra, Dong Nguyen, Dolf Trieschnigg, and Chris Develder. 2014. Exploiting User Disagreement for Web Search Evaluation: An Experimental Approach. In Proceedings of ACM WSDM 2014. 33–42.
[6] Marco Ferrante, Nicola Ferro, and Maria Maistro. 2017. AWARE: Exploiting Evaluation Measures to Combine Multiple Assessors. ACM TOIS 36, 2 (2017).
[7] Joseph L. Fleiss. 1971. Measuring Nominal Scale Agreement among Many Raters. Psychological Bulletin 76, 5 (1971), 378–382.
[8] Donna K. Harman. 2005. The TREC Ad Hoc Experiments. In TREC: Experiment and Evaluation in Information Retrieval, Ellen M. Voorhees and Donna K. Harman (Eds.). The MIT Press, Chapter 4.
[9] Klaus Krippendorff. 2013. Content Analysis: An Introduction to Its Methodology (Third Edition). Sage Publications.
[10] Cheng Luo, Tetsuya Sakai, Yiqun Liu, Zhicheng Dou, Chenyan Xiong, and Jingfang Xu. 2017. Overview of the NTCIR-13 WWW Task. In Proceedings of NTCIR-13.
[11] Eddy Maddalena, Kevin Roitero, Gianluca Demartini, and Stefano Mizzaro. 2017. Considering Assessor Agreement in IR Evaluation. In Proceedings of ACM ICTIR 2017. 75–82.
[12] Olga Megorskaya, Vladimir Kukushkin, and Pavel Serdyukov. 2015. On the Relation between Assessor's Agreement and Accuracy in Gamified Relevance Assessment. In Proceedings of ACM SIGIR 2015. 605–614.
[13] Justus J. Randolph. 2005. Free-Marginal Multirater Kappa (Multirater κ free): An Alternative to Fleiss' Fixed Marginal Multirater Kappa. In Joensuu Learning and Instruction Symposium 2005.
[14] Tetsuya Sakai. 2006. Evaluating Evaluation Metrics Based on the Bootstrap. In Proceedings of ACM SIGIR 2006. 525–532.
[15] Tetsuya Sakai. 2007. Alternatives to Bpref. In Proceedings of ACM SIGIR 2007. 71–78.
[16] Tetsuya Sakai. 2012. Evaluation with Informational and Navigational Intents. In Proceedings of WWW 2012. 499–508.
[17] Tetsuya Sakai. 2014. Metrics, Statistics, Tests. In PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173). 116–163.
[18] Tetsuya Sakai. 2017. Unanimity-Aware Gain for Highly Subjective Assessments. In Proceedings of EVIA 2017.
[19] Ellen Voorhees. 2000. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. Information Processing and Management (2000), 697–716.
[20] Yulu Wang, Garrick Sherman, Jimmy Lin, and Miles Efron. 2015. Assessor Differences and User Preferences in Tweet Timeline Generation. In Proceedings of ACM SIGIR 2015. 615–624.
[21] William Webber, Praveen Chandar, and Ben Carterette. 2012. Alternative Assessor Disagreement and Retrieval Depth. In Proceedings of ACM CIKM 2012. 125–134.