Unanimity-Aware Gain for Highly Subjective Assessments

Tetsuya Sakai, Waseda University, tetsuyasakai@acm.org

Copying permitted for private and academic purposes. EVIA 2017, co-located with NTCIR-13, Tokyo, Japan. © 2017 Copyright held by the author.

ABSTRACT
IR tasks have diversified: human assessments of items such as social media posts can be highly subjective, in which case it becomes necessary to hire many assessors per item to reflect their diverse views. For example, the value of a tweet for a given purpose may be judged by (say) ten assessors, and their ratings could be summed up to define its gain value for computing a graded-relevance evaluation measure. In the present study, we propose a simple variant of this approach, which takes into account the fact that some items receive unanimous ratings while others are more controversial. We generate simulated ratings based on data from a real social-media-based IR task to examine the effect of our unanimity-aware approach on the system ranking and on statistical significance. Our results show that incorporating unanimity can affect statistical significance test results even when its impact on the gain value is kept to a minimum. Moreover, since our simulated ratings do not consider the correlation present in the assessors' actual ratings, our experiments probably underestimate the effect of introducing unanimity into evaluation. Hence, if researchers accept that unanimous votes should be valued more highly than controversial ones, then our proposed approach may be worth incorporating.

CCS CONCEPTS
• Information systems → Retrieval effectiveness;

KEYWORDS
effect sizes; evaluation measures; inter-assessor agreement; p-values; social media; statistical significance

1 INTRODUCTION
In traditional test-collection-based IR experiments, we often rely on our experience, which says that system rankings would remain stable even if the set of document relevance assessments were replaced by another [13]. However, IR tasks have diversified: human assessments of items such as social media posts can be highly subjective, in which case it becomes necessary to hire many assessors per item to reflect their diverse views. For example, the value of a tweet for a given purpose may be judged by (say) ten assessors, and their ratings could be summed up to define its gain value for computing a graded-relevance evaluation measure (e.g. [8, 11]). In the present study, we propose a simple variant of this approach, which takes into account the fact that some items receive unanimous ratings while others are more controversial. We generate simulated ratings based on data from a real social-media-based IR task to examine the effect of our unanimity-aware approach on the system ranking and on statistical significance. Our results show that incorporating unanimity can affect statistical significance test results even when its impact on the gain value is kept to a minimum. Moreover, since our simulated ratings do not consider the correlation present in the assessors' actual ratings, our experiments probably underestimate the effect of introducing unanimity into evaluation. Hence, if researchers accept that unanimous votes should be valued more highly than controversial ones, then our proposed approach may be worth incorporating.

2 RELATED WORK

2.1 Document Relevance Assessments
Due to lack of space, we refer the reader to Sakai [7] for a short overview of studies related to inter-assessor agreement. Below, we briefly discuss two studies that help us explain the novelty of our approach to utilising multiple relevance assessments.

Megorskaya et al. [4] studied the benefit of communication between multiple assessors in the context of gamified relevance assessment for web search evaluation. The premise in their work is that every document needs to finally receive a single relevance level, as a result of a consensus between the assessors, an overruling by a "referee," and so on. This is in contrast to our work, where we are interested in assessment tasks for which there may be no such thing as the correct assessment, so that it is important to preserve different subjective views in the data and in the evaluation.

Turpin et al. [12] propose to use magnitude estimation in document relevance assessments in order to obtain ratio-scale judgments instead of the traditional ordinal- or interval-scale ones, and to interpret the ratio-scale judgments directly as the gain values for computing normalised discounted cumulative gain (nDCG) and expected reciprocal rank (ERR). This is achieved by instructing the assessor to give an arbitrary score to his first document and subsequently to give a "relative" score to each of the remaining documents, where "relative" means "in comparison to the preceding document." While their approach and ours both produce continuous relevance assessments, unanimity across judges was not within the scope of their study.

2.2 Social Media Assessments
Wang et al. [14] examined the effect of assessor differences in the context of the TREC Tweet Timeline Generation task by devising two sets of tweet equivalence classes constructed by different assessors. Their conclusion is similar to that of Voorhees [13], who examined the effect of document relevance assessor differences: despite the substantial differences between the two sets of clusters, the system rankings and the absolute evaluation measure scores based on these two sets were very similar. Sakai et al. [8] used graded-relevance measures to evaluate a community QA answer ranking task; each answer was assessed by four assessors, and its gain value for computing the measures was determined as the sum of the assessors' grades.

More recently, Shang et al. [11] reported on the NTCIR-12 Short Text Conversation task, which is basically a tweet retrieval task: in their Japanese subtask, the sum of scores from ten assessors was used to define the gain value of each tweet. Note that these studies do not take into account whether the assessors are unanimous or not; the sum is all that matters.
2.3 Li and Yoshikawa
The recent work of Li and Yoshikawa [2] is similar in spirit to ours, and deserves a detailed explanation. They consider the problem of assessing the similarity between two documents using many assessors, and propose to incorporate what they call "confusability" into measures such as Pearson's correlation. Specifically, when computing a correlation value, they propose to weight each labelled item i by 1 − c_i, where c_i is a normalised measure of confusability based on the difference (D_i) between the highest and the lowest ratings for item i, and so on.¹ Li and Yoshikawa remark that the same idea can be applied to other measures such as nDCG, although they do not provide any details: here, let us try to faithfully apply their idea to ranked retrieval evaluation based on a group of assessors. Let N be the number of assessors per item, and suppose that each assessor assigns to each item a rating on a scale from 0, 1, ..., D_max. A straightforward way to define the final relevance level or the actual gain value for each item would be to just sum up the ratings [8, 11]: then we would have relevance levels from 0 to N D_max. For any item i with N independent assessments, let RawG_i denote the gain value thus obtained. The above approach of Li and Yoshikawa suggests that we modify each gain value as follows:

    WG_i = (1 − c_i) RawG_i = (1 − D_i / D_max) RawG_i .    (1)

¹ Li and Yoshikawa [2] also considered using standard deviation and entropy to quantify c_i, but the present study focusses on the simplest case that relies on D_i, as we believe that evaluation methods should be as simple as possible.

Table 1: Examples of RawG_i, WG_i, UG_i when D_max = 3.

  Item i   Ratings (N = 5)   RawG_i   D_i   WG_i   UG_i (p = 0.2)   UG_i (p = 0.1)
  Item 1   2 2 2 2 2           10      0    10.0        13              11.5
  Item 2   1 1 2 3 3           10      2     3.3        11              10.5
  Item 3   0 2 2 3 3           10      3     0.0        10              10
  Item 4   1 1 1 1 1            5      0     5.0         8               6.5
  Item 5   0 0 0 0 3            3      3     0.0         3               3
  Item 6   0 0 0 0 2            2      2     0.7         3               2.5
  Item 7   0 0 0 0 1            1      1     0.7         3               2

Let us consider Items 1-3 shown in Table 1, with N = 5 and D_max = 3. Clearly, RawG_1 = RawG_2 = RawG_3 = 10, and, according to Eq. 1, WG_1 = 10, WG_2 = 3.3, WG_3 = 0. Thus Items 2 and 3 are considered worse than (say) Item 4 in Table 1, since WG_4 = RawG_4 = 5. Clearly, a more careful consideration is in order.
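As a concrete illustration, the following minimal Python sketch reproduces the RawG_i, D_i and WG_i columns of Table 1 from the raw ratings. This is an illustrative sketch only; the identifier names are introduced here for exposition and are not part of any released toolkit.

```python
# Illustrative sketch: RawG_i, D_i and WG_i (Eq. 1) for the Table 1 items.
D_MAX = 3

table1_ratings = {
    "Item 1": [2, 2, 2, 2, 2],
    "Item 2": [1, 1, 2, 3, 3],
    "Item 3": [0, 2, 2, 3, 3],
    "Item 4": [1, 1, 1, 1, 1],
    "Item 5": [0, 0, 0, 0, 3],
    "Item 6": [0, 0, 0, 0, 2],
    "Item 7": [0, 0, 0, 0, 1],
}

def raw_gain(ratings):
    """RawG_i: the sum of the N raw ratings."""
    return sum(ratings)

def weighted_gain(ratings, d_max=D_MAX):
    """WG_i of Eq. 1: RawG_i down-weighted by the normalised max-min spread D_i."""
    d_i = max(ratings) - min(ratings)
    return (1.0 - d_i / d_max) * raw_gain(ratings)

for item, ratings in table1_ratings.items():
    d_i = max(ratings) - min(ratings)
    print(item, raw_gain(ratings), d_i, round(weighted_gain(ratings), 1))
```

Running the sketch prints, for instance, 10 2 3.3 for Item 2 and 10 3 0.0 for Item 3, matching Table 1.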
2.4 Maddalena et al.
More recently, at ICTIR 2017, Maddalena et al. [3] proposed an evaluation approach whose motivation is almost identical to ours: they also claim that the distribution of the scores from different assessors should be utilised for IR evaluation. More specifically, they propose to replace the gain value of a document with an interval of gain values, or even with a distribution of gain values, so that the final evaluation measures are also intervals or distributions. They call their measures agreement-aware measures.

In contrast to their novel approaches, our proposal simply utilises the original score distribution across assessors to adjust the gain value of each document so that a traditional evaluation measure can be computed. It remains to be seen how the interval and distribution measures of Maddalena et al. can effectively be utilised in IR evaluation venues such as CLEF, NTCIR and TREC.

3 PROPOSED METHOD
Our proposal is very simple and highly intuitive. Given a constant p (0 ≤ p ≤ 1), let us define the unanimity-aware gain as follows:

    UG_i = RawG_i + p N (D_max − D_i)    (2)

if RawG_i > 0; otherwise UG_i = RawG_i = 0. Here, D_max − D_i is a simple measure of unanimity, where D_i is, as before, the difference between the maximum and the minimum among the N ratings.² The constant p controls the impact of unanimity on the gain. Thus, while we mainly want to reflect RawG_i in our evaluation, we apply an "upgrade" according to the degree of unanimity. When the ratings of an item are perfectly unanimous (i.e., D_i = 0), we give it an extra p N D_max; that is, we pretend that pN extra assessors gave it the highest possible rating.

² Variants are possible, of course: for example, we could obtain the maximum and minimum values after removing outlier ratings.

For Items 1-3 shown in Table 1, p = 0.2 implies that UG_1 = 13, UG_2 = 11, UG_3 = 10; this is clearly more intuitive than the WG_i values. On the other hand, consider Items 5-7 in Table 1: note that if p = 0.2, UG_5 = UG_6 = UG_7 = 3. If this is not desirable, p = 0.1 may be used instead, as shown in the same table; however, we shall discuss how to set an appropriate p elsewhere, with real assessments, in our future work. Hereafter, we only consider a modest impact by letting p = 0.2, as the focus of the present study is to demonstrate that our approach has a practical impact on experimental results even with a small p.
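The unanimity-aware gain of Eq. 2 can be added to the same illustrative sketch as follows (again for exposition only; the function name is ours):

```python
# Illustrative sketch: the unanimity-aware gain UG_i of Eq. 2.
def unanimity_aware_gain(ratings, p=0.2, d_max=3):
    """UG_i = RawG_i + p * N * (D_max - D_i) if RawG_i > 0; otherwise 0."""
    raw_g = sum(ratings)
    if raw_g == 0:
        return 0.0
    n = len(ratings)
    d_i = max(ratings) - min(ratings)
    return raw_g + p * n * (d_max - d_i)

print(unanimity_aware_gain([1, 1, 2, 3, 3], p=0.2))  # 11.0 for Item 2, as in Table 1
print(unanimity_aware_gain([1, 1, 2, 3, 3], p=0.1))  # 10.5 for Item 2, as in Table 1
```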
Our approach suggests a slight departure from traditional IR evaluation at the implementation level as well. In traditional IR, we usually prepare discrete relevance levels (e.g. relevant, highly relevant, etc.) to define the gold standard: we know the number of relevance levels in advance, and we map each relevance level to a gain value at the time of measure calculation. In contrast, our approach suggests that we retain the individual ratings in the test collection, from which gain values can be computed on the fly; there is no longer the notion of a predefined set of relevance levels. It is easy to see that the highest possible value of UG_i is (1 + p) N D_max. Fortunately, there is a readily available IR evaluation tool that accommodates not only relevance-level-based computation but also direct gain-value-based computation, as we shall discuss below.

4 EXPERIMENTS
Let us demonstrate the effect of introducing the unanimity-aware gain in an IR task where the ratings of the items can be highly subjective. To this end, we chose to use the recent NTCIR-12 Short Text Conversation (STC) Chinese subtask data [11] for the following reasons: (1) STC requires the system to return a "reasonable" tweet as a response to a human tweet, and the assessments are expected to be highly subjective; (2) STC was the largest task of NTCIR-12, with 44 runs from 16 teams for the Chinese subtask.³ The STC Chinese test collection contains 100 topics (i.e., input Weibo tweets) with relevance assessments ("qrels") containing the following relevance levels: L0 (judged nonrelevant), L1 (relevant), and L2 (highly relevant).

³ The Japanese subtask actually collected N = 10 individual ratings for each tweet, but had only 25 runs from seven teams. We plan to use this data set as well in a follow-up study.

From the official qrels, we created 15 simulated variants with D_max ∈ {2, 4, 8} and N ∈ {5, 10, 20, 40, 80}, as follows. For each judged tweet of each topic, L0 is replaced with (0, 0, ...), whereas both L1 and L2 are replaced with N simulated ratings obtained by random sampling from a uniform distribution over [0, D_max]. We then compute the unanimity-aware gains using Eq. 2, and evaluate up to the top 10 Weibo tweets from each run. Note that sampling N ratings at random implies that, as N gets large, we are more likely to obtain both 0 and D_max among the N observations, and therefore D_i is more likely to be D_max; i.e., UG_i is more likely to reduce to RawG_i (Eq. 2). In contrast, real ratings from different assessors are probably correlated with one another, which should generally make D_i smaller than in our simulated ratings. Hence, this experiment probably underestimates the impact of introducing the unanimity-aware gain into evaluation.
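For clarity, the simulation step can be sketched as follows. This is our reading of the above procedure, assuming integer ratings drawn independently and uniformly from {0, 1, ..., D_max} and a simple in-memory qrels representation; the actual STC qrels files use a different format.

```python
import random

def simulate_ratings(official_level, n, d_max, rng):
    """Replace one official STC relevance level with N simulated ratings.

    L0 becomes a vector of zeros; L1 and L2 both become N independent
    draws from the uniform distribution over {0, 1, ..., d_max}.
    """
    if official_level == 0:                    # L0: judged nonrelevant
        return [0] * n
    return [rng.randint(0, d_max) for _ in range(n)]

def simulated_qrels_variant(official_qrels, n, d_max, seed=0):
    """official_qrels: {topic_id: {tweet_id: level}} with level in {0, 1, 2}."""
    rng = random.Random(seed)
    return {
        topic: {tweet: simulate_ratings(level, n, d_max, rng)
                for tweet, level in tweets.items()}
        for topic, tweets in official_qrels.items()
    }

# One of the 15 variants (here D_max = 4, N = 10); UG_i is then computed
# from each rating vector using Eq. 2:
# variant = simulated_qrels_variant(official_qrels, n=10, d_max=4)
```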
We use the three official measures from STC: nG@1 (normalised gain at rank 1),⁴ P+ (see below), and nERR (normalised expected reciprocal rank) [11]. While the official STC Chinese subtask used the NTCIREVAL⁵ toolkit by giving a gain value of 1 to each L1-relevant tweet and 3 to each L2-relevant one, we utilised an alternative functionality of the same tool, which enables us to feed the gain values of relevant items directly to it, without considering the number of relevance levels. This feature was already available in NTCIREVAL for the purpose of accommodating the global gain proposed by Sakai and Song [9], an idea for obtaining a real-valued gain for each relevant web page in search result diversification evaluation. In their work, global gain was computed from intent-aware probabilities and per-intent graded relevance assessments.

⁴ Note that neither the Discounting nor the Cumulation of "nDCG" applies at rank 1.
⁵ http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html

P+, an official measure from STC but nevertheless less well known than nDCG and nERR, deserves a brief explanation here. Just like nERR, it is a measure suitable for navigational intents. Just as Average Precision (AP) employs (binary) precision to measure the utility of the top r documents for a user group who abandon the ranked list at r, P+ employs the blended ratio, which combines precision and cumulative gain, for the same purpose. Furthermore, just as AP assumes that the distribution of users abandoning the ranked list is uniform across all relevant documents (even if some of them are not retrieved) [5], P+ assumes that the distribution is uniform over all relevant documents ranked at or above r_p, the preferred rank [11]. Given a ranked list, the preferred rank is the rank of the most relevant document that is closest to the top. In our case, the preferred rank is the rank of the document in the res file that has the highest UG_i value and is closest to the top. Both AP and P+ represent the expected utility over their respective user abandonment distributions.
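To make the above description concrete, the following sketch computes P+ for a single topic according to our reading of the measure, with the blending parameter of the blended ratio set to 1; it is an illustration only, not the NTCIREVAL implementation.

```python
def p_plus(gains, ideal_gains, beta=1.0):
    """Illustrative P+ for one topic (our reading of the measure; beta assumed to be 1).

    gains:       gain values (e.g. UG_i) of the ranked documents, top to bottom,
                 with 0 for documents judged nonrelevant.
    ideal_gains: gain values of all relevant documents, sorted in descending order
                 (used for the ideal cumulative gain).
    """
    if not any(g > 0 for g in gains):
        return 0.0
    # Preferred rank r_p: the highest-gain document closest to the top.
    best = max(gains)
    r_p = next(r for r, g in enumerate(gains, start=1) if g == best)

    score, num_rel, cg, cg_ideal = 0.0, 0, 0.0, 0.0
    for r, g in enumerate(gains[:r_p], start=1):
        cg += g
        cg_ideal += ideal_gains[r - 1] if r - 1 < len(ideal_gains) else 0.0
        if g > 0:
            num_rel += 1
            # Blended ratio at rank r: combines precision and cumulative gain.
            score += (num_rel + beta * cg) / (r + beta * cg_ideal)
    return score / num_rel
```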
Table 2: Comparing the system rankings based on RawG_i vs. UG_i in terms of Kendall's τ, with 95% confidence intervals.

(a) Mean nG@1
  D_max   N=5                 N=10                N=20                N=40                N=80
  2       .987 [.968, 1.007]  .989 [.970, 1.007]  1 [.995, 1.005]     1 [.995, 1.005]     1 [.995, 1.005]
  4       .992 [.976, 1.005]  .983 [.962, 1.004]  .996 [.985, 1.006]  1 [.995, 1.005]     1 [.995, 1.005]
  8       .985 [.968, 1.003]  .985 [.963, 1.008]  .998 [.994, 1.005]  .992 [.978, 1.008]  1 [.995, 1.005]

(b) Mean P+
  D_max   N=5                 N=10                N=20                N=40                N=80
  2       .985 [.964, 1.006]  .994 [.982, 1.005]  .998 [.991, 1.005]  1 [1, 1]            1 [1, 1]
  4       .983 [.965, 1.002]  .992 [.977, 1.005]  .998 [.994, 1.004]  1 [.997, 1.003]     1 [.997, 1.003]
  8       .989 [.972, 1.007]  .992 [.981, 1.005]  .994 [.982, 1.005]  .996 [.986, 1.006]  1 [1, 1]

(c) Mean nERR
  D_max   N=5                 N=10                N=20                N=40                N=80
  2       .996 [.985, 1.005]  .994 [.982, 1.005]  1 [1, 1]            1 [1, 1]            1 [1, 1]
  4       .996 [.986, 1.006]  .994 [.982, 1.005]  .998 [.990, 1.005]  1 [1, 1]            1 [1, 1]
  8       .989 [.977, 1.005]  .992 [.978, 1.005]  .998 [.991, 1.005]  .998 [.990, 1.005]  1 [1, 1]

5 RESULTS AND DISCUSSIONS
Table 2 compares the system rankings based on RawG_i vs. UG_i in terms of Kendall's τ with 95% confidence intervals, for nG@1, P+ and nERR averaged over the 100 official Chinese STC topics. It can be observed that all of the confidence intervals contain τ = 1, meaning that the system rankings based on RawG_i and UG_i are statistically equivalent. However, except where the 95% CIs are "[1, 1]," the two rankings are not identical, even with p = 0.2. Recall also that we should expect to see lower rank correlations if we use real assessors' ratings, with correlations among them.

Probably a more practical concern than the change in the overall system ranking is: does the proposed method affect statistical significance test results? If a researcher is interested in comparing every system pair, then conducting a pairwise test such as the paired t-test repeatedly (without correcting α) is not the correct approach: one elegant solution is to use the randomised Tukey HSD (Honestly Significant Difference) test [1], which is free from distributional assumptions and ensures that the familywise error rate (i.e., the probability of incorrectly obtaining a statistically significant difference for at least one system pair) is α. We use the Discpower⁶ toolkit to conduct the randomised Tukey HSD test on each topic-by-run score matrix, with B = 5,000 trials for each test. The STC Chinese subtask had 16 participating teams, and one run from each team (specifically, the best run in terms of the official Mean nG@1 score) is considered in this analysis, giving us 16 × 15/2 = 120 comparisons. Do RawG_i and UG_i give us similar p-values and similar research conclusions?

⁶ http://research.nii.ac.jp/ntcir/tools/discpower-en.html
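The randomised Tukey HSD procedure itself can be sketched as follows; this is a simplified stand-in for the Discpower toolkit based on our understanding of Carterette [1], and assumes a numpy array as the topic-by-run matrix.

```python
import numpy as np

def randomised_tukey_hsd(scores, b=5000, seed=0):
    """Simplified sketch of the randomised Tukey HSD test.

    scores: topic-by-run matrix (n_topics x n_runs) of per-topic scores.
    Returns an (n_runs x n_runs) matrix of p-values for all run pairs.
    """
    rng = np.random.default_rng(seed)
    means = scores.mean(axis=0)
    observed = np.abs(means[:, None] - means[None, :])  # observed mean differences

    count = np.zeros_like(observed)
    for _ in range(b):
        # Permute the scores within each topic (row) across runs.
        permuted = np.array([rng.permutation(row) for row in scores])
        permuted_means = permuted.mean(axis=0)
        # Largest pairwise difference between run means in the permuted matrix.
        max_diff = permuted_means.max() - permuted_means.min()
        count += (max_diff >= observed)
    return count / b

# p_values = randomised_tukey_hsd(topic_by_run_matrix, b=5000)
# A run pair (i, j) is declared significantly different at alpha = 0.05
# if p_values[i, j] < 0.05; the familywise error rate is controlled at alpha.
```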
Table 3 summarises the discrepancies between the significance test results with RawG_i and those with UG_i: these are the comparisons where the difference is statistically significant at α = 0.05 according to one gain setting but not according to the other. The p-values, absolute score differences (|d_XY|), and effect sizes (ES_HSD) are also shown; ES_HSD is computed by dividing |d_XY| by the residual standard deviation of each experimental condition [6].⁷

⁷ This form of effect size measures the difference between two systems in standard deviation units; unlike the p-value, it is not a function of the sample size.

Table 3: Discrepancies at α = 0.05: p-values (those below α are statistically significant), absolute differences, and effect sizes.

                                               (I) RawG_i                  (II) UG_i
  D_max  N   System pair                      p-value  |d_XY|  ES_HSD     p-value  |d_XY|  ES_HSD
  (a) nG@1
  2      5   MSRSC-C-R1 vs. Grad1-C-R1        .064     .1373   1.808      .034     .1361   2.045
             ICL00-C-R1 vs. Grad1-C-R1        .062     .1378   1.814      .029     .1377   2.069
             cyut-C-R1 vs. HITSZ-C-R1         .064     .1375   1.811      .033     .1363   2.048
         10  ICL00-C-R1 vs. PolyU-C-R1        .043     .1564   1.723      .057     .1452   1.756
         20  Nders-C-R1 vs. picl-C-R2         .051     .1622   1.629      .045     .1632   1.650
  4      5   cyut-C-R1 vs. HITSZ-C-R1         .073     .1390   1.752      .038     .1418   1.953
  8      5   MSRSC-C-R1 vs. Grad1-C-R1        .048     .1481   1.797      .057     .1413   1.821
             ICL00-C-R1 vs. Grad1-C-R1        .052     .1469   1.782      .049     .1434   1.848
             cyut-C-R1 vs. HITSZ-C-R1         .054     .1466   1.779      .024     .1515   1.952
         10  ICL00-C-R1 vs. PolyU-C-R1        .035     .1666   1.668      .055     .1568   1.653
  (b) P+
  2      5   MSRSC-C-R1 vs. PolyU-C-R1        .037     .1272   2.340      .051     .1176   2.431
             Grad1-C-R1 vs. HITSZ-C-R1        .037     .1273   2.342      .059     .1150   2.377
         10  ICL00-C-R1 vs. Grad1-C-R1        .047     .1311   2.251      .070     .1207   2.242
         20  ICL00-C-R1 vs. Grad1-C-R1        .049     .1330   2.193      .053     .1319   2.188
  4      5   MSRSC-C-R1 vs. Grad1-C-R1        .048     .1240   2.332      .068     .1146   2.343
             PolyU-C-R1 vs. HITSZ-C-R1        .076     .1186   2.230      .047     .1197   2.447
  8      5   PolyU-C-R1 vs. HITSZ-C-R1        .082     .1181   2.196      .045     .1210   2.400
         10  ICL00-C-R1 vs. Grad1-C-R1        .032     .1363   2.298      .086     .1223   2.159
             Nders-C-R1 vs. PolyU-C-R1        .035     .1357   2.288      .057     .1272   2.245
  (c) nERR
  2      5   BUPTTeam-C-R4 vs. ITNLP-C-R3     .045     .1347   2.208      .052     .1268   2.283
             MSRSC-C-R1 vs. Grad1-C-R1        .060     .1312   2.151      .046     .1280   2.375
  4      5   OKSAT-C-R1 vs. PolyU-C-R1        .046     .1380   2.157      .055     .1319   2.191
  8      5   MSRSC-C-R1 vs. Grad1-C-R1        .037     .1436   2.151      .052     .1369   2.139
         10  ICL00-C-R1 vs. Grad1-C-R1        .042     .1531   2.005      .052     .1480   2.008

It can be observed that the effect of introducing the unanimity-aware gain cannot be overlooked, even with p = 0.2. For example, when D_max = 2 and N = 5, there are three discrepancies between nG@1 based on RawG_i and nG@1 based on UG_i among the 120 comparisons. On the other hand, as was anticipated in Section 4, the impact of introducing unanimity is not observed for N = 40, 80. Again, with real ratings, which tend to resemble one another and thus make D_i smaller than these random ratings do, we will probably observe a more substantial impact of introducing the unanimity-aware gain into evaluation.

6 CONCLUSIONS AND FUTURE WORK
We proposed a simple and intuitive approach to incorporating the assessors' subjective yet unanimous decisions into gain-value-based retrieval evaluation, and demonstrated that this can affect experimental outcomes. Our results show that incorporating unanimity can affect statistical significance test results even when its impact on the gain value is kept to a minimum. Moreover, since our simulated ratings do not consider the correlation present in the assessors' actual ratings, our experiments probably underestimate the effect of introducing the unanimity-aware gain into evaluation. Hence, if researchers accept that unanimous votes should be valued more highly than controversial ones, then our proposed approach may be worth incorporating. We also demonstrated how the proposed approach of directly feeding gain values to an existing evaluation tool can be accomplished, bypassing the notion of discrete relevance levels.

Following the present study, the proposed unanimity-aware gain approach was applied to the recent NTCIR-13 Short Text Conversation (STC-2) Chinese subtask [10], with p = 0.2. There, according to the randomised Tukey HSD test, three extra statistically significantly different system pairs were obtained by using unanimity-aware nG@1 instead of the traditional nG@1, and one extra statistically significantly different system pair was obtained by using unanimity-aware P+ instead of the traditional P+. Thus the sets of statistically significantly different system pairs according to the unanimity-aware approach were supersets of the corresponding sets based on the traditional gain values. However, there were only N = 3 assessors per topic.

In future work, we would like to apply our approach to diverse social-media-related tasks with many assessors (i.e., a large N), where, unlike in our simulated ratings, correlations among the assessors are present. With real ratings, we expect to observe a larger impact of introducing the unanimity-aware gain on the system ranking and on statistical significance than we did in our simulations.

REFERENCES
[1] Ben Carterette. 2012. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. ACM TOIS 30, 1 (2012).
[2] Jiyi Li and Masatoshi Yoshikawa. 2016. Evaluation with Confusable Ground Truth. In Proceedings of AIRS 2016 (LNCS 9994). 363–369.
[3] Eddy Maddalena, Kevin Roitero, Gianluca Demartini, and Stefano Mizzaro. 2017. Considering Assessor Agreement in IR Evaluation. In Proceedings of ACM ICTIR 2017. 75–82.
[4] Olga Megorskaya, Vladimir Kukushkin, and Pavel Serdyukov. 2015. On the Relation between Assessor's Agreement and Accuracy in Gamified Relevance Assessment. In Proceedings of ACM SIGIR 2015. 605–614.
[5] Stephen E. Robertson. 2008. A New Interpretation of Average Precision. In Proceedings of ACM SIGIR 2008. 689–690.
[6] Tetsuya Sakai. 2014. Statistical Reform in Information Retrieval? SIGIR Forum 48, 1 (2014), 3–12.
[7] Tetsuya Sakai. 2017. The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and Students. In Proceedings of EVIA 2017.
[8] Tetsuya Sakai, Daisuke Ishikawa, Noriko Kando, Yohei Seki, Kazuko Kuriyama, and Chin-Yew Lin. 2011. Using Graded-Relevance Metrics for Evaluating Community QA Answer Selection. In Proceedings of ACM WSDM 2011. 187–196.
[9] Tetsuya Sakai and Ruihua Song. 2011. Evaluating Diversified Search Results Using Per-Intent Graded Relevance. In Proceedings of ACM SIGIR 2011. 1043–1052.
[10] Lifeng Shang, Tetsuya Sakai, Hang Li, Ryuichiro Higashinaka, Yusuke Miyao, Yuki Arase, and Masako Nomoto. 2017. Overview of the NTCIR-13 Short Text Conversation Task. In Proceedings of NTCIR-13.
[11] Lifeng Shang, Tetsuya Sakai, Zhengdong Lu, Hang Li, Ryuichiro Higashinaka, and Yusuke Miyao. 2016. Overview of the NTCIR-12 Short Text Conversation Task. In Proceedings of NTCIR-12. 473–484.
[12] Andrew Turpin, Falk Scholer, Stefano Mizzaro, and Eddy Maddalena. 2015. The Benefits of Magnitude Estimation Relevance Assessments for Information Retrieval Evaluation. In Proceedings of ACM SIGIR 2015. 565–574.
[13] Ellen M. Voorhees. 1998. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. In Proceedings of ACM SIGIR 1998. 315–323.
[14] Yulu Wang, Garrick Sherman, Jimmy Lin, and Miles Efron. 2015. Assessor Differences and User Preferences in Tweet Timeline Generation. In Proceedings of ACM SIGIR 2015. 615–624.