<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Unanimity-Aware Gain for Highly Subjective Assessments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tetsuya Sakai</string-name>
          <email>tetsuyasakai@acm.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Waseda University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>39</fpage>
      <lpage>42</lpage>
      <abstract>
        <p>IR tasks have diversied: human assessments of items such as social media posts can be highly subjective, in which case it becomes necessary to hire many assessors per item to reect their diverse views. For example, the value of a tweet for a given purpose may be judged by (say) ten assessors, and their ratings could be summed up to dene its gain value for computing a graded-relevance evaluation measure. In the present study, we propose a simple variant of this approach, which takes into account the fact that some items receive unanimous ratings while others are more controversial. We generate simulated ratings based on a real social-media-based IR task data to examine the eect of our unanimity-aware approach on the system ranking and on statistical signicance. Our results show that incorporating unanimity can aect statistical signicance test results even when its impact on the gain value is kept to a minimum. Moreover, since our simulated ratings do not consider the correlation present in the assessors' actual ratings, our experiments probably underestimate the eect of introducing unanimity into evaluation. Hence, if researchers accept that unanimous votes should be valued more highly than controversial ones, then our proposed approach may be worth incorporating.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Retrieval effectiveness;</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        In traditional test-collection-based IR experiments, we often rely on
our experience which says that system rankings would remain
stable even if the set of document relevance assessments is replaced
by another [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. However, IR tasks have diversied: human
assessments of items such as social media posts can be highly subjective,
in which case it becomes necessary to hire many assessors per item
to reect their diverse views. For example, the value of a tweet for
a given purpose may be judged by (say) ten assessors, and their
ratings could be summed up to dene its gain value for computing
a graded-relevance evaluation measure (e.g. [
        <xref ref-type="bibr" rid="ref11 ref8">8, 11</xref>
        ]). In the present
study, we propose a simple variant of this approach, which takes
into account the fact that some items receive unanimous ratings
while others are more controversial. We generate simulated
ratings based on a real social-media-based IR task data to examine
the eect of our unanimity-aware approach on the system ranking
and on statistical signicance. Our results show that incorporating
Copying permied for private and academic purposes.
      </p>
      <p>
        EVIA 2017, co-located with NTCIR-13, Tokyo, Japan.
© 2017 Copyright held by the author.
unanimity can aect statistical signicance test results even when
its impact on the gain value is kept to a minimum. Moreover, since
our simulated ratings do not consider the correlation present in
the assessors' actual ratings, our experiments probably
underestimate the effect of introducing unanimity into evaluation. Hence,
if researchers accept that unanimous votes should be valued more
highly than controversial ones, then our proposed approach may
be worth incorporating.
Due to lack of space, we refer the reader to Sakai [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for a short
overview of studies related to inter-assessor agreement. Below, we
briefly discuss two studies that help us to explain the novelty of
our approach to utilising multiple relevance assessments.
      </p>
      <p>
        Megorskaya et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] studied the benet of communication
between multiple assessors in the context of gamied relevance
assessment for web search evaluation. e premise in their work is
that every document needs to nally receive a single relevance level,
as a result of a consensus between the assessors or an overruling
by a “referee” etc. is is in contrast to our work, where we are
interested in assessment tasks where there may be no such thing
as the correct assessment, and therefore it is important to preserve
dierent subjective views in the data and in evaluation.
      </p>
      <p>
        Turpin et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] propose to use magnitude estimation in
document relevance assessments in order to obtain ratio-scale judgments
instead of the traditional ordinal- or interval-scale ones, and to
interpret the ratio-scale judgments directly as the gain values for
computing normalised discounted cumulative gain (nDCG) and
expected reciprocal rank (ERR). This is achieved by instructing
the assessor to give an arbitrary score to his first document and
subsequently to give a “relative” score to each of the remaining
documents, where “relative” means “in comparison to the preceding
document.” While their approach and ours both produce
continuous relevance assessments, unanimity across judges was not within
the scope of their study.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Social Media Assessments</title>
      <p>
        Wang et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] examined the eect of assesser dierences in the
context of the TREC Tweet Timeline Generation task, by devising
two sets of tweet equivalence classes constructed by dierent
assessors. eir conclusion is similar to that of Voorhees [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] who
examined the effect of document relevance assessor differences:
despite the substantial differences in the two sets of clusters, system
rankings and the absolute evaluation measure scores based on these
two sets were very similar. Sakai et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used graded-relevance
measures to evaluate a community QA answer ranking task; each
answer was assessed by four assessors, and its gain value for
computing the measures was determined as the sum of the assessors' grades.
More recently, Shang et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] reported on the NTCIR-12 Short
Text Conversation task, which is basically a tweet retrieval task: in
their Japanese subtask, the sum of scores from ten assessors was
used to define the gain value of each tweet. Note that these studies
do not take into account whether the assessors are unanimous or
not; the sum is all that matters.
The recent work of Li and Yoshikawa [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is similar in spirit to ours,
and deserves a detailed explanation. They consider the problem
of assessing the similarity between two documents using many
assessors, and propose to incorporate what they call
“confusability” into measures such as Pearson's correlation. Specifically, when
computing a correlation value, they propose to weight each labelled
item i by 1 − ci, where ci is a normalised measure of confusability
based on the difference (Di) between the highest and the lowest
ratings for item i, and so on1. Li and Yoshikawa remark that the
same idea can be applied to other measures such as nDCG, although
they do not provide any details: here, let us try to faithfully apply
their idea to ranked retrieval evaluation based on a group of
assessors. Let N be the number of assessors per item, and suppose
that each assessor assigns to each item a rating on a scale of
0, 1, …, Dmax. A straightforward way to define the final relevance
level or the actual gain value for each item would be to just sum
up the ratings [
        <xref ref-type="bibr" rid="ref11 ref8">8, 11</xref>
        ]: then we would have relevance levels from
0 to N·Dmax. For any item i with N independent assessments, let
RawGi denote the gain value thus obtained. The above approach
of Li and Yoshikawa suggests that we modify each gain value as
follows: WGi = (1 − ci)·RawGi    (1).
      </p>
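As an illustration, the confusability weighting described above could be implemented as follows; this is a minimal sketch, and the function name and the normalisation ci = Di/Dmax are our assumptions (Li and Yoshikawa also consider standard deviation and entropy for ci):

```python
# Confusability-weighted gain for one item, following the adaptation of
# Li and Yoshikawa's idea described above. ASSUMPTION: the normalised
# confusability is taken to be c_i = D_i / Dmax, the simplest choice.

def confusability_weighted_gain(ratings, d_max):
    """ratings: the N per-assessor ratings of one item, each in 0..d_max."""
    raw_g = sum(ratings)                # RawG_i: sum of the N ratings
    d_i = max(ratings) - min(ratings)   # D_i: spread of the ratings
    c_i = d_i / d_max                   # normalised confusability (assumed form)
    return (1 - c_i) * raw_g            # weight the gain by 1 - c_i

# Unanimous ratings keep their full gain; maximally controversial
# ratings lose it entirely.
print(confusability_weighted_gain([2, 2, 2, 2, 2], 2))  # 10.0
print(confusability_weighted_gain([0, 2, 2, 2, 2], 2))  # 0.0
```

Note how a single dissenting 0 among four top ratings wipes out the gain under this weighting; this behaviour is what the additive formulation proposed later avoids.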
      <p>
        More recently, at ICTIR 2017, Maddalena et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed an
evaluation approach whose motivation is almost identical as ours:
they also claim that the distribution of the scores from dierent
assessors should be utilised for IR evaluation. More specically,
they propose to replace a gain value of a document with an interval
of gain values or even with a distribution of gain values, so that the
1 Li and Yoshikawa [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] also considered using standard deviation and entropy to quantify
ci , but the present study focusses on the simplest case that relies on Di as we believe
that evaluation methods should be as simple as possible.
nal evaluation measures are also intervals or distributions. ey
call their measures agreement-aware measures.
      </p>
      <p>In contrast to their novel approaches, our proposal simply utilises
the original score distribution across assessors to adjust the gain
value of each document so that a traditional evaluation measure
can be computed. It remains to be seen how the interval and
distribution measures of Maddalena et al. can eectively be utilised in
IR evaluation venues such as CLEF, NTCIR and TREC.
3</p>
    </sec>
    <sec id="sec-4">
      <title>PROPOSED METHOD</title>
      <p>Our proposal is very simple and highly intuitive. Given a constant
p (0 ≤ p ≤ 1), let us define the unanimity-aware gain as follows:
UGi = RawGi + pN (Dmax − Di)    (2)</p>
      <p>if RawGi &gt; 0; otherwise UGi = RawGi = 0. Here, Dmax − Di is a
simple measure of unanimity, where Di is, as before, the difference
between the maximum and the minimum among the N ratings2.
The constant p controls the impact of unanimity on the gain. Thus,
while we mainly want to reflect RawGi in our evaluation, we apply
an “upgrade” according to the degree of unanimity. When the
ratings of an item are perfectly unanimous (i.e., Di = 0), we are
giving it an extra pN·Dmax; that is, we shall pretend that pN extra
assessors gave it the highest possible rating.</p>
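Eq. 2 translates directly into code; the following is a minimal sketch (function and variable names are ours):

```python
# Unanimity-aware gain (Eq. 2): UG_i = RawG_i + p * N * (Dmax - D_i)
# when RawG_i > 0, and UG_i = RawG_i = 0 otherwise.

def unanimity_aware_gain(ratings, d_max, p=0.2):
    """ratings: the N per-assessor ratings of one item, each in 0..d_max."""
    raw_g = sum(ratings)                  # RawG_i
    if raw_g == 0:
        return 0                          # rated zero by every assessor
    n = len(ratings)
    d_i = max(ratings) - min(ratings)     # D_i: max minus min rating
    return raw_g + p * n * (d_max - d_i)  # apply the unanimity "upgrade"

# Perfect unanimity (D_i = 0) earns the full upgrade p*N*Dmax, as if
# p*N extra assessors gave the highest rating:
print(unanimity_aware_gain([2, 2, 2, 2, 2], d_max=2, p=0.2))  # 10 + 0.2*5*2 = 12.0
print(unanimity_aware_gain([0, 2, 2, 2, 2], d_max=2, p=0.2))  # 8 + 0.2*5*0 = 8.0
```

Unlike the multiplicative weighting sketched earlier, the upgrade is additive, so a controversial item keeps its raw gain and only forgoes the bonus.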
      <p>For Items 1–3 shown in Table 1, p = 0.2 implies that UG1 =
13, UG2 = 11, UG3 = 10; this is clearly more intuitive than the WGi
values. On the other hand, consider Items 5–7 in Table 1: note that
if p = 0.2, UG5 = UG6 = UG7 = 3. If this is not desirable, p = 0.1
may be used instead, as shown in the same table; however, we shall
discuss how to set an appropriate p elsewhere with real assessments
in our future work. Hereafter, we only consider a modest impact by
letting p = 0.2, as the focus of the present study is to demonstrate
that our approach has a practical impact on experimental results
even with a small p.</p>
      <p>Our approach suggests a slight departure from traditional IR
evaluation at the implementation level as well. In traditional IR,
we usually prepare discrete relevance levels (e.g. relevant, highly
relevant, etc.) to define the gold standard: we know the number
of relevance levels in advance, and we map each relevance level
to a gain value at the time of measure calculation. In contrast, our
approach suggests that we retain the individual ratings in the test
collection, from which gain values can be computed on the fly; there
is no longer the notion of a predefined set of relevance levels. It is
easy to see that the highest possible value of UGi is (1 + p)N·Dmax.
Fortunately, there is a readily available IR evaluation tool that
accommodates not only relevance-level-based computation but also
direct gain-value-based computation, as we shall discuss below.</p>
    </sec>
    <sec id="sec-5">
      <title>EXPERIMENTS</title>
      <p>
        Let us demonstrate the eect of introducing unanimity-aware gain
to an IR task where the ratings of the items can be highly
subjective. To this end, we chose to use the recent NTCIR-12 Short Text
Conversation (STC) Chinese subtask data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for the following
reasons: (1) STC requires the system to return a “reasonable” tweet
as a response to a human tweet, and the assessments are expected
to be highly subjective; (2) STC was the largest task of NTCIR-12,
with 44 runs from 16 teams for the Chinese subtask3. The STC
Chinese test collection contains 100 topics (i.e., input Weibo tweets)
with relevance assessments (“qrels”) containing the following
relevance levels: L0 (judged nonrelevant); L1 (relevant); and L2 (highly
relevant).
2 Variants are of course possible: for example, we could obtain the maximum and
minimum values after removing outlier ratings.
      </p>
      <p>From the ocial qrels, we created 15 simulated variants with
Dmax 2 f2; 4; 8g and N 2 f5; 10; 20; 40; 80g, as follows. For each
judged tweet of each topic, L0 is replaced with ¹0; 0; : : :º; whereas,
both L1 and L2 are replaced with N simulated ratings obtained by
random sampling from a uniform distribution over »0; Dmax ¼. We
then compute the unanimity-aware gains using Eq. 2, and evaluate
up to top 10 Weibo tweets from each run. Note that randomly
sampling N times implies that as N gets large, we are more likely
to obtain both 0 and Dmax among the N observations and therefore
Di is more likely to be Dmax , i.e., UGi is more likely to reduce
to RawGi (Eq. 2). In contrast, real ratings of dierent assessors
are probably correlated with one another, which should generally
make Di smaller than our simulated ratings. Hence, this experiment
probably underestimates the impact of introducing unanimity-aware
gain into evaluation.</p>
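The simulation procedure above can be sketched as follows (a toy illustration; qrels parsing and the evaluation-tool interface are omitted, and the function name is ours):

```python
import random

# Simulate N per-assessor ratings for one judged tweet, as described above:
# L0 tweets get all-zero ratings; L1/L2 tweets get N ratings drawn
# uniformly at random from the integers 0..Dmax.
def simulate_ratings(relevance_level, n, d_max, rng=random):
    if relevance_level == 0:                          # L0: judged nonrelevant
        return [0] * n
    return [rng.randint(0, d_max) for _ in range(n)]  # L1/L2: uniform draws

# As N grows, both 0 and Dmax are increasingly likely to appear among the
# N draws, so D_i tends to Dmax and the unanimity upgrade tends to vanish:
rng = random.Random(0)
for n in (5, 10, 20, 40, 80):
    ratings = simulate_ratings(1, n, 2, rng)
    print(n, max(ratings) - min(ratings))   # D_i for one simulated tweet
```

Real assessors' ratings would be correlated, keeping the spread Di smaller than these independent draws, which is why the experiment is described as an underestimate.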
      <p>
        We use the three ocial measures from STC: nG@1 (normalised
gain at rank 1)4, P+ (see below), and nERR (normalised expected
reciprocal rank) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. While the ocial STC Chinese subtask used the
NTCIREVAL5 toolkit by giving a gain value of 1 to each L1-relevant
tweet and 3 to each L2-relevant one, we utilised an alternative
functionality of the same tool, which enables us to feed gain values of
relevant items directly to it without considering the number of
relevance levels. is feature was already available in NTCIREVAL for
the purpose of accommodating the global gain proposed by Sakai
and Song [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which is an idea for obtaining a real-valued gain
for each relevant web page for search result diversification
evaluation. In their work, global gain was computed from intent-aware
probabilities and per-intent graded relevance assessments.
      </p>
      <p>
        P+, an ocial measure from STC but nevertheless less
wellknown than nDCG and nERR, deserves a brief explanation here.
Just like nERR, it is a measure suitable for navigational intents. Just
as Average Precision (AP) employs (binary) precision to measure
the utility of the top r documents for a user group who abandon
the ranked list at r , P+ employs the blended ratio, which combines
precision and cumulative gain, for the same purpose. Furthermore,
just as AP assumes that the distribution of users abandoning the
ranked list is uniform across all relevant documents (even if some
of them are not retrieved) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], P+ assumes that the distribution
is uniform over all relevant documents ranked at or above rp, the
preferred rank [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Given a ranked list, the preferred rank is the
rank of the most relevant document that is closest to the top. In our
case, the preferred rank is the rank of the document in the res file
that has the highest UGi value and is closest to the top. Both AP
and P+ represent the expected utility over their user abandonment
distributions.
3 The Japanese subtask actually collected N = 10 individual ratings for each tweet,
but had only 25 runs from seven teams. We plan to use this data set as well in a
follow-up study.
4 Note that neither the Discounting nor the Cumulation of “nDCG” applies at rank 1.
5 http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html
      </p>
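The preferred rank used by P+ is straightforward to compute from the gain values of a ranked list; a minimal sketch (1-based ranks, function name ours):

```python
# Preferred rank r_p: the rank of the most relevant document (here, the
# highest gain value) that is closest to the top of the ranked list.
def preferred_rank(gains):
    """gains: gain values (e.g. UG_i) of a ranked list, top first."""
    best = max(gains)
    return gains.index(best) + 1  # earliest position achieving the maximum

# The maximum gain 9 first appears at rank 3, so r_p = 3:
print(preferred_rank([5, 2, 9, 9, 0]))  # 3
```

P+ then averages the blended ratio over ranks 1..r_p under its uniform abandonment assumption; only the preferred-rank step is shown here.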
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND DISCUSSIONS</title>
      <p>Table 2 compares the system rankings based on RawGi vs. UGi
in terms of Kendall's τ with 95% confidence intervals, for nG@1,
P+ and nERR averaged over the 100 official Chinese STC topics. It
can be observed that all of the upper confidence limits are above
one, meaning that the system rankings based on RawGi and UGi
are statistically equivalent. However, except where the 95% CIs are
“[1, 1],” the two rankings are not identical, even with p = 0.2. Recall
also that we should expect to see lower rank correlations if we use
real assessors' ratings with correlations among them.</p>
      <p>
        Probably a more practical concern than the change in the
overall system ranking is: does the proposed method affect statistical
significance test results? If a researcher is interested in
comparing every system pair, then conducting a pairwise test such as the
paired t-test repeatedly (without correcting α) is not the correct
approach: one elegant solution would be to use the randomised
Tukey HSD (Honestly Significant Difference) test [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which is
free from distributional assumptions and ensures that the
family-wise error rate (i.e., the probability of incorrectly obtaining a
statistically significant difference for at least one system pair) is α.
We use the Discpower6 toolkit to conduct the randomised Tukey
HSD test on each topic-by-run score matrix, with B = 5,000 trials
for each test. The STC Chinese subtask had 16 participating teams,
and one run from each team (specifically, the best run in terms of
the official Mean nG@1 score) is considered in this analysis, giving us
16 × 15 / 2 = 120 comparisons. Do RawGi and UGi give us similar
p-values and similar research conclusions?
      </p>
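The randomisation idea behind the test can be sketched as follows; this is a simplified illustration of the principle, not the Discpower implementation (names and parameters are ours):

```python
import random

# Simplified randomised Tukey HSD sketch: under the null hypothesis the
# system labels are exchangeable within each topic, so each trial permutes
# every topic's scores across systems and records the largest pairwise
# difference in mean scores. Comparing every observed pairwise difference
# against one shared null distribution controls the family-wise error rate.
def randomised_tukey_hsd(matrix, b=1000, alpha=0.05, rng=random):
    n_sys = len(matrix[0])  # matrix: one row per topic, one column per system

    def mean_diffs(m):
        means = [sum(col) / len(m) for col in zip(*m)]
        return [abs(means[i] - means[j])
                for i in range(n_sys) for j in range(i + 1, n_sys)]

    max_diffs = []
    for _ in range(b):
        shuffled = []
        for row in matrix:
            row = list(row)
            rng.shuffle(row)             # permute scores within each topic
            shuffled.append(row)
        max_diffs.append(max(mean_diffs(shuffled)))
    max_diffs.sort()
    threshold = max_diffs[int((1 - alpha) * b)]  # (1 - alpha) null quantile
    # a pair is significant if its observed mean difference beats the
    # threshold; pairs are ordered (0,1), (0,2), ..., (n-2, n-1)
    return [d > threshold for d in mean_diffs(matrix)]

# Example: system 0 beats two equal rivals on every one of 20 topics.
print(randomised_tukey_hsd([[1, 0, 0]] * 20, b=200, rng=random.Random(0)))
```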
      <p>
        Table 3 summarises the discrepancies between the significance
test results with RawGi and those with UGi: these are the
comparisons where the difference is statistically significant at α = 0.05
according to one while not significant according to the other.
p-values, absolute score differences (|dXY|), and effect sizes (ESHSD)
are also shown; ESHSD is computed by dividing |dXY| by the
residual standard deviation of each experimental condition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]7. It can
be observed that the effect of introducing unanimity-aware gain
cannot be overlooked, even with p = 0.2.
6 http://research.nii.ac.jp/ntcir/tools/discpower-en.html
7 This form of effect size measures the difference between two systems in standard
deviation units; unlike the p-value, it is not a function of the sample size.
      </p>
      <p>For example, when
Dmax = 2 and N = 5, there are three discrepancies between nG@1
based on RawGi and that based on UGi among the 120 comparisons.
However, as anticipated in Section 4, the impact of introducing
unanimity is not observed for N = 40 or 80.
Again, with real ratings that tend to resemble one another and
make Di smaller than these random ratings do, we will probably
observe a more substantial impact of introducing unanimity-aware
gain into evaluation.</p>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>We proposed a simple and intuitive approach to incorporating the
assessors' subjective yet unanimous decisions into gain-value-based
retrieval evaluation, and demonstrated that this will affect
experimental outcomes. Our results show that incorporating unanimity
can affect statistical significance test results even when its impact
on the gain value is kept to a minimum. Moreover, since our
simulated ratings do not consider the correlation present in the assessors'
actual ratings, our experiments probably underestimate the effect
of introducing unanimity-aware gain into evaluation. Hence, if
researchers accept that unanimous votes should be valued more
highly than controversial ones, then our proposed approach may
be worth incorporating. We also demonstrated how the proposed
approach of directly feeding gain values to an existing evaluation
tool can be accomplished, while bypassing the notion of discrete
relevance levels.</p>
      <p>
        Following the present study, the proposed unanimity-aware gain
approach was applied to the recent NTCIR-13 Short Text
Conversation (STC-2) Chinese subtask [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], with p = 0.2. There, according
to the randomised Tukey HSD test, three extra statistically
significantly different system pairs were obtained by using
unanimity-aware nG@1 instead of the traditional nG@1; one extra statistically
significantly different system pair was obtained by using
unanimity-aware P+ instead of the traditional P+. Thus the sets of statistically
significantly different system pairs according to the
unanimity-aware approach were supersets of the corresponding sets based
on the traditional gain values. However, there were only N = 3
      </p>
      <p>In future work, we would like to apply our approach to diverse</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ben</given-names>
            <surname>Carterette</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Multiple testing in statistical analysis of systems-based information retrieval experiments</article-title>
          .
          <source>ACM TOIS 30</source>
          ,
          <issue>1</issue>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jiyi</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Masatoshi</given-names>
            <surname>Yoshikawa</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Evaluation with Confusable Ground Truth</article-title>
          .
          <source>In Proceedings of AIRS 2016 (LNCS 9994)</source>
          .
          <fpage>363</fpage>
          -
          <lpage>369</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Eddy</given-names>
            <surname>Maddalena</surname>
          </string-name>
          , Kevin Roitero, Gianluca Demartini, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Considering Assessor Agreement in IR Evaluation</article-title>
          .
          <source>In Proceedings of ACM ICTIR</source>
          <year>2017</year>
          .
          <volume>75</volume>
          -
          <fpage>82</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Olga</given-names>
            <surname>Megorskaya</surname>
          </string-name>
          , Vladimir Kukushkin, and
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>On the Relation between Assessor's Agreement and Accuracy in Gamified Relevance Assessment</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2015</year>
          .
          <volume>605</volume>
          -
          <fpage>614</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Stephen E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>A New Interpretation of Average Precision</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2008</year>
          .
          <volume>689</volume>
          -
          <fpage>690</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2014</year>
          . Statistical Reform in Information Retrieval?
          <source>SIGIR Forum 48</source>
          ,
          <issue>1</issue>
          (
          <year>2014</year>
          ),
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and Students</article-title>
          .
          <source>In Proceedings of EVIA</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          , Daisuke Ishikawa, Noriko Kando, Yohei Seki, Kazuko Kuriyama, and
          <string-name>
            <surname>Chin-Yew Lin</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Using Graded-Relevance Metrics for Evaluating Community QA Answer Selection</article-title>
          .
          <source>In Proceedings of ACM WSDM</source>
          <year>2011</year>
          .
          <volume>187</volume>
          -
          <fpage>196</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ruihua</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Evaluating Diversied Search Results Using Per-Intent Graded Relevance</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2011</year>
          .
          <volume>1043</volume>
          -
          <fpage>1052</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Lifeng</given-names>
            <surname>Shang</surname>
          </string-name>
          , Tetsuya Sakai,
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ryuichiro</given-names>
            <surname>Higashinaka</surname>
          </string-name>
          , Yusuke Miyao, Yuki Arase, and
          <string-name>
            <given-names>Masako</given-names>
            <surname>Nomoto</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the NTCIR-13 Short Text Conversation Task</article-title>
          .
          <source>In Proceedings of NTCIR-13.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Lifeng</given-names>
            <surname>Shang</surname>
          </string-name>
          , Tetsuya Sakai, Zhengdong Lu,
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ryuichiro</given-names>
            <surname>Higashinaka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yusuke</given-names>
            <surname>Miyao</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Overview of the NTCIR-12 Short Text Conversation Task</article-title>
          .
          <source>In Proceedings of NTCIR-12</source>
          .
          <fpage>473</fpage>
          -
          <lpage>484</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Turpin</surname>
          </string-name>
          , Falk Scholer, Stefano Mizzaro, and
          <string-name>
            <given-names>Eddy</given-names>
            <surname>Maddalena</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>e Benets of Magnitude Estimation Relevance Assessments for Information Retrieval Evaluation</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2015</year>
          .
          <volume>565</volume>
          -
          <fpage>574</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Ellen M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>1998</year>
          .
          <volume>315</volume>
          -
          <fpage>323</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Yulu</given-names>
            <surname>Wang</surname>
          </string-name>
          , Garrick Sherman,
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Miles</given-names>
            <surname>Efron</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Assessor Dierences and User Preferences in Tweet Timeline Generation</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2015</year>
          .
          <volume>615</volume>
          -
          <fpage>624</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>