<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How to Robustly Combine Judgements from Crowd Assessors with AWARE</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Ferrante</string-name>
          <email>ferrante@math.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Maistro</string-name>
          <email>maistro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padua</institution>
          ,
          <addr-line>Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics, University of Padua</institution>
          ,
          <addr-line>Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose the Assessor-driven Weighted Averages for Retrieval Evaluation (AWARE) probabilistic framework, a novel methodology for dealing with multiple crowd assessors, who may be contradictory and/or noisy. By modeling relevance judgements and crowd assessors as sources of uncertainty, AWARE directly combines the performance measures computed on the ground-truth generated by the crowd assessors instead of adopting some classification technique to merge the labels produced by them. We propose several unsupervised estimators that instantiate the AWARE framework and we compare them with Majority Vote (MV) and Expectation Maximization (EM), showing that AWARE approaches improve both in correctly ranking systems and in predicting their actual performance scores.</p>
      </abstract>
      <kwd-group>
        <kwd>crowdsourcing</kwd>
        <kwd>unsupervised estimators</kwd>
        <kwd>AWARE</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Ground-truth is central to the data processing area, as in top-k ranking in
databases, information retrieval, natural language processing, video and image
processing, information extraction and many others. Although ground-truth
enables the scoring and comparison of algorithms with respect to human
judgments, creating a dataset and, in particular, gathering relevance assessments is
an extremely demanding activity; therefore, there is an increasing interest in
more effective and affordable ways of gathering assessments [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Crowdsourcing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] has emerged as a viable option for ground-truth creation
since it makes it possible to cheaply collect multiple assessments for each task. However, it
raises many questions regarding the quality of the collected assessments.
Therefore, in order to obtain a ground-truth good enough to be used for evaluation
purposes, the possibility of discarding low-quality assessors and/or combining
them with more or less sophisticated algorithms has been considered.
      </p>
      <p>The problem of merging multiple crowd assessors has been addressed mostly
from a classification point of view, with traditional approaches which focus
mainly on how to select assessors and/or discard low-quality assessors and how
to merge judgments from multiple assessors. We can consider this as a kind of
"upstream" approach, because the aggregated ground-truth is created before
systems are evaluated and performance scores are computed.</p>
      <p>In this paper, we address the problem of ground-truth creation from a new
angle, i.e. we investigate how to estimate performance measures in a way that is more
robust to crowd assessors. In particular, we seek a better estimation of the true
expected value of a performance measure, by leveraging its multiple observations,
generated separately by the relevance judgements of each crowd assessor. We
can consider this as a kind of "downstream" approach, since the aggregation
happens after performance measures have been computed.</p>
      <p>The main intuition behind our approach is that the choice of
the "best" relevance judgments, made in advance at the pool level, may have a
diverse impact on different systems and on various performance measures. Indeed,
systems rank the same documents differently and therefore the same correctly
labelled or mis-labelled documents impact the performance of different systems
in different ways. Therefore, we propose the Assessor-driven Weighted Averages
for Retrieval Evaluation (AWARE) probabilistic framework, which allows us to
combine multiple versions of a performance measure, computed from the
ground-truth created by each crowd assessor, into a single composite measure, referred
to as the AWARE version of it. The AWARE framework specifies how performance
measures have to be merged on the basis of the estimated crowd assessor
accuracies and we propose several unsupervised estimators of such accuracies. The
experimentation shows that AWARE approaches improve in terms of capability
of correctly ranking systems and predicting their actual performance scores.</p>
      <p>The paper is organized as follows: Section 2 introduces the AWARE
framework; Section 3 gives an intuitive overview of several unsupervised estimators
for determining the assessors' accuracies; Section 4 carries out the experimental
evaluation using TREC collections; finally, Section 5 draws some conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>The AWARE Framework</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we introduced the following definitions: let D and T be a set of documents
and a set of topics, respectively; let (REL, ⪯) be a totally ordered set of relevance
degrees. For each pair (t, d) ∈ T × D, the ground-truth GT is a map which assigns
a relevance degree rel ∈ REL to a document d with respect to a topic t.
      </p>
      <p>In order to cope with and leverage crowd assessors, we assume that the
relevance of a document is not deterministically known, but is described by a
probability distribution: instead of specifying a single value from REL as the result
of the relevance assessment, we model the uncertainty entailed in the assessment
process as a whole distribution of possible values associated with each (t, d) pair.
Furthermore, we assume that the ability of the crowd assessors is stochastically
determined by a probability assigned to them, which we call their accuracy.</p>
      <p>More precisely, we assume that there exists a probability space (Ω, F, P),
which provides the source of randomness and encompasses the judgements done
by all the possible crowd assessors, on all the possible documents for any possible
topic. Considering this space, we can extend the definition of the ground-truth
as GT : Ω × T × D → REL. In this way, to any pair (t, d) we associate a
random variable GT(·, t, d) with values in REL, whose distribution describes the
relevance of the document d with respect to the topic t.</p>
      <p>Let 𝒲 = {W<sub>1</sub>, …, W<sub>l</sub>} be a finite set of crowd assessors and let us assume
that there exists a random variable, W : Ω × T → 𝒲, whose distribution identifies
the ability of a single crowd assessor with respect to any given topic. We call
a<sub>k</sub>(t) = P[T = t, W = W<sub>k</sub>] the accuracy of crowd assessor W<sub>k</sub> in assessing
topic t and we assume that a<sub>k</sub>(t) is determined by the expected ability she/he
demonstrates in assessing all the possible documents for that topic.</p>
      <p>The easiest way to jointly cope with these random objects, i.e. ground-truth
and crowd assessors, is to consider their expectations. The expected relevance of
document d for topic t, by the law of total expectation, is given by
        <disp-formula><tex-math>\mathbb{E}\big[GT(t,d)\big] = \mathbb{E}\Big[\mathbb{E}\big[GT(t,d) \mid W\big]\Big] = \sum_{k=1}^{l} \mathbb{E}\big[GT(t,d) \mid W = W_k\big]\, a_k(t)\,.</tex-math></disp-formula>
Then, for a performance measure m(·), we can proceed in a similar way and define
its AWARE version as its expectation with respect to P:
        <disp-formula><tex-math>\text{aware-}m\big(t, r_t\big) = \mathbb{E}\big[\mu(\hat{r}_t)\big] = \sum_{k=1}^{l} \mathbb{E}\big[\mu(\hat{r}_t) \mid W = W_k\big]\, a_k(t)\,,</tex-math></disp-formula>
where μ is the scoring function associated with the performance measure m(·) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
and r̂<sub>t</sub> is the judged run.
      </p>
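      <p>As an illustration of the last formula, the following is a minimal Python sketch, not the implementation used in the experiments. It assumes that each conditional expectation is approximated by the score of the run judged with the labels of one crowd assessor and that the accuracies are rescaled to sum to one; the function name <monospace>aware_score</monospace> and the example values are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def aware_score(scores, accuracies):
    """Combine per-assessor scores of one run on one topic into an AWARE score.

    scores[k]     : measure value computed with the judgements of assessor k
    accuracies[k] : non-negative accuracy of assessor k on this topic
    Rescaling the accuracies to sum to one is an assumption of this sketch.
    """
    if len(scores) != len(accuracies):
        raise ValueError("one accuracy per assessor is required")
    total = sum(accuracies)
    if total <= 0:
        raise ValueError("accuracies must have a positive sum")
    # Weighted average of the per-assessor scores.
    return sum(s * a for s, a in zip(scores, accuracies)) / total


# Hypothetical AP values of one run under three crowd assessors.
ap_per_assessor = [0.30, 0.50, 0.40]
uniform = aware_score(ap_per_assessor, [1.0, 1.0, 1.0])
weighted = aware_score(ap_per_assessor, [0.2, 0.6, 0.2])
```
]]></preformat>
      <p>With uniform accuracies the sketch reduces to the plain mean of the per-assessor scores, which corresponds to combining assessors in the absence of any information about their quality.</p>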
      <p>We estimate the first term by μ(r̂<sub>t</sub><sup>k</sup>), where r̂<sub>t</sub><sup>k</sup> represents the judged run
under the assessments done by the crowd assessor W<sub>k</sub>. However, the estimation
of the accuracies a<sub>k</sub>(t) = P[T = t, W = W<sub>k</sub>] is somewhat more problematic.
We therefore take a random assessor as a comparison point. In the case of
binary relevance, i.e. when REL = {0, 1}, an assessor W<sub>k</sub> is a random assessor
of parameter p ∈ [0, 1] if, for any pair (t, d), the conditional random variables
GT(t, d) | W = W<sub>k</sub> are distributed as Bin(1, p), where Bin(1, p) denotes a Binomial random
variable with parameter p, and are mutually independent.</p>
      <p>A random assessor, of any possible parameter p, is the prototype of a "bad"
or at least a "shallow" assessor, since p is the same for any possible pair (t, d).
The basic idea that we will apply in the next section is that the farther a crowd
assessor is from the random ones, the better she/he is and the higher her/his
accuracy will be.</p>
    </sec>
    <sec id="sec-3">
      <title>Estimating Crowd Assessor Accuracy</title>
      <p>
        This section aims at providing an intuitive overview of the proposed
unsupervised estimators of the accuracy of a crowd assessor; more details can be found
in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Figure 1 shows the main steps (granularity, gap and weight) we use to
estimate the accuracy of a crowd assessor and the different estimators we can
obtain by combining the various alternatives at each step. The idea is to
compare the crowd assessor against a set of random assessors and to measure how "different"
this crowd assessor is from the random ones, i.e. how much better she/he is.
      </p>
      <p>[Figure 1: the measure M<sub>k</sub> of the crowd assessor W<sub>k</sub> is compared with the measures M<sub>h</sub><sup>p</sup> of the random assessors ρ<sub>h</sub><sup>p</sup> to obtain a gap G<sub>k</sub>, which is then turned into a weight w<sub>k</sub>. Gaps: measure level (Frobenius norm, RMSE), distribution level (KL divergence), rankings level (Kendall's tau, AP correlation). Weights: minimal dissimilarity, minimal squared dissimilarity, minimal equi-dissimilarity. Estimator labels: fro_md, fro_msd, fro_med, rmse_md, rmse_msd, rmse_med, kld_md, kld_msd, kld_med, tau_md, tau_msd, tau_med, apc_md, apc_msd, apc_med.]</p>
      <p>For each pool, we generate a set of H random assessors ρ<sub>h</sub><sup>p</sup>, h = 1, 2, …, H, of
level p, i.e. assessors which randomly judge as relevant a fraction p of the documents
in the pool. We consider three different classes of random assessors: the uniform
random assessor with p = 0.5, the underestimating random assessor with p = 0.05, and
the overestimating random assessor with p = 0.95. Each of these random assessors
gives rise to an assessor measure M<sub>h</sub><sup>p</sup> for a given performance measure m(·).</p>
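      <p>The generation of the random assessors can be sketched in Python as follows. This is a minimal illustration which assumes, in line with the Bin(1, p) definition of Section 2, that each pooled document is independently labelled as relevant with probability p; the names <monospace>LEVELS</monospace>, <monospace>random_assessor</monospace> and <monospace>random_assessors</monospace>, the pool of 1000 hypothetical document identifiers and the fixed seed are assumptions of the sketch.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random

# The three classes of random assessors and their level p.
LEVELS = {"under": 0.05, "uniform": 0.5, "over": 0.95}


def random_assessor(pool, p, rng):
    """Label each pooled document as relevant (1) with probability p, else 0."""
    return {doc: 1 if rng.random() < p else 0 for doc in pool}


def random_assessors(pool, replicates, seed=0):
    """Generate `replicates` random assessors for each of the three classes."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return {
        name: [random_assessor(pool, p, rng) for _ in range(replicates)]
        for name, p in LEVELS.items()
    }


# Hypothetical pool of 1000 document identifiers for one topic.
pool = [f"doc{i}" for i in range(1000)]
assessors = random_assessors(pool, replicates=5)
```
]]></preformat>
      <p>Each generated set of judgements can then be used to score the systems and obtain the corresponding assessor measure.</p>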
      <p>Therefore, the intuitive idea described above boils down to determining some
sort of "difference" between the measure M<sub>k</sub> of a crowd assessor W<sub>k</sub> and those
M<sub>h</sub><sup>p</sup> of the three classes of random assessors ρ<sub>h</sub><sup>p</sup> and turning this "difference" into an
estimated accuracy a<sub>t</sub><sup>k</sup> assigned to the crowd assessor W<sub>k</sub> to compute the AWARE
version of the performance measure m(·). This is achieved in two main steps:</p>
      <list list-type="bullet">
        <list-item>
          <p><italic>gap</italic> G<sub>k</sub>: this quantifies what "different" means. We consider three alternatives:</p>
          <list list-type="bullet">
            <list-item>
              <p><italic>measure level</italic>: this operates directly on the assessor measures by computing either the Frobenius norm of their difference (labelled fro) or their Root Mean Square Error (RMSE) (labelled rmse);</p>
            </list-item>
            <list-item>
              <p><italic>distribution level</italic>: this works on the performance distributions estimated from the assessor measures by using Kernel Density Estimation (KDE) and computes the Kullback-Leibler Divergence (KLD) between them (labelled kld);</p>
            </list-item>
            <list-item>
              <p><italic>rankings level</italic>: this considers the system rankings induced by the assessor measures and compares them by using either the Kendall's tau correlation (labelled tau) or the AP correlation (labelled apc);</p>
            </list-item>
          </list>
        </list-item>
        <list-item>
          <p><italic>weight</italic> w<sub>t</sub><sup>k</sup>: this turns the gap computed in the previous step into an estimated accuracy to be assigned to a crowd assessor. In particular, we reason in terms of dissimilarity from random assessors since, for a crowd assessor W<sub>k</sub>, being close to a random one ρ<sub>h</sub><sup>p</sup> can be considered as an indicator of her/his poor quality. We have three alternatives:</p>
          <list list-type="bullet">
            <list-item>
              <p><italic>minimal dissimilarity</italic> (labelled md): this computes a weight which is proportional to the minimum gap from one of the random assessor classes, i.e. the closer to one of the random assessors, the smaller the weight;</p>
            </list-item>
            <list-item>
              <p><italic>minimal squared dissimilarity</italic> (labelled msd): this is similar to the previous case but uses the minimum squared gap;</p>
            </list-item>
            <list-item>
              <p><italic>minimal equi-dissimilarity</italic> (labelled med): this computes a weight which is proportional to the crowd assessor being equally distant from all three families of random assessors.</p>
            </list-item>
          </list>
        </list-item>
      </list>
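      <p>The two steps above can be sketched in Python for the measure-level gaps and for the md and msd weights. This is only a rough illustration under stated assumptions: an assessor measure is represented as a topics-by-systems matrix of scores, the gaps towards the replicates of one random class are averaged, and the squared variant squares the mean gap. The exact estimators are defined in [<xref ref-type="bibr" rid="ref2">2</xref>] and may differ; the med weight is not sketched because its formula is not given here. All function names and numbers are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def frobenius_gap(m_k, m_r):
    """Frobenius norm of the difference between two score matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_k, row_r in zip(m_k, m_r)
                         for a, b in zip(row_k, row_r)))


def rmse_gap(m_k, m_r):
    """Root mean square error between two score matrices."""
    n_entries = sum(len(row) for row in m_k)
    return frobenius_gap(m_k, m_r) / math.sqrt(n_entries)


def mean_gap(m_k, replicates, gap):
    """Average gap between m_k and the H replicates of one random class."""
    return sum(gap(m_k, m_r) for m_r in replicates) / len(replicates)


def md_weight(mean_gaps):
    """Minimal dissimilarity: the smallest mean gap over the random classes."""
    return min(mean_gaps.values())


def msd_weight(mean_gaps):
    """Minimal squared dissimilarity: the smallest squared mean gap."""
    return min(g ** 2 for g in mean_gaps.values())


def normalize(weights):
    """Rescale non-negative weights to sum to one (an assumption)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


# Hypothetical 2 topics x 2 systems score matrices (rows are topics).
m_crowd = [[0.6, 0.8], [0.4, 0.6]]
random_measures = {
    "under": [[[0.0, 0.1], [0.0, 0.1]], [[0.1, 0.0], [0.1, 0.0]]],
    "uniform": [[[0.5, 0.7], [0.3, 0.6]], [[0.6, 0.7], [0.4, 0.5]]],
    "over": [[[0.2, 0.3], [0.2, 0.2]], [[0.3, 0.3], [0.1, 0.2]]],
}
gaps = {name: mean_gap(m_crowd, reps, rmse_gap)
        for name, reps in random_measures.items()}
weight_md = md_weight(gaps)
weight_msd = msd_weight(gaps)
```
]]></preformat>
      <p>Taking the minimum over the three classes means that a crowd assessor who resembles any kind of random assessor receives a small weight, which matches the intuition stated above.</p>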
      <p>For each of the three random assessor classes, we generate a set of H replicates
to cope with the uncertainty of the random generation process and to obtain
better estimates. Therefore, for each crowd assessor W<sub>k</sub>, we obtain a set of H
estimates and we need to aggregate them into a single one; we compute a mean
gap G<sub>k</sub>, averaging over the set of H gaps computed with respect to each random
assessor ρ<sub>h</sub><sup>p</sup>.</p>
      <p>Finally, the described procedure produces an estimated accuracy a<sub>t</sub><sup>k</sup> to be
assigned to a crowd assessor W<sub>k</sub> for each topic t ∈ T; this is what we call
topic-by-topic score granularity, labelled tpc. However, we are also interested in the
case when a single accuracy score is assigned to a crowd assessor W<sub>k</sub>, i.e. when
the a<sub>t</sub><sup>k</sup> are the same for all the topics; this is what we call single score granularity,
labelled sgl.</p>
    </sec>
    <sec id="sec-4">
      <title>Experimental Evaluation</title>
      <sec id="sec-4-1">
        <title>Experimental Setup</title>
        <p>
          We use the TREC 21, 2012, Crowdsourcing [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] data sets developed in the Text
Relevance Assessing Task (TRAT). The TRAT required participating groups
to simulate the relevance assessing role of NIST for 10 of the TREC 08,
1999, Ad-hoc topics [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Participating groups had to submit binary relevance
judgements for every document in the judging pools of the ten topics. Two TREC
Ad-hoc tracks used these 10 topics over the years: the TREC 08, 1999, Ad-hoc
track [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] (labeled T08), and the TREC 13, 2004, Robust track [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] (labeled T13).
        </p>
        <p>
          When it comes to the measures for evaluating the effectiveness of the different
approaches, we adopt two criteria used in the TREC 22, 2013, Crowdsourcing
track [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], referred to as rank correlation and score accuracy. We use Average
Precision (AP) correlation [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to compare the ranking of the systems produced
for a given performance measure m(·), computed over the gold standard, with
the ranking produced for the same performance measure computed
over the ground-truth generated by one of the approaches under examination.
In addition to correctly ranking systems, it is important that the performance
scores are as accurate as possible. To this end, for a given performance measure
m(·), we use the RMSE between the performance measure computed over the
gold standard and the one computed over the ground-truth created by one of
the approaches under examination.
        </p>
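        <p>To make the two criteria concrete, the following minimal Python sketch computes AP correlation, following its standard definition in [<xref ref-type="bibr" rid="ref10">10</xref>], and RMSE. It is an illustration rather than the code used in the experiments; the function names, the assumption of tie-free rankings and the toy system names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def ap_correlation(reference, candidate):
    """AP correlation of a candidate system ranking against a reference one.

    Both arguments list the same systems, best first, without ties.
    """
    if sorted(reference) != sorted(candidate):
        raise ValueError("rankings must contain the same systems")
    n = len(candidate)
    if n < 2:
        raise ValueError("at least two systems are required")
    ref_pos = {system: rank for rank, system in enumerate(reference)}
    total = 0.0
    for i in range(1, n):
        current = ref_pos[candidate[i]]
        # Systems placed above position i that the reference also places above.
        agree = sum(1 for s in candidate[:i] if ref_pos[s] < current)
        total += agree / i
    return 2.0 * total / (n - 1) - 1.0


def rmse(gold_scores, estimated_scores):
    """Root mean square error between two equally long lists of scores."""
    if len(gold_scores) != len(estimated_scores) or not gold_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    squared = [(g - e) ** 2 for g, e in zip(gold_scores, estimated_scores)]
    return math.sqrt(sum(squared) / len(squared))


# Toy rankings of three hypothetical systems, best first.
gold = ["sysA", "sysB", "sysC"]
swap_top = ap_correlation(gold, ["sysB", "sysA", "sysC"])
swap_bottom = ap_correlation(gold, ["sysA", "sysC", "sysB"])
```
]]></preformat>
        <p>In this toy example, swapping the two top systems yields 0, whereas swapping the two bottom ones yields 0.5: AP correlation penalizes errors among the top-ranked systems more heavily.</p>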
        <p>When it comes to the assessor measures M<sub>k</sub> and M<sub>h</sub><sup>p</sup>, we consider Average
Precision (AP), Normalized Discounted Cumulated Gain (nDCG), and Expected
Reciprocal Rank (ERR).</p>
        <p>We consider three baselines, representing the state-of-the-art: the MV
algorithm, labeled mv, and two variants of the EM algorithm: emmv, i.e. EM seeded
by the pool generated by the MV algorithm, and emneu, i.e. EM initialized using
the worker confusion matrix. Finally, we also experiment with a fourth baseline
labeled uni, representing AWARE in the absence of any information, i.e. using uniform
accuracies for all the merged crowd assessors.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Methodology</title>
        <p>
          The goal of this section is to investigate how the AWARE approaches and
the state-of-the-art baselines behave with respect to different factors, and to
compare the AWARE approaches against those baselines. To this end, we adopt
a General Linear Mixed Model (GLMM) for the three-way ANalysis Of
VAriance (ANOVA) with repeated measures [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We are interested in determining
whether a factor effect is significant, i.e. its p-value is less than 0.05, as well as
which proportion of the variance is due to it.
        </p>
        <p>
          <bold>AP Correlation.</bold> The ANOVA table (not reported here due to space limits; see [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ])
shows that Measure is a large size effect and explains the largest share of
variance; Systems is a large size effect as well and is the second largest main
effect; finally, Approach is also a large size effect, but about 2 times smaller than
the Measure effect and 1.25 times smaller than the Systems effect. Overall, this supports
the intuition that led to the development of the AWARE framework: performance
Measures and Systems effects do matter a lot when merging assessors and they
should be taken into account.
        </p>
        <p>The Tukey HSD multiple comparison analysis reported in Figure 2a
highlights the top group (dashed blue line), the group of approaches not significantly
different from the uni baseline (dashed bright red line), the group of approaches
not significantly different from mv (dashed dark red line), and the group of
approaches not significantly different from emmv and emneu (dashed orange line).
We can note how the top group is separated from the others while the uni and
mv groups partially overlap. In particular, we can see that the approaches
significantly better than all the others are sgl_tau_msd (the top one), sgl_apc_msd,
tpc_apc_msd, and sgl_tau_md, suggesting that the single score granularity is
preferable to the topic-by-topic one and that the tau and apc gaps help to rank
systems better. State-of-the-art approaches, namely mv (the best one in this
group), emmv, and emneu, are clearly separated from the top group. Finally, the
AWARE uni baseline exhibits better performance than mv, even though it is
not significantly different from it.</p>
        <p>[Fig. 2: Tukey HSD multiple comparison test for the Approach factor. Panels: (a) AP correlation; (b) RMSE.]</p>
      </sec>
      <sec id="sec-4-3">
        <title>RMSE</title>
        <p>
          The ANOVA table (not reported here due to space limits; see [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]) shows that the
Measure factor is a large size effect with the greatest impact; Approach is a large
size effect but, unlike the case of AP correlation, it is almost as important as
Measure; finally, Systems is a large size effect but much smaller than the previous
two. Overall, this further supports the intuition behind AWARE, but it also
suggests that Approaches are much more prominent for the accurate estimation
of the actual value of a performance measure (assessed by the RMSE) than for
ranking systems correctly (assessed by AP correlation).
        </p>
        <p>The top group, reported in the Tukey HSD comparison of Figure 2b, consists
of sgl_rmse_med, tpc_rmse_med, tpc_fro_med (the top ones with extremely close
performance), sgl_fro_med, and sgl_kld_md; this suggests that there is more
balance between single and topic-by-topic score granularities and that the gaps
operating closer to the assessor measures (fro, rmse, kld) are more effective.
State-of-the-art approaches are clearly distinct from the top group and, in this
case, AWARE uni is significantly better than mv and the rest of them.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>In this paper, we presented the AWARE framework for robustly combining
performance measures coming from multiple crowd assessors. The idea of AWARE
stemmed from the observation of the potential impact of both performance
measures and systems when it comes to correctly labeled/mis-labeled relevance
judgements. Therefore, we proposed a probabilistic framework to take systems
and performance measures into account during the estimation of the crowd
assessors' accuracies used to combine them. We then exemplified how to instantiate
the proposed stochastic framework by introducing many unsupervised estimators
of the accuracy of crowd assessors.</p>
      <p>Finally, we conducted a thorough evaluation on TREC collections,
comparing AWARE against state-of-the-art approaches and studying their influencing
factors. The experimentation has provided multiple pieces of evidence supporting the
intuition behind the AWARE framework. Moreover, it has shown that AWARE
approaches perform better than state-of-the-art ones in terms of both ranking
systems and correctly predicting their performance scores.</p>
      <p>As future work, we will investigate multi-feature estimators, i.e. estimators
that take into account multiple performance measures at the same time to
determine the accuracy of a crowd assessor, and supervised estimators, i.e. estimators
that leverage a gold standard instead of random assessors for determining the
accuracy of a crowd assessor; we will also extend the experiments to graded-relevance
judgements.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ferrante</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maistro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Towards a Formal Framework for Utility-oriented Measurements of Retrieval Effectiveness</article-title>
          .
          In <source>ICTIR</source>
          , pp.
          <fpage>21</fpage>
          –
          <lpage>30</lpage>
          ,
          <publisher-name>ACM</publisher-name>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ferrante</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maistro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>AWARE: Exploiting Evaluation Measures to Combine Multiple Assessors</article-title>
          .
          <source>TOIS</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ),
          <fpage>20:1</fpage>
          –
          <lpage>20:38</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Halvey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villa</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>SIGIR 2014 Workshop on Gathering Efficient Assessments of Relevance (GEAR)</article-title>
          .
          In <source>SIGIR</source>
          , p.
          <fpage>1293</fpage>
          ,
          <publisher-name>ACM</publisher-name>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Marcus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parameswaran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Crowdsourced Data Management: Industry and Academic Perspectives</article-title>
          .
          <source>Foundations and Trends® in Databases</source>
          ,
          <volume>6</volume>
          (
          <issue>1-2</issue>
          ), pp.
          <fpage>1</fpage>
          –
          <lpage>16</lpage>
          ,
          <publisher-name>Now Publishers, Inc.</publisher-name>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Maxwell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delaney</surname>
            ,
            <given-names>H.D.</given-names>
          </string-name>
          :
          <source>Designing Experiments and Analyzing Data. A Model Comparison Perspective</source>
          .
          <publisher-name>Lawrence Erlbaum Associates</publisher-name>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Smucker</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lease</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the TREC 2012 Crowdsourcing Track</article-title>
          . In <source>TREC</source>, NIST, Special Publication 500-298
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Smucker</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lease</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the TREC 2013 Crowdsourcing Track</article-title>
          . In <source>TREC</source>, NIST, Special Publication 500-302
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          :
          <article-title>Overview of the TREC 2004 Robust Track</article-title>
          . In <source>TREC</source>, NIST, Special Publication 500-261
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>D.K.</given-names>
          </string-name>
          :
          <article-title>Overview of the Eighth Text REtrieval Conference (TREC-8)</article-title>
          . In <source>TREC</source>, pp.
          <fpage>1</fpage>
          –
          <lpage>24</lpage>
          , NIST, Special Publication 500-246
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Yilmaz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aslam</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          :
          <article-title>A New Rank Correlation Coefficient for Information Retrieval</article-title>
          . In <source>SIGIR</source>, pp.
          <fpage>587</fpage>
          –
          <lpage>594</lpage>
          ,
          <publisher-name>ACM</publisher-name>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>