<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>s-AWARE: Using Crowd Judgements in Supervised Measure-Based Methods for IR Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Ferrante</string-name>
          <email>ferrante@math.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Piazzon</string-name>
          <email>piazzonl@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics "Tullio Levi-Civita", University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Crowdsourcing methodologies have recently emerged as a cheap and fast alternative to the traditional document assessment process for ground truth creation. Early approaches make use of voting and/or classification methodologies to combine crowd judgements into a merged pool, used as reference in the evaluation process. A measure-based approach has instead been used in Assessor-driven Weighted Averages for Retrieval Evaluation (AWARE) [3], focusing on optimizing the final evaluation measure without merging judgements at pool level. s-AWARE extends AWARE with a set of supervised methods. We rely on several TREC collections to evaluate s-AWARE and we show that it outperforms state-of-the-art methods. Moreover, our results show that, when moving to the real-case scenario where a crowd-assessor only judges a portion of the dataset, s-AWARE is quite an effective approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Document assessment for ground-truth creation is one of the most demanding
tasks in preparing an experimental collection, in terms of both time and cost,
and it has traditionally been performed by relying on expert assessors [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Crowdsourcing methodologies [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have recently been exploited for a faster and
cheaper collection of multiple, even less qualified, document assessments. These
judgements are used together in the evaluation process with the objective of
achieving a proficient evaluation, comparable to the traditional one. The most
common way to use crowd judgements is to create a merged pool to be used
as the gold standard for evaluation. Since errors in the merged pool can
unfairly affect evaluation measures, in our work we moved the merging process
to the measure level, as first proposed in Assessor-driven Weighted Averages for
Retrieval Evaluation (AWARE) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Performance measures are first computed
based on each crowd-assessor's judgements and then merged, weighting them by an
estimate of each assessor's accuracy, computed by means of unsupervised
estimators. s-AWARE extends AWARE and uses supervised estimators based on the
closeness between each assessor and the gold standard on a small set of
training topics. We evaluated s-AWARE against state-of-the-art supervised
and unsupervised methods by using several TREC datasets, achieving promising
results. This paper is an extended abstract of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0). This volume is published
and copyrighted by its editors. IRCDL 2021, February 18-19, 2021, Padua, Italy.</p>
      <p>This extended abstract describes the s-AWARE methodology and its
performance, presenting related work (Section 2), a description of the
approaches (Section 3), the experiments we performed (Section 4) and possible future
extensions (Section 5).</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <p>
        One of the first crowd-assessor merging approaches is Majority Vote
(MV) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which assigns to each document the judgement proposed by the
majority of the crowd-assessors; weighted versions of MV have been proposed to boost
proficient assessors, e.g. [
        <xref ref-type="bibr" rid="ref11 ref12">12,11</xref>
        ].
      </p>
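      <p>As a minimal illustration (our own sketch, not code from the cited papers), plain MV over binary judgements can be written as follows; the function name, data layout, and tie-breaking rule (ties go to non-relevant) are our assumptions.</p>
      <preformat>
from collections import Counter

def majority_vote(judgements):
    """Merge crowd judgements by majority vote.

    judgements: dict mapping doc_id to the list of binary labels
    (0/1) given by the crowd-assessors for that document.
    """
    merged = {}
    for doc_id, labels in judgements.items():
        counts = Counter(labels)
        # highest vote count wins; on ties prefer the smaller label (0)
        merged[doc_id] = max(counts, key=lambda lbl: (counts[lbl], -lbl))
    return merged
      </preformat>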
      <p>
        Expectation Maximization (EM) algorithms optimize the probability of
relevance of each document in an unsupervised [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or semi-supervised [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] way and
then assign to each document the most probable judgement. Another EM
alternative [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] uses a variant of the same algorithm to estimate assessor reliability,
which is then used to weight crowd judgements.
      </p>
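      <p>For intuition, here is a heavily simplified one-coin EM sketch (our own illustration, not the exact algorithms of the cited works): it alternates between estimating per-document relevance probabilities and per-worker accuracies.</p>
      <preformat>
import numpy as np

def em_relevance(labels, num_iters=50):
    """Simplified one-coin EM for binary crowd labels.

    labels: array of shape (num_docs, num_workers) with 0/1 entries.
    Returns the posterior probability of relevance for each document.
    """
    L = np.asarray(labels, dtype=float)
    p = L.mean(axis=1)  # init: per-document relevance probability
    for _ in range(num_iters):
        # M-step: per-worker accuracy = expected agreement with soft labels
        acc = (p[:, None] * L + (1 - p)[:, None] * (1 - L)).mean(axis=0)
        acc = np.clip(acc, 1e-6, 1 - 1e-6)
        # E-step: per-document posterior assuming independent workers
        log_odds = (np.log(acc / (1 - acc)) * (2 * L - 1)).sum(axis=1)
        p = 1.0 / (1.0 + np.exp(-log_odds))
    return p
      </preformat>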
      <p>One weakness of the pool merging strategies described above is that they may
propagate mislabelling errors to the evaluation measures. Different measures could
even be differently affected by the same pool error. Assessor-driven Weighted
Averages for Retrieval Evaluation (AWARE) tries to overcome this problem
by performing the evaluation on the judgements given by every crowd-assessor and
combining the obtained measures, weighting each assessor by an accuracy
estimated in an unsupervised way, favouring assessors that behave differently from
a set of fake random assessors:</p>
      <p>
        <disp-formula>
          <tex-math><![CDATA[ \mathrm{aware}(\hat{r}_t) = \sum_{k=1}^{m} \hat{r}_{tk} \, \frac{a_k(t)}{\sum_{h=1}^{m} a_h(t)} ]]></tex-math>
        </disp-formula>
where m is the number of crowd-assessors to merge, r̂_tk is the value of
the performance measure computed on run r for topic t according to the
k-th crowd-assessor, and a_k(t) is the accuracy of the k-th crowd-assessor. AWARE
computes accuracies as a function of the distance from random assessors: the
farther a crowd-assessor is from a set of random assessors, the better it is.
      </p>
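      <p>Operationally, the formula is a convex combination of the per-assessor measures. A minimal NumPy sketch (the function name and array shapes are our own, for illustration):</p>
      <preformat>
import numpy as np

def aware(measures, accuracies):
    """AWARE combination for one run r and one topic t.

    measures:   shape (m,), the measure value according to each of
                the m crowd-assessors (the r̂_tk in the formula).
    accuracies: shape (m,), the accuracy a_k(t) of each assessor.
    """
    measures = np.asarray(measures, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    weights = accuracies / accuracies.sum()  # a_k(t) / sum_h a_h(t)
    return float(np.dot(weights, measures))  # weighted average
      </preformat>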
    </sec>
    <sec id="sec-3">
      <title>s-AWARE Methodology</title>
      <p>To describe the s-AWARE accuracy estimation, we consider the matrix M^k
containing the measures computed for a set S of systems and a set T of topics
based on the judgements issued by the k-th crowd-assessor, and we define M
as the gold standard measures matrix. The idea behind s-AWARE is to assign
a higher accuracy to assessors that behaved similarly to the gold standard on
a set of training topics. We consider the two best performing approaches used
in AWARE to quantify the "closeness" C_k to the gold standard (a code sketch
for both follows the list):
- Measure closeness: we consider the Root Mean Square Error (RMSE) between
the crowd-measure and the gold standard one
        <disp-formula>
          <tex-math><![CDATA[ C_k = \mathrm{RMSE}\big(M^k(\cdot, S), M(\cdot, S)\big) = \sqrt{\frac{\sum_{s=1}^{|S|} \big( \bar{M}^k(\cdot, s) - \bar{M}(\cdot, s) \big)^2}{|S|}} ]]></tex-math>
        </disp-formula>
where M̄(·, s) indicates the average measure by topic for system s;
- Ranking of Systems closeness: we use Kendall's τ correlation between the
ranking of systems based on the crowd-measures and the gold standard one
        <disp-formula>
          <tex-math><![CDATA[ C_k = \tau\big(M^k(\cdot, S), M(\cdot, S)\big) = \frac{A - D}{|S|(|S| - 1)/2} ]]></tex-math>
        </disp-formula>
where A is the number of system pairs ranked in the same order in M^k(·, S)
and M(·, S), and D is the number of discordant pairs.</p>
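      <p>A Python sketch of the two closeness computations follows; the matrix orientation (topics on rows, systems on columns) and the function names are our assumptions, and only the training topics would be used as rows.</p>
      <preformat>
import numpy as np
from scipy.stats import kendalltau

def rmse_closeness(M_k, M_gold):
    """RMSE between the topic-averaged crowd measures and the gold
    ones; M_k and M_gold have shape (num_topics, num_systems).
    Smaller values mean the assessor is closer to the gold standard."""
    avg_k = M_k.mean(axis=0)       # average measure per system
    avg_gold = M_gold.mean(axis=0)
    return float(np.sqrt(np.mean((avg_k - avg_gold) ** 2)))

def tau_closeness(M_k, M_gold):
    """Kendall's tau between the system rankings induced by the
    topic-averaged crowd measures and the gold ones; 1 means the
    two rankings are identical."""
    tau, _ = kendalltau(M_k.mean(axis=0), M_gold.mean(axis=0))
    return float(tau)
      </preformat>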
      <p>
        The C_k values are then normalized to the [0, 1] range, so that a normalized C_k equal to
1 corresponds to gold standard behaviour. Squared and cubed C_k are also considered, to
sharpen the distinction between good and bad assessors.
      </p>
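      <p>The exact normalization is not spelled out here; a plausible sketch maps τ from [-1, 1] and RMSE from [0, ∞) into [0, 1] so that gold standard behaviour yields 1, with an optional power for the squared/cubed variants. The 1/(1+x) mapping for RMSE is our assumption.</p>
      <preformat>
import numpy as np

def accuracy_from_tau(tau, power=1):
    """Map Kendall's tau into [0, 1]; tau = 1 (gold standard
    behaviour) yields 1. Use power=2 or power=3 for the squared
    and cubed variants."""
    return ((np.asarray(tau, dtype=float) + 1.0) / 2.0) ** power

def accuracy_from_rmse(rmse, power=1):
    """Map an RMSE into (0, 1]; RMSE = 0 (gold standard behaviour)
    yields 1, larger errors yield smaller accuracies."""
    return (1.0 / (1.0 + np.asarray(rmse, dtype=float))) ** power
      </preformat>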
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <sec id="sec-4-1">
        <title>Setup</title>
        <p>
          We compared the s-AWARE approaches against MV, EM with MV seeding (emmv) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ],
AWARE with uniform accuracy scores (uniform), the unsupervised AWARE
(u-AWARE) unsup rmse tpc and unsup tau tpc approaches (using RMSE
and Kendall's τ, respectively, for the gap computation), the Georgescu-Zhu EM method (hard
labels, PN discrimination, no-boost version) (emGZ) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and semi-supervised EM
(using 30% of the documents as training set) (emsemi) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>In our evaluation, for each approach, we evaluated the systems with Average
Precision (AP), and we assessed each approach's performance by computing the
AP Correlation (APC) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] between the ranking induced by the AP values and the
gold standard ranking.</p>
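        <p>For reference, here is a compact sketch of the AP Correlation of [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], computed from two score vectors over the same systems; the interface is our own, and ties are assumed away as in the original definition.</p>
        <preformat>
def ap_correlation(estimated, gold):
    """tau_AP between the ranking induced by the estimated scores
    and the one induced by the gold scores (larger score means a
    better rank). Assumes no ties among scores."""
    n = len(estimated)
    # systems ordered by the estimated ranking, best first
    order = sorted(range(n), key=lambda s: estimated[s], reverse=True)
    total = 0.0
    for i in range(1, n):
        # items above position i that the gold ranking also places
        # above the item at position i (correctly ranked pairs)
        correct = sum(1 for j in range(i) if gold[order[j]] > gold[order[i]])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0
        </preformat>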
        <p>
          We used two different collections, using the NIST judgements as gold standard:
- TREC 2012 Crowdsourcing track [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]: 31 complete pools of judgements on 10
topics common to the TREC 08 Adhoc track (T08) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and the TREC 13 Robust
track (T13) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. We used the 129 runs from T08 and the 110 runs from T13.
- TREC 2017 Common Core track (T26) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]: real crowdsourced judgements
gathered by Inel et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] from 406 crowd-assessors on short documents
(≤ 1000 words) within the NYTimes corpus. Judgements refer to 50 topics,
with exactly 7 judgements for each (topic, document) pair. We used the
75 runs from T26.
        </p>
        <p>We tested s-AWARE using only 30% of the topics as a training set. We
considered k-tuples of 2 to 7 crowd-assessors. We validated the results by
repeating both topic and assessor sampling 100 times for each k-tuple size.</p>
        <p>We performed experiments under two possible scenarios, considering Whole
Assessors and Partitioned Assessors. In the Whole Assessors case (most
favorable to supervised approaches but quite unrealistic), each crowd-assessor
completely judges all the topics. Whole Assessors data is available only for the T08
and T13 tracks. In the Partitioned Assessors case (the real-case scenario, more
challenging for supervised approaches), each crowd-assessor judges just a portion of
the documents for a portion of the topics. Therefore, to get the complete pools
assigned to each Partitioned Assessor, we group judgements coming from different
crowd-assessors. This is the case of the T26 track, but we also simulated this
configuration on the T08 and T13 tracks, by assembling the judgements coming
from multiple participants into each topic.</p>
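        <p>A minimal sketch of this grouping step (the data layout and the overlap rule are our assumptions):</p>
        <preformat>
def assemble_partitioned_assessor(partial_judgement_sets):
    """Build one complete pool for a Partitioned Assessor by
    unioning partial judgements from several crowd-assessors.

    partial_judgement_sets: list of dicts mapping (topic, doc)
    pairs to a label. On overlapping pairs the first assessor
    seen wins (our assumption).
    """
    pool = {}
    for judgements in partial_judgement_sets:
        for key, label in judgements.items():
            pool.setdefault(key, label)
    return pool
        </preformat>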
      </sec>
      <sec id="sec-4-2">
        <title>Main Results</title>
        <p>Table 1 reports the AP Correlation results for the tested configurations on the
test portion of the dataset (70% of the documents from 70% of the topics, the
common subset of documents unseen by both s-AWARE and emsemi). The best
performing approach in the Whole Assessors case is our sup tau cubed,
consistently achieving better performance than all the other approaches.
More generally, as expected, the s-AWARE approaches outperform the
baselines and the corresponding unsupervised u-AWARE approaches, which in turn
significantly outperform the baselines.</p>
        <p>We notice very poor performance of emGZ and only a small improvement of
emsemi with respect to emmv. This is probably due to the very limited amount
of training data, which s-AWARE exploits more effectively.</p>
        <p>In the Partitioned Assessors case we face a different situation, where the s-AWARE
advantage over the u-AWARE approaches is limited. On T08 and T13, the
unsup rmse tpc u-AWARE method generally performs better than s-AWARE,
but s-AWARE still outperforms the other u-AWARE approaches and the
baselines. This narrower gap supports the idea that the Partitioned Assessors case
is less favorable to supervised approaches, since the training phase reflects less
closely what happens in the test phase. In general, we can observe that s-AWARE still
performs remarkably better than emsemi.</p>
        <p>Looking at T26, the s-AWARE approaches always outperform all the other
approaches, with sup tau cubed achieving the best performance for all k-tuples.
This is very promising since, while the Partitioned Assessors for T08 and T13 are
simulated, T26 is the only dataset obtained from real crowd-assessors, showing
good performance in a real-case scenario. In fact, we hypothesize that the weaker
performance on T08 and T13 may be due to the somewhat greater fragmentation of the
simulated partitioned assessors, i.e. smaller pieces from more crowd-assessors,
with respect to the T26 ones.</p>
        <p>In all our results, Kendall's τ performs better than RMSE for the s-AWARE
"closeness" accuracy computation, and the cubed and squared s-AWARE approaches achieve,
in general, better performance than the basic closeness approach, since they
emphasize more sharply the difference between good and bad assessors. Moreover,
the results highlight that the s-AWARE approaches can obtain good results even with
small k-tuple sizes.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>
        We presented s-AWARE [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a methodology for merging crowd-assessor judgements that
extends the AWARE approach with supervised techniques. We tested s-AWARE against
a set of unsupervised and supervised baselines, highlighting its effectiveness
in the very challenging real-case scenario where only 30% of the
documents were used for training. s-AWARE outperforms all the other approaches in the
Whole Assessors case and is still quite robust in the Partitioned Assessors case.
In the future, we plan to extend the AWARE framework to better deal with partial
assessments, assigning an accuracy score to each real crowd-assessor and avoiding
the need to group judgements as done in the Partitioned Assessors case.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Allan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>D.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gysel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.M.:</given-names>
          </string-name>
          <article-title>TREC 2017 Common Core Track Overview</article-title>
          . In: Voorhees,
          <string-name>
            <given-names>E.M.</given-names>
            ,
            <surname>Ellis</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . (eds.)
          <source>The Twenty-Sixth Text REtrieval Conference Proceedings (TREC</source>
          <year>2017</year>
          ).
          <article-title>National Institute of Standards and Technology (NIST</article-title>
          ),
          <source>Special Publication 500-324</source>
          , Washington, USA (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alonso</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>The Practice of Crowdsourcing</article-title>
          . Morgan &amp; Claypool Publishers, USA (May
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ferrante</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maistro</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>AWARE: Exploiting Evaluation Measures to Combine Multiple Assessors</article-title>
          .
          <source>ACM Transactions on Information Systems</source>
          <volume>36</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1</fpage>
          -38 (Aug
          <year>2017</year>
          ). https://doi.org/10.1145/3110217
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ferrante</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piazzon</surname>
          </string-name>
          , L.:
          <article-title>s-aware: Supervised measure-based methods for crowd-assessors combination</article-title>
          . In: Arampatzis,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Tsikrika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Vrochidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Joho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Eickhoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Névéol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          , N. (eds.)
          <string-name>
            <surname>Experimental IR Meets Multilinguality</surname>
          </string-name>
          , Multimodality, and Interaction. pp.
          <volume>16</volume>
          -
          <fpage>27</fpage>
          . Springer International Publishing,
          <string-name>
            <surname>Cham</surname>
          </string-name>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Georgescu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Aggregation of crowdsourced labels based on worker history</article-title>
          .
          <source>In: Proceedings of the 4th International Conference on Web Intelligence</source>
          ,
          <article-title>Mining and Semantics (WIMS14)</article-title>
          .
          <source>WIMS '14</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA (
          <year>2014</year>
          ). https://doi.org/10.1145/2611040.2611074
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hosseini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cox</surname>
            ,
            <given-names>I.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milic-Frayling</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinay</surname>
          </string-name>
          , V.:
          <article-title>On aggregating labels from multiple crowd workers to infer relevance of documents</article-title>
          .
          <source>In: Proceedings of the 34th European Conference on Advances in Information Retrieval</source>
          . pp.
          <volume>182</volume>
          -
          <fpage>194</fpage>
          . ECIR'
          <volume>12</volume>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Inel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haralabopoulos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gysel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szlavik</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simperl</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aroyo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Studying Topical Relevance with Evidence-based Crowdsourcing</article-title>
          . In: Cuzzocrea,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Allan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Paton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.W.</given-names>
            ,
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Broder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Zaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.J.</given-names>
            ,
            <surname>Candan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Labrinidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Wang</surname>
          </string-name>
          , H. (eds.)
          <source>Proc. 27th International Conference on Information and Knowledge Management (CIKM</source>
          <year>2018</year>
          ). pp.
          <volume>1253</volume>
          -
          <fpage>1262</fpage>
          . ACM Press, New York, USA (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval (FnTIR) 4(4</article-title>
          ),
          <volume>247</volume>
          -
          <fpage>375</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Smucker</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lease</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the TREC 2012 Crowdsourcing Track</article-title>
          . In: Voorhees,
          <string-name>
            <given-names>E.M.</given-names>
            ,
            <surname>Buckland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.P. (eds.) The</given-names>
            <surname>Twenty-First Text REtrieval Conference Proceedings</surname>
          </string-name>
          (TREC
          <year>2012</year>
          ).
          <article-title>National Institute of Standards and Technology (NIST</article-title>
          ),
          <source>Special Publication 500-298</source>
          , Washington, USA (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lease</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Semi-supervised consensus labeling for crowdsourcing</article-title>
          .
          <source>In: Proceedings of the SIGIR 2011 workshop on crowdsourcing for information retrieval (CIR)</source>
          . pp.
          <volume>36</volume>
          -
          <fpage>41</fpage>
          .
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Tao</surname>
          </string-name>
          , D., Cheng, J.,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yue</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Domain-weighted majority voting for crowdsourcing</article-title>
          .
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <volume>163</volume>
          -174 (Jan
          <year>2019</year>
          ). https://doi.org/10.1109/tnnls.2018.2836969
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qiaoben</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Max-margin majority voting for learning from crowds</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>41</volume>
          (
          <issue>10</issue>
          ),
          <volume>2480</volume>
          -2494 (Oct
          <year>2019</year>
          ). https://doi.org/10.1109/tpami.2018.2860987
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Voorhees, E.M.: Overview of the TREC 2004 Robust Track. In: Voorhees, E.M., Buckland, L.P. (eds.) The Thirteenth Text REtrieval Conference Proceedings (TREC 2004). National Institute of Standards and Technology (NIST), Special Publication 500-261, Washington, USA (2004)</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Voorhees, E.M., Harman, D.K.: Overview of the Eighth Text REtrieval Conference (TREC-8). In: Voorhees, E.M., Harman, D.K. (eds.) The Eighth Text REtrieval Conference (TREC-8). pp. 1-24. National Institute of Standards and Technology (NIST), Special Publication 500-246, Washington, USA (1999)</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Yilmaz, E., Aslam, J.A., Robertson, S.E.: A New Rank Correlation Coefficient for Information Retrieval. In: Chua, T.S., Leong, M.K., Oard, D.W., Sebastiani, F. (eds.) Proc. 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008). pp. 587-594. ACM Press, New York, USA (2008)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>