
How to Robustly Combine Judgements from Crowd Assessors with AWARE?

Marco Ferrante (1)
ferrante@math.unipd.it

Nicola Ferro (0)
ferro@dei.unipd.it

Maria Maistro (0)
maistro@dei.unipd.it

(0) Department of Information Engineering, University of Padua, Padua, Italy
(1) Department of Mathematics, University of Padua, Padua, Italy

We propose the Assessor-driven Weighted Averages for Retrieval Evaluation (AWARE) probabilistic framework, a novel methodology for dealing with multiple crowd assessors, who may be contradictory and/or noisy. By modeling relevance judgements and crowd assessors as sources of uncertainty, AWARE directly combines the performance measures computed on the ground-truth generated by the crowd assessors instead of adopting some classification technique to merge the labels produced by them. We propose several unsupervised estimators that instantiate the AWARE framework and we compare them with Majority Vote (MV) and Expectation Maximization (EM), showing that AWARE approaches improve both in correctly ranking systems and in predicting their actual performance scores.

Keywords: crowdsourcing, unsupervised estimators, AWARE

Ground-truth is central to the data processing area, as in top-k ranking in databases, information retrieval, natural language processing, video and image processing, information extraction and many others. Although ground-truth enables the scoring and comparison of algorithms with respect to human judgments, creating a dataset and, in particular, gathering relevance assessments is an extremely demanding activity. There is therefore an increasing interest in more effective and affordable ways of gathering assessments [3].

Crowdsourcing [4] has emerged as a viable option for ground-truth creation since it allows multiple assessments to be collected cheaply for each task. However, it raises many questions regarding the quality of the collected assessments. Therefore, in order to obtain a ground-truth good enough to be used for evaluation purposes, researchers have considered discarding low quality assessors and/or combining assessors with more or less sophisticated algorithms.

The problem of merging multiple crowd assessors has been addressed mostly from a classification point of view, with traditional approaches which focus mainly on how to select assessors and/or discard low quality assessors and how to merge judgments from multiple assessors. We can consider this as a kind of "upstream" approach, because the aggregated ground-truth is created before systems are evaluated and performance scores are computed.

In this paper, we address the problem of ground-truth creation from a new angle, i.e. we investigate how to estimate performance measures in a way that is more robust to crowd assessors. In particular, we seek a better estimation of the true expected value of a performance measure, by leveraging its multiple observations, generated separately by the relevance judgements of each crowd assessor. We can consider this as a kind of "downstream" approach, since the aggregation happens after performance measures have been computed.

The main intuition behind our approach is that the choice of the "best" relevance judgments, made in advance at the pool level, may have a different impact on different systems and on various performance measures. Indeed, systems rank the same documents differently and therefore the same correctly labelled or mis-labelled documents impact the performances of different systems in different ways. Therefore, we propose the Assessor-driven Weighted Averages for Retrieval Evaluation (AWARE) probabilistic framework, which allows us to combine multiple versions of a performance measure, computed from the ground-truth created by each crowd assessor, into a single composite measure, referred to as the AWARE version of it. The AWARE framework specifies how performance measures have to be merged on the basis of the estimated crowd assessor accuracies, and we propose several unsupervised estimators of such accuracies. The experimentation shows that AWARE approaches improve the capability of correctly ranking systems and of predicting their actual performance scores.

The paper is organized as follows: Section 2 introduces the AWARE framework; Section 3 gives an intuitive overview of several unsupervised estimators for determining the assessors' accuracies; Section 4 carries out the experimental evaluation using TREC collections; finally, Section 5 draws some conclusions.

2 The AWARE Framework

In [1] we introduced the following definitions: let $D$ and $T$ be a set of documents and a set of topics, respectively; let $(REL, \preceq)$ be a totally ordered set of relevance degrees. For each pair $(t, d) \in T \times D$, the ground-truth $GT$ is a map which assigns a relevance degree $rel \in REL$ to a document $d$ with respect to a topic $t$.

In order to cope with and leverage crowd assessors, we assume that the relevance of a document is not deterministically known, but is described by a probability distribution: instead of specifying a single value from $REL$ as the result of the relevance assessment, we model the uncertainty entailed in the assessment process as a whole distribution of possible values associated with each $(t, d)$ pair. Furthermore, we assume that the ability of the crowd assessors is stochastically determined by a probability assigned to them, which we call their accuracy.

More precisely, we assume that there exists a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, which provides the source of randomness and encompasses the judgements made by all the possible crowd assessors, on all the possible documents for any possible topic. Considering this space, we can extend the definition of the ground-truth as $GT : \Omega \times T \times D \to REL$. In this way, to any pair $(t, d)$ we associate a random variable $GT(\cdot, t, d)$ with values in $REL$, whose distribution describes the relevance of the document $d$ with respect to the topic $t$.

Let $\mathcal{W} = \{W_1, \dots, W_l\}$ be a finite set of crowd assessors and let us assume that there exists a random variable $W : \Omega \times T \to \mathcal{W}$, whose distribution identifies the ability of a single crowd assessor with respect to any given topic. We call $a_k(t) = \mathbb{P}[T = t, W = W_k]$ the accuracy of crowd assessor $W_k$ in assessing topic $t$ and we assume that $a_k(t)$ is determined by the expected ability she/he demonstrates in assessing all the possible documents for that topic.

The easiest way to jointly cope with these random objects, i.e. ground-truth and crowd assessors, is to consider their expectations. The expected relevance of document $d$ for topic $t$, by the law of total expectation, is given by

$$\mathbb{E}\big[GT(t, d)\big] = \mathbb{E}\Big[\mathbb{E}\big[GT(t, d) \mid W\big]\Big] = \sum_{k=1}^{l} \mathbb{E}\big[GT(t, d) \mid W = W_k\big] \, a_k(t) .$$

Then, for a performance measure $m(\cdot)$, we can proceed in a similar way and define its AWARE version as its expectation with respect to $\mathbb{P}$:

$$\text{aware-}m(t, r_t) = \mathbb{E}\big[\mu(\hat{r}_t)\big] = \sum_{k=1}^{l} \mathbb{E}\big[\mu(\hat{r}_t) \mid W = W_k\big] \, a_k(t) ,$$

where $\mu$ is the scoring function associated with the performance measure $m(\cdot)$ [1], and $\hat{r}_t$ is the judged run.
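As a minimal sketch of the weighted sum above, assume that the score of a run under each crowd assessor and the corresponding accuracies are already available. The function name `aware_score`, the example values, and the normalisation of the accuracies so that they sum to one are assumptions made for illustration only.

```python
from typing import Sequence


def aware_score(scores: Sequence[float], accuracies: Sequence[float]) -> float:
    """Combine per-assessor scores of one run on one topic into a single score.

    scores[k]     -- measure computed on the ground-truth of crowd assessor W_k
    accuracies[k] -- estimated accuracy a_k(t) of crowd assessor W_k
    """
    if len(scores) != len(accuracies):
        raise ValueError("one accuracy per assessor is required")
    total = sum(accuracies)
    if total <= 0:
        raise ValueError("accuracies must have a positive sum")
    # Assumption: accuracies are rescaled to sum to one, so that the result
    # is a weighted average of the per-assessor scores.
    return sum(s * a for s, a in zip(scores, accuracies)) / total


# Hypothetical AP values of one system under three crowd assessors.
ap_per_assessor = [0.40, 0.50, 0.10]
print(aware_score(ap_per_assessor, [1.0, 1.0, 1.0]))     # uniform accuracies
print(aware_score(ap_per_assessor, [0.45, 0.45, 0.10]))  # third assessor down-weighted
```

With uniform accuracies the result is the plain mean of the per-assessor scores; lowering the accuracy of one assessor reduces the influence of her/his judgements on the final score.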

We estimate the first term by the score $\mu(\hat{r}_t^k)$ of the judged run $\hat{r}_t^k$ obtained under the assessments made by the crowd assessor $W_k$. However, the estimation of the accuracies $a_k(t) = \mathbb{P}[T = t, W = W_k]$ is somewhat more problematic. We therefore take a random assessor as a comparison point. In the case of binary relevance, i.e. when $REL = \{0, 1\}$, an assessor $W_k$ is a random assessor of parameter $p \in [0, 1]$ if, for any pair $(t, d)$, the conditional random variables $GT(t, d) \mid W = W_k \sim \mathrm{Bin}(1, p)$, where $\mathrm{Bin}(1, p)$ denotes a Binomial random variable with parameter $p$, and are mutually independent.

A random assessor, of any possible parameter $p$, is the prototype of a "bad" or at least a "shallow" assessor, since $p$ is the same for any possible pair $(t, d)$. The basic idea that we will apply in the next section is that the farther a crowd assessor is from the random ones, the better she/he is and the higher her/his accuracy will be.

3 Estimating Crowd Assessor Accuracy

This section aims at providing an intuitive overview of the proposed unsupervised estimators of the accuracy of a crowd assessor; more details can be found in [2]. Figure 1 shows the main steps (granularity, gap and weight) we use to estimate the accuracy of a crowd assessor and the different estimators we can obtain by combining the various alternatives at each step. The idea is to compare the crowd assessor against a set of random assessors and to determine how "different" this crowd assessor is from the random ones, i.e. how much better she/he is.

[Figure 1: the measure $M_k$ of a crowd assessor $W_k$ is compared with the measures $M_h^p$ of the random assessors $\rho_h^p$ to obtain a gap $G_k$, which is then turned into a weight $w_k$. Estimator labels, by gap and weight (minimal dissimilarity, minimal squared dissimilarity, minimal equi-dissimilarity): measure level (Frobenius norm, RMSE): fro_md, rmse_md, fro_msd, rmse_msd, fro_med, rmse_med; distribution level (KL divergence): kld_md, kld_msd, kld_med; rankings level (Kendall's tau, AP correlation): tau_md, apc_md, tau_msd, apc_msd, tau_med, apc_med.]

For each pool, we generate a set of $H$ random assessors $\rho_h^p$, $h = 1, 2, \dots, H$, of level $p$, i.e. assessors which randomly judge as relevant a fraction $p$ of the documents in the pool. We consider three different classes of random assessors: uniform random assessors with $p = 0.5$, underestimating random assessors with $p = 0.05$, and overestimating random assessors with $p = 0.95$. Each of these random assessors gives rise to an assessor measure $M_h^p$ for a given performance measure $m(\cdot)$.
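The generation of the random assessors can be sketched as follows. This is an illustration only: it assumes that a pool is a list of document identifiers and that each document is labelled independently with probability $p$, in line with the Bin(1, p) definition of Section 2. The function names, the seed, and the pool contents are hypothetical.

```python
import random
from typing import Dict, List

# The three classes of random assessors named in the text.
LEVELS = {"underestimating": 0.05, "uniform": 0.5, "overestimating": 0.95}


def random_assessor(pool: List[str], p: float, rng: random.Random) -> Dict[str, int]:
    """Label each pooled document relevant (1) with probability p, else 0."""
    return {doc: int(rng.random() < p) for doc in pool}


def generate_random_assessors(
    pool: List[str], H: int, seed: int = 42
) -> Dict[str, List[Dict[str, int]]]:
    """Return H replicated random assessors for each of the three classes."""
    rng = random.Random(seed)
    return {
        name: [random_assessor(pool, p, rng) for _ in range(H)]
        for name, p in LEVELS.items()
    }


pool = [f"doc{i}" for i in range(1000)]  # hypothetical pool of document ids
assessors = generate_random_assessors(pool, H=5)
for name, replicates in assessors.items():
    share = sum(replicates[0].values()) / len(pool)
    print(f"{name}: {share:.3f} of the pool judged relevant in replicate 1")
```

Each replicate can then be used as a ground-truth to score the systems, which yields the corresponding assessor measure.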

Therefore, the intuitive idea described above boils down to determining some sort of "difference" between the measure $M_k$ of a crowd assessor $W_k$ and the measures $M_h^p$ of the three classes of random assessors $\rho_h^p$, and turning this "difference" into an estimated accuracy $a_k^t$ assigned to the crowd assessor $W_k$ to compute the AWARE version of the performance measure $m(\cdot)$. This is achieved in two main steps:

- gap $G_k$: this quantifies what "different" means. We consider three alternatives:
  - measure level: this operates directly on the assessor measures by computing either the Frobenius norm of their difference (labelled fro) or their Root Mean Square Error (RMSE) (labelled rmse);
  - distribution level: this works on the performance distributions estimated from the assessor measures by using Kernel Density Estimation (KDE) and computes the Kullback-Leibler Divergence (KLD) between them (labelled kld);
  - rankings level: this considers the system rankings induced by the assessor measures and compares them by using either the Kendall's tau correlation (labelled tau) or the AP correlation (labelled apc);
- weight $w_k^t$: this turns the gap computed in the previous step into an estimated accuracy to be assigned to a crowd assessor. In particular, we reason in terms of dissimilarity from random assessors since, for a crowd assessor $W_k$, being close to a random one $\rho_h^p$ can be considered as an indicator of her/his poor quality. We have three alternatives:
  - minimal dissimilarity (labelled md): this computes a weight which is proportional to the minimum gap from one of the random assessor classes, i.e. the closer to one of the random assessors, the smaller the weight;
  - minimal squared dissimilarity (labelled msd): this is similar to the previous case but uses the minimum squared gap;
  - minimal equi-dissimilarity (labelled med): this computes a weight which is proportional to the crowd assessor being equally distant from all three families of random assessors.
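The measure-level and rankings-level gaps can be illustrated with a short sketch. It assumes that an assessor measure is a topics-by-systems matrix of scores, that the system ranking is induced by the mean score over topics, and that a correlation is turned into a gap as one minus the correlation; these three choices, like all function names, are assumptions made for illustration and are not taken from the text. The KLD and AP correlation gaps are omitted for brevity.

```python
import math
from itertools import combinations
from typing import List

Matrix = List[List[float]]  # rows: topics, columns: systems (assumed layout)


def fro_gap(m_k: Matrix, m_rand: Matrix) -> float:
    """Frobenius norm of the difference between two assessor measures."""
    return math.sqrt(
        sum((a - b) ** 2 for ra, rb in zip(m_k, m_rand) for a, b in zip(ra, rb))
    )


def rmse_gap(m_k: Matrix, m_rand: Matrix) -> float:
    """Root mean square error between two assessor measures."""
    n_cells = sum(len(row) for row in m_k)
    return fro_gap(m_k, m_rand) / math.sqrt(n_cells)


def system_means(m: Matrix) -> List[float]:
    """Mean score of each system over the topics."""
    return [sum(col) / len(m) for col in zip(*m)]


def kendall_tau(x: List[float], y: List[float]) -> float:
    """Kendall's tau-a between two score lists; tied pairs count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def tau_gap(m_k: Matrix, m_rand: Matrix) -> float:
    """Assumption: the gap is 1 - tau, so identical rankings give a gap of 0."""
    return 1.0 - kendall_tau(system_means(m_k), system_means(m_rand))


# Hypothetical measures: 2 topics, 3 systems.
m_crowd = [[0.5, 0.3, 0.1], [0.7, 0.5, 0.3]]
m_random = [[0.1, 0.3, 0.5], [0.3, 0.5, 0.7]]
print(fro_gap(m_crowd, m_random), rmse_gap(m_crowd, m_random), tau_gap(m_crowd, m_random))
```

In this toy case the random assessor reverses the ranking of the systems, so the rankings-level gap takes its largest value under the assumed definition.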

For each of the three random assessor classes, we generate a set of $H$ replicates to cope with the uncertainty of the random generation process and to obtain better estimates. Therefore, for each crowd assessor $W_k$, we obtain a set of $H$ estimates and we need to aggregate them into a single one; we compute a mean gap $G_k$, averaging over the set of $H$ gaps computed with respect to each random assessor $\rho_h^p$.
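The averaging over replicates and the md and msd weights can be sketched as below. The exact weighting formulas are given in [2]; here the weight is simply taken equal to the minimum (squared) mean gap and then rescaled across assessors so that the weights sum to one, which is an assumption made for illustration. The med weight is not sketched, and all names and gap values are hypothetical.

```python
from statistics import mean
from typing import Dict, List

Gaps = Dict[str, List[float]]  # random assessor class -> H replicate gaps


def mean_gaps(gaps: Gaps) -> Dict[str, float]:
    """Average the H replicate gaps of one crowd assessor within each class."""
    return {cls: mean(values) for cls, values in gaps.items()}


def md_weight(gaps: Gaps) -> float:
    """Minimal dissimilarity: smallest mean gap from any random assessor class."""
    return min(mean_gaps(gaps).values())


def msd_weight(gaps: Gaps) -> float:
    """Minimal squared dissimilarity: smallest squared mean gap."""
    return min(g ** 2 for g in mean_gaps(gaps).values())


def normalise(weights: List[float]) -> List[float]:
    """Assumption: rescale the weights of all crowd assessors to sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical gaps of two crowd assessors, with H = 2 replicates per class.
gaps_a = {"underestimating": [0.8, 0.6], "uniform": [0.5, 0.7], "overestimating": [0.9, 0.9]}
gaps_b = {"underestimating": [0.1, 0.3], "uniform": [0.4, 0.4], "overestimating": [0.6, 0.8]}

print(normalise([md_weight(gaps_a), md_weight(gaps_b)]))
print(normalise([msd_weight(gaps_a), msd_weight(gaps_b)]))
```

The second assessor is close to the underestimating random assessors and therefore receives the smaller weight; squaring the gap makes the penalty sharper.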

Finally, the described procedure produces an estimated accuracy $a_k^t$ to be assigned to a crowd assessor $W_k$ for each topic $t \in T$; this is what we call topic-by-topic score granularity, labelled tpc. However, we are also interested in the case when a single accuracy score is assigned to a crowd assessor $W_k$, i.e. when the $a_k^t$ are the same for all the topics; this is what we call single score granularity, labelled sgl.

4 Experimental Evaluation

4.1 Experimental Setup

We use the TREC 21, 2012, Crowdsourcing [6] data sets developed in the Text Relevance Assessing Task (TRAT). The TRAT required participating groups to simulate the relevance assessing role of NIST for 10 of the TREC 08, 1999, Ad-hoc topics [9]. Participating groups had to submit binary relevance judgements for every document in the judging pools of the ten topics. Two TREC tracks used these 10 topics over the years: the TREC 08, 1999, Ad-hoc track [9] (labeled T08), and the TREC 13, 2004, Robust track [8] (labeled T13).

When it comes to the measures for evaluating the effectiveness of the different approaches, we adopt two criteria used in the TREC 22, 2013, Crowdsourcing track [7], referred to as rank correlation and score accuracy. We use Average Precision (AP) correlation [10] to compare the ranking of the systems produced for a given performance measure $m(\cdot)$, computed over the gold standard, with the ranking produced for the same performance measure computed over the ground-truth generated by one of the approaches under examination. In addition to correctly ranking systems, it is important that the performance scores are as accurate as possible. To this end, for a given performance measure $m(\cdot)$, we use the RMSE between the performance measure computed over the gold standard and the one computed over the ground-truth created by one of the approaches under examination.
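The two criteria can be sketched as follows, assuming that system scores are stored in dictionaries keyed by system name and that there are no tied scores. AP correlation follows the definition in [10]: for each system below the top of the evaluated ranking, it takes the fraction of systems ranked above it that are also above it in the gold ranking. The function names and the scores are hypothetical.

```python
import math
from typing import Dict


def ap_correlation(gold: Dict[str, float], test: Dict[str, float]) -> float:
    """AP correlation of the ranking induced by `test` against the gold ranking."""
    ranked = sorted(test, key=test.get, reverse=True)
    n = len(ranked)
    total = 0.0
    for i in range(1, n):
        current = ranked[i]
        # Systems placed above `current` that the gold standard also places above it.
        correct = sum(1 for above in ranked[:i] if gold[above] > gold[current])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0


def rmse(gold: Dict[str, float], test: Dict[str, float]) -> float:
    """Root mean square error between gold and estimated system scores."""
    return math.sqrt(sum((gold[s] - test[s]) ** 2 for s in gold) / len(gold))


# Hypothetical scores of four systems.
gold = {"A": 0.5, "B": 0.4, "C": 0.3, "D": 0.2}
swap_top = {"A": 0.4, "B": 0.5, "C": 0.3, "D": 0.2}
swap_bottom = {"A": 0.5, "B": 0.4, "C": 0.2, "D": 0.3}
print(ap_correlation(gold, swap_top), ap_correlation(gold, swap_bottom))
print(rmse(gold, swap_top))
```

A swap among the top systems lowers AP correlation more than the same swap at the bottom of the ranking, which is the top-heaviness that distinguishes it from Kendall's tau.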

When it comes to the assessor measures $M_k$ and $M_h^p$, we consider Average Precision (AP), Normalized Discounted Cumulated Gain (nDCG), and Expected Reciprocal Rank (ERR).
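To make the notion of assessor measure concrete, the sketch below computes AP for the same run under the binary judgements of two assessors. It assumes that the recall base is the number of documents the assessor judged relevant and that unjudged documents count as not relevant; the function name, the run, and the judgements are hypothetical.

```python
from typing import Dict, List


def average_precision(run: List[str], qrels: Dict[str, int]) -> float:
    """AP of a ranked list of document ids under one assessor's binary judgements."""
    recall_base = sum(1 for rel in qrels.values() if rel > 0)
    if recall_base == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(run, start=1):
        if qrels.get(doc, 0) > 0:  # unjudged documents count as not relevant
            hits += 1
            total += hits / rank  # precision at the rank of a relevant document
    return total / recall_base


run = ["d1", "d2", "d3", "d4"]
qrels_w1 = {"d1": 1, "d2": 0, "d3": 1, "d4": 0}  # judgements of assessor W_1
qrels_w2 = {"d1": 0, "d2": 1, "d3": 0, "d4": 1}  # judgements of assessor W_2
print(average_precision(run, qrels_w1), average_precision(run, qrels_w2))
```

The same run obtains different AP values under the two assessors; collecting such values for all systems and topics gives the assessor measure of each assessor.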

We consider three baselines, representing the state-of-the-art: the MV algorithm, labeled mv, and two variants of the EM algorithm: emmv, i.e. EM seeded by the pool generated by the MV algorithm, and emneu, i.e. EM initialized using the worker confusion matrix. Finally, we also experiment with a fourth baseline labeled uni, representing AWARE in the absence of any information, i.e. using uniform accuracies for all the merged crowd assessors.
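The mv baseline can be sketched as a document-by-document vote over binary labels. The sketch assumes that every assessor judged every pooled document and that ties are resolved as not relevant; the tie-breaking rule, the function name, and the judgements are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List

Judgements = Dict[str, int]  # document id -> binary relevance label


def majority_vote(assessors: List[Judgements]) -> Judgements:
    """Merge the labels of several assessors into a single ground-truth."""
    merged = {}
    for doc in assessors[0]:
        votes = Counter(a[doc] for a in assessors)
        # Assumption: a tie is resolved as not relevant.
        merged[doc] = 1 if votes[1] > votes[0] else 0
    return merged


w1 = {"d1": 1, "d2": 0, "d3": 1}
w2 = {"d1": 1, "d2": 1, "d3": 0}
w3 = {"d1": 0, "d2": 0, "d3": 1}
print(majority_vote([w1, w2, w3]))
```

This merges labels before any system is scored, whereas the uni baseline scores the systems under each assessor separately and then averages the resulting measures with equal weights.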

4.2 Methodology

The goal of this section is to investigate how the AWARE approaches and the state-of-the-art baselines behave with respect to different factors, and to compare the AWARE approaches against those baselines. To this end, we adopt a General Linear Mixed Model (GLMM) for the three-way ANalysis Of VAriance (ANOVA) with repeated measures [5]. We are interested in determining whether a factor effect is significant, i.e. its p-value is less than 0.05, as well as what proportion of the variance is due to it.

AP Correlation

The ANOVA table (not reported here due to space limits; see [2]) shows that Measure is a large size effect and explains the largest share of variance; Systems is a large size effect as well and is the second largest main effect; finally, Approach is also a large size effect, but about 2 times smaller than the Measure effect and 1.25 times smaller than the Systems effect. Overall, this supports the intuition that led to the development of the AWARE framework: the Measure and Systems effects matter a lot when merging assessors and they should be taken into account.

The Tukey HSD multiple comparison analysis reported in Figure 2a highlights the top group (dashed blue line), the group of approaches not significantly different from the uni baseline (dashed bright red line), the group of approaches not significantly different from mv (dashed dark red line), and the group of approaches not significantly different from emmv and emneu (dashed orange line). We can note how the top group is separated from the others, while the uni and mv groups partially overlap. In particular, we can see that the approaches significantly better than all the others are sgl_tau_msd (the top one), sgl_apc_msd, tpc_apc_msd, and sgl_tau_md, suggesting that the single score granularity is preferable to the topic-by-topic one and that the tau and apc gaps help to rank systems better. State-of-the-art approaches, namely mv (the best one in this group), emmv, and emneu, are clearly separated from the top group. Finally, the AWARE uni baseline exhibits better performances than mv, even though it is not significantly different from it.

[Fig. 2: Tukey HSD multiple comparison test for the Approach factor. (a) AP correlation; (b) RMSE.]

RMSE

The ANOVA table (not reported here due to space limits; see [2]) shows that the Measure factor is a large size effect with the greatest impact; Approach is a large size effect but, unlike the case of AP correlation, it is almost as important as Measure; finally, Systems is a large size effect but much smaller than the previous two. Overall, this further supports the intuition behind AWARE, but it also suggests that the Approach factor is much more prominent for the accurate estimation of the actual value of a performance measure (assessed by the RMSE) than for ranking systems correctly (assessed by AP correlation).

The top group, reported in the Tukey HSD comparison of Figure 2b, consists of sgl_rmse_med, tpc_rmse_med, tpc_fro_med (the top ones with extremely close performances), sgl_fro_med, and sgl_kld_md; this suggests that there is more balance between single and topic-by-topic score granularities and that the gaps operating closer to the assessor measures (fro, rmse, kld) are more effective. State-of-the-art approaches are clearly distinct from the top group and, in this case, AWARE uni is significantly better than mv and the rest of them.

5 Conclusions and Future Work

In this paper, we presented the AWARE framework for robustly combining performance measures coming from multiple crowd assessors. The idea of AWARE stemmed from the observation of the potential impact of both performance measures and systems when it comes to correctly labeled/mis-labeled relevance judgements. Therefore, we proposed a probabilistic framework to take systems and performance measures into account during the estimation of the crowd assessors' accuracies used to combine them. We then exemplified how to instantiate the proposed stochastic framework by introducing many unsupervised estimators of the accuracy of crowd assessors.

Finally, we conducted a thorough evaluation on TREC collections, comparing AWARE against state-of-the-art approaches and studying their influencing factors. The experimentation has provided multiple pieces of evidence supporting the intuition behind the AWARE framework. Moreover, it has shown that AWARE approaches perform better than state-of-the-art ones in terms of both ranking systems and correctly predicting their performance scores.

As future work, we will investigate multi-feature estimators, i.e. estimators that take into account multiple performance measures at the same time to determine the accuracy of a crowd assessor, and supervised estimators, i.e. estimators that leverage a gold standard instead of random assessors for determining the accuracy of a crowd assessor. We will also extend the experiments to graded-relevance judgements.

References

1. Ferrante, M., Ferro, N., Maistro, M.: Towards a Formal Framework for Utility-oriented Measurements of Retrieval Effectiveness. In ICTIR, pp. 21-30, ACM, 2013.

2. Ferrante, M., Ferro, N., Maistro, M.: AWARE: Exploiting Evaluation Measures to Combine Multiple Assessors. In TOIS, 36(2), 20:1-20:38, 2017.

3. Halvey, M., Villa, R., Clough, P.: SIGIR 2014 Workshop on Gathering Efficient Assessments of Relevance (GEAR). In SIGIR, p. 1293, ACM, 2014.

4. Marcus, A., Parameswaran, A.: Crowdsourced Data Management: Industry and Academic Perspectives. In Foundations and Trends in Databases, 6(1-2), pp. 1-16, Now Publishers, Inc, 2015.

5. Maxwell, S., Delaney, H.D.: Designing Experiments and Analyzing Data. A Model Comparison Perspective. Lawrence Erlbaum Associates, 2004.

6. Smucker, M.D., Kazai, G., Lease, M.: Overview of the TREC 2012 Crowdsourcing Track. In TREC, NIST, Special Publication 500-298, 2013.

7. Smucker, M.D., Kazai, G., Lease, M.: Overview of the TREC 2013 Crowdsourcing Track. In TREC, NIST, Special Publication 500-302, 2014.

8. Voorhees, E.M.: Overview of the TREC 2004 Robust Track. In TREC, NIST, Special Publication 500-261, 2004.

9. Voorhees, E.M., Harman, D.K.: Overview of the Eighth Text REtrieval Conference (TREC-8). In TREC, pp. 1-24, NIST, Special Publication 500-246, 1999.

10. Yilmaz, E., Aslam, J.A., Robertson, S.E.: A New Rank Correlation Coefficient for Information Retrieval. In SIGIR, pp. 587-594, ACM, 2008.