  Consensus-based Techniques for Range-task Resolution
              in Crowdsourcing Systems

                   Lorenzo Genta                                Alfio Ferrara                   Stefano Montanelli
            Dipartimento di Informatica                Dipartimento di Informatica          Dipartimento di Informatica
           Università degli Studi di Milano           Università degli Studi di Milano     Università degli Studi di Milano
                  Via Comelico 39                            Via Comelico 39                      Via Comelico 39
                20135 - Milano, Italy                      20135 - Milano, Italy                20135 - Milano, Italy
                 genta@di.unimi.it                          ferrara@di.unimi.it              montanelli@di.unimi.it

ABSTRACT

In crowdsourcing, a range task is a type of creation task where only free answers belonging to the numeric domain are accepted/possible. In this paper, we present the median-on-agreement (ma) techniques based on statistical and consensus-based mechanisms for determining the result of range tasks. The ma techniques are characterized by i) the distinction between the group of workers that agree on the task result (i.e., workers in the consensus) and the group that disagree, and ii) the calculation of the final task answer through a median-based mechanism where only the answers of workers in the consensus are considered.

Keywords

crowdsourcing, consensus evaluation, range task management

1.  INTRODUCTION

In recent years, crowdsourcing systems have gained growing popularity as powerful solutions for the execution of complex, time-consuming activities where the contribution of human workers can be decisive and automatic procedures are not completely effective, such as collaborative filtering and web-resource tagging. Usually, in this kind of system, crowd workers are involved in decision tasks, where they are called to select the most appropriate answer from a set of predefined alternatives (e.g., [9]). In a conventional scenario, multiple workers participate in the execution of a task; multiple answers are thus collected, and the final result is derived by assessing the level of agreement between the different answers and by deciding whether a consensus has been reached [1, 3]. The use of crowdsourcing systems is now being proposed also for the resolution of so-called creation tasks, in which the task answer can be any kind of worker-generated content, such as a free-text answer, a drawing, or another visual/multimedia artifact. This task type enables the worker to express her/his creativity, thus enabling crowdsourcing to become a mechanism for collaborative knowledge creation. However, in creation tasks, the problem of choosing the final task result among all the available worker answers is even more challenging than for decision tasks, especially when the task question is intrinsically subjective and a factual answer is neither possible nor appropriate (e.g., a labeling task in which the worker is called to provide a featuring keyword for a group of web images).

In this paper, we focus on range tasks, namely a type of creation task where only free answers belonging to the numeric domain are accepted/possible [1]. We propose the median-on-agreement (ma) techniques based on statistical and consensus-based mechanisms. In particular, the ma techniques are conceived to address range-task resolution when multiple crowd workers are involved in the execution of each task. Each worker autonomously and independently executes a task; thus, a number of different answers is produced. Based on these answers, the ma techniques allow i) distinguishing the group of workers that agree on the task result (i.e., workers in the consensus) from the group that disagree, and ii) calculating the final task answer through a median-based mechanism where only the answers of workers in the consensus are considered. The application of the ma techniques to the Argo crowdsourcing system is presented, as well as experimental results against the main state-of-the-art approaches for range-task resolution.

The paper is organized as follows. In Section 2, we illustrate motivations and related work. The ma techniques are presented in Section 3. In Section 4, the application of ma to Argo is discussed. In Section 5, experimental results on a real crowdsourcing case-study are presented. Concluding remarks are provided in Section 6.

2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

2.  MOTIVATING SCENARIO

Consider the scenario described in [6], where a crowdsourcing approach is proposed for estimating the amount of calories in a meal. In [6], a task is characterized by a picture of a dish, and a worker receiving a task to execute is asked to insert a numeric value corresponding to her/his calorie estimation based on the given picture.

This is an example of a range task, in that a worker receiving a task to execute can only provide a free numeric answer, namely an integer or decimal value, based on her/his personal point of view, knowledge, perception, and expertise. This means that no predefined options/suggestions are available, and workers are called to independently and autonomously provide their own task answer.
Moreover, the real amount of calories in a dish (i.e., in a task) is not available/known, and only a collective answer is possible [3]. This means that crowdsourcing has the goal of providing a result that represents the so-called "wisdom of the crowd", in which the reliability of a task result is determined by its credibility: the higher the consensus among workers on an answer, the higher the answer reliability.

An intuitive and popular solution for range-task resolution is to employ a mean-based approach in which multiple workers are involved in the execution of each task and the arithmetic mean of the whole set of worker answers is provided as the final result [5]. The main drawbacks of a mean-based approach are illustrated by Francis Galton in [4], where the use of the arithmetic mean for computing the result of a range task is deprecated, since it

    would give a voting power to "cranks" in proportion to their crankiness. One absurdly large or small estimate would leave a greater impress on the result than one of reasonable amount, and the more an estimate diverges from the bulk of the rest, the more influence would it exert.

In other words, the numeric answer of a single worker that diverges (i.e., is very different) from the other, more-or-less equivalent worker answers has a strong influence on the final task result. This means that a single worker can auto-determine her/his impact on the task result independently of her/his trustworthiness. This is especially true when the group of workers involved in a task execution is small (i.e., 5-10 workers per group) and malicious or inaccurate workers can be involved, as usually occurs in real systems.

Further work on the resolution of range tasks is presented in [7]. This contribution is in the field of QoE (Quality of Experience), where workers are asked to provide an evaluation of their experience with a service (e.g., web browsing, phone call, TV broadcast). The authors propose a technique called CrowdMOS (i.e., Crowdsourcing Mean Opinion Score) based on the analysis of the answer distribution provided by workers. The high subjectivity/uncertainty of the considered tasks motivates the use of a random-effects model for determining the task result. However, only random variables based on a normal distribution (i.e., a symmetric distribution) can be used for representing errors; thus, other statistical distributions are not supported.

In the following, we propose consensus-based techniques for managing range-task resolution based on two main contributions. First, the use of the median value (instead of the arithmetic mean) to determine the task result that is representative of the multiple answers collected from the involved workers. Second, the use of consensus as a mechanism for distinguishing workers that agree on the task result from workers that disagree and represent a sort of outlier position.

3.  THE MEDIAN-ON-AGREEMENT TECHNIQUES

Consider a range task T assigned to a group of workers G = {w1, ..., wn} providing a set of answers A = {a1, ..., an}, where ak ∈ A is the numeric answer provided by the worker wk ∈ G. Range-task resolution according to the ma techniques is articulated in two main steps, identification of the support group and definition of the final task result, described in the following.

Identification of the support group. We call GCA1 ⊆ G the support group of G, namely the group of workers that agree on the task result. Two workers agree on the task result when they provide a similar numeric answer, meaning that the values provided in the task answer are near in comparison with the overall range of answers A provided by all the workers in G. We call ACA1 ⊆ A the set of task answers provided by the workers in GCA1. Consider the median value mA of all the provided worker answers A. The group GCA1 is progressively built by including workers that provided an answer close to mA, namely:

    1. Compute the median mA over the whole set of worker answers A and define GCA1 = ∅, ACA1 = ∅.

    2. Select the worker answer ak ∈ A which is nearest to mA. Insert ak in ACA1 and insert the worker wk in the support group GCA1.

    3. The coefficient of variation cv is exploited to decide whether an answer ak ∈ A is near enough to mA to be included in GCA1. To this end, cv is calculated over the set of answers in ACA1:

           cv(ACA1) = sqrt( (1/|ACA1|) Σ_{i=1}^{|ACA1|} (ai − µACA1)² ) / µACA1

       where |ACA1| is the number of answers in ACA1, ai represents the i-th worker answer in ACA1, and µACA1 represents the arithmetic mean of the answers in ACA1.

    4. The insertion of workers in GCA1 is repeated as long as the coefficient of variation over the answers in ACA1 is lower than a threshold thcv (i.e., go back to step 2 if cv(ACA1) < thcv). Otherwise, remove the last-inserted item from GCA1 and ACA1 and continue with the next step.

    5. Create the set GCA2 = G \ GCA1 containing the workers that are not in the support group. Analogously, the set ACA2 = A \ ACA1 is created as well.

Definition of the final task result. The final task result Ā is defined as the median value calculated over the set of worker answers ACA1, namely Ā = mACA1.

Example. Consider a task T1 where workers are asked to guess the distance in kilometers between the two Italian cities Caserta and Siena (the real distance is 352 km). Consider the following set of worker answers: A = {300, 300, 301, 301, 350, 351, 351, 351, 351, 400, 408, 408, 450, 500, 600, 600, 600, 650, 700, 1500}. The median value over the whole set of worker answers is mA = 404. According to ma, we consider a threshold for the coefficient of variation thcv = 0.15 and we identify the support group GCA1 shown in Figure 1. With this support group, the median value of the answers provided by workers in the support group is returned as the final task result: Ā = mACA1 = 351.

4.  APPLICATION TO THE ARGO SYSTEM

The ma techniques have been implemented in the Argo crowdsourcing platform (http://island.ricerca.di.unimi.it/projects/argo/ (Italian language)).
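To make the two steps of Section 3 concrete, the support-group construction and the median-based result can be sketched in Python. This is a minimal sketch over plain lists of numeric answers, not the Argo implementation; function names are ours, and answers are assumed to be positive (the coefficient of variation divides by the mean).

```python
import statistics

def coefficient_of_variation(values):
    # cv = population standard deviation over arithmetic mean, as in Section 3.
    # Assumes a non-zero (positive) mean.
    return statistics.pstdev(values) / statistics.fmean(values)

def ma_resolve(answers, th_cv=0.15):
    """Median-on-agreement sketch: build the support set A_CA1 around the
    overall median m_A, then return the median of A_CA1 as the task result."""
    m_a = statistics.median(answers)          # step 1: overall median
    remaining = list(answers)
    support = []                              # A_CA1
    while remaining:
        # step 2: pick the remaining answer nearest to the overall median
        nearest = min(remaining, key=lambda a: abs(a - m_a))
        remaining.remove(nearest)
        support.append(nearest)
        # steps 3-4: stop once cv reaches th_cv, discarding the answer
        # whose insertion broke the bound
        if coefficient_of_variation(support) >= th_cv:
            remaining.append(support.pop())
            break
    # step 5: 'remaining' now plays the role of A_CA2 = A \ A_CA1
    return statistics.median(support), support

# Example task of Section 3 (Caserta-Siena, real distance 352 km)
A = [300, 300, 301, 301, 350, 351, 351, 351, 351, 400, 408, 408,
     450, 500, 600, 600, 600, 650, 700, 1500]
result, support = ma_resolve(A, th_cv=0.15)
# result is 351, the paper's final task result Ā = mACA1
```

On this answer set the overall median is 404, the outliers 300 and 1500 stay outside the support set, and the returned median is 351, matching the worked example of Section 3.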
[Figure 1: Identification of support group in ma]

In Argo, range-task resolution is enforced through consensus-based evaluation techniques and trustworthiness-based worker management, relying on our experience and research results in this field [3].

Consensus-based evaluation of range tasks. For consensus evaluation, Argo employs a weighted-voting mechanism called supermajority, where the answer of a worker wk has a weight corresponding to her/his trustworthiness. Supermajority is based on the verification of two constraints, called the quorum constraint (q) and the balance-of-power constraint (bop). The q-constraint verifies that the task result Ā is supported by a group of workers GCA1 with enough weight (i.e., trustworthiness) to satisfy a given quorum q ∈ [0.51, 1]. The bop-constraint verifies that a single worker cannot shift the majority from one answer to another just by changing her/his own task answer [3]. This means that the support group GCA1 still satisfies the q-constraint even if a worker is shifted from GCA1 to GCA2. A task is committed on the task result Ā when the supermajority constraints are satisfied (i.e., consensus is verified). On the opposite, when the supermajority constraints are not satisfied, the task remains uncommitted. In this case, the task should be re-executed or considered as failed.

Trustworthiness-based worker management. The Argo system aims at taking into account not only the mere effort workers spend in executing tasks, but also the quality of the effort provided. A worker W is characterized by a worker score σW and a worker trustworthiness τW.

The worker score σW represents the worker revenue, composed of i) a salary, the payment the worker receives each time she/he executes a task, regardless of the consensus verification, and ii) an award, a bonus the worker receives each time she/he contributes to commit a task.

The worker trustworthiness τW ∈ [0, 1] is defined to capture the worker's ability to foster task commitment, and it is based on the worker's history in executing tasks. At the beginning of the crowdsourcing activities (time t = 0), the worker trustworthiness τW is set to an initial value τW⁰ = τ0. Each time a task T is committed (time t + 1), the trustworthiness of a worker W ∈ G is updated. In particular, the worker trustworthiness increases (i.e., τW^(t+1) > τW^t) when the worker belongs to the support group (i.e., W ∈ GCA1), thus confirming her/his ability to foster task commitment in the last-executed task T. On the opposite, the worker trustworthiness decreases when the worker is not in the support group (i.e., W ∉ GCA1).

5.  EXPERIMENTAL RESULTS

For the evaluation of the proposed ma techniques, we consider the geo-dis case-study for crowdsourcing the geographic distance between pairs of Italian cities.

The experiment has been executed by relying on the Argo prototype. We collected a dataset of 120 Italian cities with their geographic coordinates, extracted from the FreeBase (http://www.freebase.com) open repository. We built a set of 634 tasks, each one asking for the distance between a pair of different cities. The experimentation on geo-dis was conducted with a crowd of 585 workers selected from a class of master-degree students (the average worker age is 21). For task resolution, we asked the workers to rely on their personal knowledge, and we set the allowed time to perform a task to a maximum of 15 minutes. In the experimentation, the Argo prototype has been configured as follows: i) initial worker trustworthiness τ0 = 0.5; ii) group size sG = 20; iii) quorum value q = 0.51; iv) worker salary s = 0.1 and worker award a = 1.

The evaluation is based on two different experiments over the geo-dis case-study. The former presents a comparison of the ma techniques implemented in the Argo system (maArgo) against other state-of-the-art techniques for range-task resolution. The latter evaluates the crowdsourcing cost of the ma techniques by measuring the number of committed/uncommitted tasks.

Comparison against state-of-the-art techniques. We compare maArgo against the following competitor techniques:

Overall arithmetic mean µO. This method refers to the classical approach proposed in [10], where the result of a task T is given by computing the arithmetic mean over all the obtained answers.

Outlier-cleaned arithmetic mean - Standard Deviation µ2SD. This method consists in applying a classical outlier-removal technique based on the standard deviation (2SD) [8] to the set of answers of a task T. After removal of the outliers, the arithmetic mean is computed over the remaining answers.

Outlier-cleaned arithmetic mean - Median Rule µMR. This method consists in applying a more recent outlier-removal technique based on the median rule [2] to the set of answers of a task T. After removal of the outliers, the arithmetic mean is computed over the remaining answers.

Overall median mO. This method consists in computing the result of a task T as the median value of all the provided answers. As far as we know, no state-of-the-art techniques based on the median value are available. However, we compare maArgo against mO since this is the natural baseline for our ma techniques.

In the evaluation, we consider maArgo under three configurations characterized by different thresholds for the coefficient of variation thcv. Results are evaluated through the average-error and the average-error-with-outlier-removal mechanisms. In the average-error mechanism, for each task T, the evaluation considers the error between the distance estimation in the crowdsourcing result Ā and the real distance between the two cities contained in T. The average error ε̄A is calculated as:

    ε̄A = ( Σ_{i=1}^{n} |Āi − Ri| ) / |T|

where n = |T| is the overall number of tasks, Āi is the crowd-
sourcing result of the task Ti, and Ri is the real distance between the pair of cities in the task Ti, calculated through the geodesic distance. In the average-error-with-outlier-removal mechanism, the error evaluation follows the same approach as the ε̄A calculation, but outliers are removed according to the conventional criterion based on the standard deviation (2SD) [8].

The results of this experiment are presented in Table 1.

                Table 1: Comparison of results

                                  ε̄ (Km)    ε̄c (Km)
          µO                    666206.10    9558.12
          µ2SD                      48.44      40.98
          µMR                       19.71      14.00
          mO                        18.10      11.21
          maArgo (thcv = 0.25)      12.89       6.71
          maArgo (thcv = 0.15)       9.15       5.18
          maArgo (thcv = 0.05)       2.69       1.35

The first consideration concerns the result of the µO technique. The fact that the obtained average error ε̄ is so high is mainly due to the presence of malicious workers in a very high number of groups. These malicious workers gave completely wrong answers (e.g., 10 million kilometers as the distance between Rome and Milan) that have a very serious impact on the task result when the arithmetic mean is considered and outlier removal is not performed. We note that the median-based techniques (i.e., mO and maArgo) provide better results than the techniques based on the standard deviation. We argue that this is due to the assumption of a symmetric distribution used in µO, µ2SD, and µMR, which is usually false (e.g., see the task presented in Figure 1). As a general remark, we observe that the median-based solutions provide better results than the mean-based techniques even without the outlier-removal phase. By considering the maArgo results with the different thresholds on the coefficient of variation, we note that the lower the threshold thcv, the lower the average error ε̄. This means that a more restrictive mechanism for determining the support group GCA1 increases the accuracy of the obtained results.

Analysis of the task commitment. We observed that a low value of thcv produces a low average error ε̄. However, on the opposite, a low value of thcv also produces a high number of uncommitted tasks, and thus high expenses for the crowdsourcing execution. For this reason, in this experiment, we analyze the number of committed tasks when different thresholds on the coefficient of variation are considered. To this end, we define the commitment ratio as follows:

    c = Nc / (Nc + Nu)

where Nc is the number of committed tasks and Nu is the number of uncommitted tasks.

The commitment ratios for different coefficient-of-variation thresholds thcv are presented in Table 2.

                Table 2: Commitment evaluation

                        #Committed      c
          thcv = 0.25      624       98.4%
          thcv = 0.20      609       96.1%
          thcv = 0.15      565       89.1%
          thcv = 0.10      507       80.0%
          thcv = 0.05      422       66.6%

We note that the lower the coefficient-of-variation threshold, the lower the commitment value. This behavior is motivated by the fact that the lower the coefficient of variation, the more restrictive the mechanism for determining the support group GCA1. As a result, it is important to configure the crowdsourcing execution by tuning the threshold thcv with the goal of setting the desired tradeoff between the accuracy of results and the commitment ratio. In the geo-dis case study, the threshold value thcv = 0.15 provides the best tradeoff between accuracy (i.e., almost twice the accuracy with respect to the other threshold values) and commitment ratio (i.e., c ≈ 90%).

6.  CONCLUDING REMARKS

In this paper, we presented the ma techniques for range-task resolution in crowdsourcing systems. The application to the Argo system, as well as experimental results on a real case-study, are provided to show the contribution of the proposed solution with respect to the state-of-the-art. Ongoing work is focused on the so-called task-routing problem, with the goal of specifying a family of configuration patterns for dynamically choosing the most appropriate group of workers for the assignment of a given task, based on worker expertise and knowledge.

7.  REFERENCES

[1] A. Bozzon, M. Brambilla, S. Ceri, and A. Mauri. Reactive Crowdsourcing. In Proc. of the 22nd Int. World Wide Web Conference (WWW 2013), pages 153-164, Rio de Janeiro, Brazil, 2013.
[2] K. Carling. Resistant Outlier Rules and the Non-Gaussian Case. Computational Statistics & Data Analysis, 33(3):249-258, 2000.
[3] S. Castano, A. Ferrara, L. Genta, and S. Montanelli. Combining Crowd Consensus and User Trustworthiness for Managing Collective Tasks. Future Generation Computer Systems, 54, 2016.
[4] F. Galton. One Vote, One Value. Nature, 75:414, 1907.
[5] T. W. Malone, R. Laubacher, and C. Dellarocas. The Collective Intelligence Genome. IEEE Engineering Management Review, 38(3), 2010.
[6] J. Noronha, E. Hysen, H. Zhang, and K. Z. Gajos. Platemate: Crowdsourcing Nutritional Analysis from Food Photographs. In Proc. of the 24th Symposium on User Interface Software and Technology, pages 1-12, Santa Barbara, CA, USA, 2011.
[7] F. P. Ribeiro, D. A. F. Florêncio, C. Zhang, and M. L. Seltzer. CrowdMOS: An Approach for Crowdsourcing Mean Opinion Score Studies. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 2416-2419, Prague, Czech Republic, 2011.
[8] S. Seo. A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets. PhD thesis, University of Pittsburgh, Pennsylvania, USA, 2006.
[9] C. Sun, N. Rampalli, F. Yang, and A. Doan. Chimera: Large-scale Classification Using Machine Learning, Rules, and Crowdsourcing. Proceedings of the VLDB Endowment, 7(13), 2014.
[10] J. Surowiecki. The Wisdom of Crowds. Random House LLC, 2005.