Using Crowdsourcing to Compare Document Recommendation Strategies for Conversations

Maryam Habibi, Idiap Research Institute and EPFL, Rue Marconi 19, CP 592, 1920 Martigny, Switzerland, maryam.habibi@idiap.ch
Andrei Popescu-Belis, Idiap Research Institute, Rue Marconi 19, CP 592, 1920 Martigny, Switzerland, andrei.popescu-belis@idiap.ch

ABSTRACT
This paper explores a crowdsourcing approach to the evaluation of a document recommender system intended for use in meetings. The system uses words from the conversation to perform just-in-time document retrieval. We compare several versions of the system, including the use of keywords, retrieval using semantic similarity, and the possibility for user initiative. The system's results are submitted for comparative evaluation to workers recruited via a crowdsourcing platform, Amazon's Mechanical Turk. We introduce a new method, Pearson Correlation Coefficient-Information Entropy (PCC-H), to abstract over the quality of the workers' judgments and produce system-level scores. We measure the workers' reliability by the inter-rater agreement of each of them against the others, and use entropy to weight the difficulty of each comparison task. The proposed evaluation method is shown to be reliable, and the results show that adding user initiative improves the relevance of recommendations.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Query formulation, Retrieval models; H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance evaluation

General Terms
Evaluation, Uncertainty, Reliability, Metric

Keywords
Document recommender system, user initiative, crowdsourcing, Amazon Mechanical Turk, comparative evaluation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright is held by the author/owner(s). Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012), held in conjunction with ACM RecSys 2012, September 9, 2012, Dublin, Ireland.

1.   INTRODUCTION
A document recommender system for conversations provides suggestions for potentially relevant documents within a conversation, such as a business meeting. Used as a virtual secretary, the system constantly retrieves documents that are related to the words of the conversation, using automatic speech recognition, but users could also be allowed to make explicit queries. Such a system builds upon previous approaches known as implicit queries, just-in-time retrieval, or zero query terms, which were recently confirmed as a promising research avenue [1].

Evaluating the relevance of recommendations produced by such a system is a challenging task. Evaluation in use requires the full deployment of the system and the setup of numerous evaluation sessions with realistic meetings. That is why alternative solutions based on simulations are important to find. In this paper, we propose to run the document recommender system over a corpus of conversations and to use crowdsourcing to compare the relevance of its results in various configurations of the system.

A crowdsourcing platform, here Amazon's Mechanical Turk, is helpful for several reasons. First, we can evaluate a large amount of data in a fast and inexpensive manner. Second, workers are sampled from the general public, which might represent a more realistic user model than the system developers, and they have no contact with each other. However, in order to use workers' judgments for relevance evaluation, we have to circumvent the difficulties of measuring the quality of their evaluations, and factor out the biases of individual contributions.

We define an evaluation protocol using crowdsourcing which estimates the quality of the workers' judgments by predicting task difficulty and worker reliability, even if no ground truth to validate the judgments is available. This approach, named Pearson Correlation Coefficient-Information Entropy (PCC-H), is inspired by previous studies of inter-rater agreement as well as by information theory.

This paper is organized as follows. Section 2 describes the document recommender system and the different versions which will be compared. Section 3 reviews previous research on measuring the quality of workers' judgments for relevance evaluation and labeling tasks using crowdsourcing. Section 4 presents our design of the evaluation micro-tasks, or "Human Intelligence Tasks", for Amazon's Mechanical Turk. Section 5 explains the proposed PCC-H method for measuring the quality of judgments. Section 6 presents the results of our evaluation experiments, which on the one hand validate the proposed method, and on the other hand indicate the comparative relevance of the different versions of the recommender system.




2.   OUTLINE OF THE DOCUMENT RECOMMENDER SYSTEM
The document recommender system under study is the Automatic Content Linking Device (ACLD [15, 16]), which uses real-time automatic speech recognition [8] to extract words from a conversation in a group meeting. The ACLD filters and aggregates the words to prepare queries at regular time intervals. The queries can be addressed to a local database of meeting-related documents, including transcripts of past meetings if available, but also to a web search engine. The results are then displayed in an unobtrusive manner to the meeting participants, who can consult them if they find them relevant and purposeful.

Since it is difficult to assess the utility of recommended documents from an absolute perspective, we aim instead at comparing variants of the ACLD, in order to assess the improvement (or lack thereof) due to various designs. Here, we compare four different approaches to the recommendation problem, which is in all cases a cold-start problem, as we do not assume any knowledge about the participants. Rather, in a pure content-based manner, the ACLD simply aims to find the documents closest to a given stretch of conversation.

The four compared versions are the following. Two "standard" versions as in [15] differ by the filtering procedure applied to the conversation words. One of them (noted AW) uses all the words (except stop words) spoken by users during a specific period (typically, 15 s) to retrieve related documents. The other one (noted KW) filters the words, keeping only keywords from a pre-defined list related to the topic of the meeting.

Two other methods depart from the initial system. One of them implements semantic search (noted SS [16]), which uses a graph-based semantic relatedness measure to perform retrieval. The most recent version allows user initiative (noted UI), that is, it can answer explicit queries addressed by users to the system, with results replacing spontaneous recommendations for one time period. These queries are processed by the same ASR component, with participants using a specific name for the system ("John") to solve the addressing problem.
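To make the AW and KW filtering concrete, the sketch below shows how a query could be assembled from the words transcribed in one 15-second window. It is a minimal illustration of the filtering step described above: the function and variable names are ours, not part of the ACLD code, and the stop-word and topic keyword lists are assumed to be given.

    def build_query(window_words, mode, stop_words, topic_keywords):
        """Assemble an implicit query from the words spoken in one time window.
        AW keeps all non-stop words; KW keeps only words from a pre-defined
        list of topic keywords."""
        words = [w.lower() for w in window_words]
        if mode == "AW":
            kept = [w for w in words if w not in stop_words]
        elif mode == "KW":
            kept = [w for w in words if w in topic_keywords]
        else:
            raise ValueError("mode must be 'AW' or 'KW'")
        return " ".join(kept)

The resulting query string would then be sent either to the local document index or to the web search engine, as described above.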
In the evaluation experiments presented here, we only use human transcriptions of the meetings, in order to focus on the evaluation of the retrieval strategy itself. We use one meeting (ES2008b) from the AMI Meeting Corpus [6], in which the design of a new remote control for a TV set is discussed. The explicit user requests for the UI version are simulated by modifying the transcript at 24 different locations where we believe that users are likely to ask explicit queries; a more principled approach to this simulation is currently under study. We restrict the search to the Wikipedia website, mainly because the semantic search system is adapted to this data, using a local copy of it (WEX) that is semantically indexed. Wikipedia is one of the most popular general reference works on the Internet, and recommendations over it are clearly of high potential interest. Alternatively, all our systems (except the semantic one) could also be run with unrestricted web searches via Google, or limited to other web domains or websites.

The 24 fragments of the meeting containing the explicit queries are submitted for comparison. That is, we want to know which of the results displayed by the various versions at the moment following the explicit query are considered most relevant by external judges. As the method allows only binary comparisons, as we will now describe, we compare UI with the AW and KW versions, and then SS with KW.

3.   RELATED WORK
Relevance evaluation is a difficult task because it is subjective and expensive to perform. Two well-known methods for relevance evaluation are the use of a click-data corpus or the use of human experts [18]. However, in our case, producing click data or hiring professional workers for relevance evaluation would both be overly expensive. Moreover, it is not clear that evaluation results provided by a narrow range of experts would generalize to a broader range of end users. In contrast, crowdsourcing, or peer collaborative annotation, is relatively easy to prototype and to test experimentally, and provides a cheap and fast approach to explicit evaluation. However, it is necessary to consider some problems associated with this approach, mainly the reliability of the workers' judgments (including spammers) and the intrinsic knowledge of the workers [3].

Recently, many studies have considered the effect of task design on relevance evaluation, and proposed design solutions to decrease the time and cost of evaluation and to increase the accuracy of results. In [9], several human factors are considered, namely query design, terminology and pay, along with their impact on the cost, time and accuracy of annotations. To collect proper results, the effects of user interface guidelines, inter-rater agreement metrics and justification analysis were examined in [2], showing e.g. that asking workers to write a short explanation in exchange for a bonus is an efficient method for detecting spammers. In addition, in [11], different batches of tasks were designed to measure the effect of pay, required effort and worker qualifications on the accuracy of the resulting labels. Another paper [13] studied how the distribution of correct answers in the training data affects worker responses, and suggested using a uniform distribution to avoid biases from unethical workers.

The Technique for Evaluating Relevance by Crowdsourcing (TERC, see [4]) emphasizes the importance of qualification control, e.g. by creating qualification tests that must be passed before performing the actual task. However, another study [2] showed that workers may still perform tasks randomly even after passing qualification tests. Therefore, it is important to perform partial validation of each worker's tasks, and to weight the judgments of several workers to produce aggregate scores [4].

Several other studies have focused on the Amazon Mechanical Turk crowdsourcing platform and have proposed techniques to measure the quality of workers' judgments when there is no ground truth to verify them directly [17, 19, 7, 10, 12]. For instance, in [5], the quality of judgments for a labeling task is measured using inter-rater agreement and majority voting. Expectation maximization (EM) has sometimes been used to estimate true labels in the absence of ground truth, e.g. in [17] for an image labeling task. In order to improve EM-based estimation of the reliability of workers, the confidence of workers in each of their judgments was used in [7] as an additional feature, the task being dominance level estimation for participants in a conversation. As the performance of the EM algorithm is not guaranteed, a new method [10] was introduced to estimate reliability based on low-rank matrix approximation.




All of the above-mentioned studies assume that tasks share the same level of difficulty. To model both task difficulty and worker reliability, an EM-based method named GLAD was proposed in [19] for an image labeling task. However, this method is sensitive to the initialization value, hence a good estimation of labels requires a small amount of data with ground-truth annotation [12].

4.   SETUP OF THE EXPERIMENT
Amazon's Mechanical Turk (AMT) is a crowdsourcing platform which gives access to a vast pool of online workers paid by requesters to complete human intelligence tasks (HITs). Once HITs are designed and published, registered workers who fulfill the requesters' selection criteria are invited by the AMT service to work on them in exchange for a small amount of money per HIT [3].

As it is difficult to find an absolute relevance score for each version of the ACLD recommender system, we only aim for a comparative relevance evaluation between versions. For each pair of versions, a batch of HITs was designed with their results. Each HIT (see the example in Fig. 1) contains a fragment of the conversation transcript with the two lists of document recommendations to be compared. Only the first six recommendations are kept for each version. The lists from the two compared versions are placed in random positions (first or second) across HITs, to avoid biases from a constant position.

We experimented with two different HIT designs. The first one offers evaluators a binary choice: either the first list is considered more relevant than the second, or vice-versa. In other words, workers are obliged to express a preference for one of the two recommendation sets. This encourages decisions, but may of course be inappropriate when the two answers are of comparable quality, though this may be evened out when averaging over workers. The second design gives workers four choices (as in Figure 1): in addition to the previous two options, they can indicate either that both lists seem equally relevant, or equally irrelevant. In both designs, workers must select exactly one option.

Figure 1: Snapshot of a 4-choice HIT: workers read the conversation transcript, examine the two answer lists (with recommended documents for the respective conversation fragment) and select one of the four comparative choices (#1 better than #2, #2 better than #1, both equally good, both equally poor). A short comment can be added.

To assign a value to each worker's judgment, a binary coding scheme is used in the computations below, assigning a value of 1 to the selected option and 0 to all others. The relevance value RV of each recommendation list for a meeting fragment is computed by giving a weight to each worker's judgment and averaging them. The Percentage of Relevance Value, noted PRV, shows the relevance value of each compared system, and is computed by assigning a weight to each part of the meeting and averaging the relevance values RV over all meeting fragments.

There are 24 meeting fragments, hence 24 HITs in each batch comparing a pair of systems, for UI vs. AW and UI vs. KW. As user queries are not needed for comparing SS vs. KW, we designed 36 HITs for that pair, with 30-second fragments each. There are 10 workers per HIT, so there are 240 assignments in total for UI-vs-KW and for UI-vs-AW (with a 2-choice and a 4-choice design for each), and 360 for SS-vs-KW. As workers are paid 0.02 USD per HIT, the cost of the five separate experiments was 33 USD, with an apparent average hourly rate of 1.60 USD. The average time per assignment is almost 50 seconds. All five tasks took only 17 hours to be performed by workers via AMT. For qualification control, we only allowed workers with an approval rate greater than 95% or with more than 1000 approved HITs.
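As an illustration of this coding scheme, the judgments of one batch can be stored in a three-dimensional 0/1 array indexed by worker, fragment and option; the array name, the shapes and the equal-weight baseline below are a sketch for this paper's setup, not an actual AMT output format.

    import numpy as np

    W, Q, A = 10, 24, 4      # workers, meeting fragments, options (4-choice design)
    # X[w, q, a] = 1 if worker w selected option a for fragment q, else 0
    X = np.zeros((W, Q, A), dtype=int)
    X[0, 5, 0] = 1           # e.g., worker 0 judged list #1 more relevant for fragment 5

    # Equal-weight baseline (all workers and all fragments weighted equally):
    RV = X.mean(axis=0)      # relevance value of each option per fragment, shape (Q, A)
    PRV = RV.mean(axis=0)    # percentage of relevance value per option, shape (A,)

Section 5 replaces these two uniform averages with a reliability-weighted average over workers and a difficulty-weighted average over fragments.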
5.   THE PCC-H METHOD
Majority voting is frequently used to aggregate multiple sources of comparative relevance evaluation. However, it assumes that all HITs share the same difficulty and that all workers are equally reliable. We take into account here the task difficulty W_q and the workers' reliability r_w, as it was shown that they have a significant impact on the quality of the aggregated judgments. We thus introduce a new computation method called PCC-H, for Pearson Correlation Coefficient-Information Entropy.

5.1   Estimating Worker Reliability
The PCC-H method computes the W_q and r_w values in two steps. In the first step, PCC-H estimates the reliability r_w of each worker, based on the Pearson correlation of that worker's judgments with the average of all the other workers' judgments, as in Eq. 1:

    r_w = \frac{\sum_{a=1}^{A} \sum_{q=1}^{Q} (X_{wqa} - \bar{X}_{wa})(Y_{qa} - \bar{Y}_{a})}{(Q - 1)\, S_{X_{wa}}\, S_{Y_{a}}}    (1)

In Equation 1, Q is the number of meeting fragments, A is the number of options per HIT, and X_wqa is the value that worker w assigned to option a of fragment q: X_wqa is 1 if option a was selected by worker w for fragment q, and 0 otherwise. \bar{X}_wa and S_{X_wa} are the mean and standard deviation of X_wqa, respectively. Y_qa is the average value that all the other workers assigned to option a of fragment q, and \bar{Y}_a and S_{Y_a} are its mean and standard deviation.

The value of r_w computed above is then used as a weight for computing RV_qa, the relevance value of option a for each fragment q, according to Eq. 2:

    RV_{qa} = \frac{\sum_{w=1}^{W} r_w X_{wqa}}{\sum_{w=1}^{W} r_w}    (2)

For HIT designs with two options, RV_qa directly gives the relevance value of each answer list a. For the four-option HIT design, the relevance value RV_ql of each answer list l is adjusted as in Eq. 3:

    RV_{ql} \leftarrow RV_{ql} + \frac{RV_{qb}}{2} - \frac{RV_{qn}}{2}    (3)

In this equation, half of the relevance value RV_qb of the option "both lists are relevant" is added as a reward, and half of the relevance value RV_qn of the option "both lists are irrelevant" is subtracted as a penalty, for each answer list l.
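A minimal sketch of these two steps, assuming the binary judgment array X[w, q, a] introduced in Section 4. The function names are illustrative; the Pearson correlation is computed with the standard normalization rather than the (Q - 1) factor of Eq. 1; and the option order (list #1, list #2, both relevant, both irrelevant) is an assumption of the sketch.

    import numpy as np

    def worker_reliability(X):
        """Eq. 1 (approximated): Pearson correlation of each worker's 0/1
        judgments with the average judgment of all the other workers."""
        n_workers = X.shape[0]
        r = np.zeros(n_workers)
        for w in range(n_workers):
            own = X[w].ravel()                                      # this worker's judgments
            others = np.delete(X, w, axis=0).mean(axis=0).ravel()   # average of the others
            r[w] = np.corrcoef(own, others)[0, 1]
        return r

    def relevance_values(X, r):
        """Eq. 2: reliability-weighted relevance value RV[q, a] of each option."""
        return np.tensordot(r, X, axes=(0, 0)) / r.sum()

    def adjust_four_choice(RV):
        """Eq. 3: reward the two list options with half of the 'both relevant'
        value and penalize them with half of the 'both irrelevant' value."""
        return RV[:, :2] + RV[:, 2:3] / 2.0 - RV[:, 3:4] / 2.0

For a 4-choice batch, adjust_four_choice(relevance_values(X, worker_reliability(X))) would give the per-fragment relevance of the two answer lists.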
5.2   Estimating Task Difficulty
In a second step, PCC-H takes into account the task difficulty of each fragment of the meeting. The goal is to reduce the effect of fragments for which there is uncertainty in the workers' judgments, e.g. because there are no relevant search results in Wikipedia for the current fragment. To lessen the effect of this uncertainty, the entropy of the answers for each fragment of the meeting is computed, and a function of it is used as a weight for the fragment. This weight is used for computing the percentage of relevance value PRV. Entropy, weight and PRV are defined in Eqs. 4-6 below, where A is the number of options, and H_q and W_q are the entropy and weight of fragment q.




    H_q = - \sum_{a=1}^{A} RV_{qa} \log(RV_{qa})    (4)

    W_q = 1 - H_q    (5)

    PRV_a = \frac{\sum_{q=1}^{Q} W_q RV_{qa}}{\sum_{q=1}^{Q} W_q}    (6)
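A minimal sketch of Eqs. 4-6, continuing from the RV values computed in Section 5.1. The paper does not state the base of the logarithm; base 2 is assumed here so that a maximally ambiguous two-option fragment receives a weight of zero.

    import numpy as np

    def fragment_weights(RV, eps=1e-12):
        """Eqs. 4-5: entropy H_q of the relevance values of each fragment,
        turned into a weight W_q = 1 - H_q (uncertain fragments count less)."""
        H = -(RV * np.log2(RV + eps)).sum(axis=1)
        return 1.0 - H

    def percentage_relevance_value(RV, Wq):
        """Eq. 6: difficulty-weighted average of the RV values over fragments."""
        return (Wq[:, None] * RV).sum(axis=0) / Wq.sum()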
6.   RESULTS OF THE EXPERIMENTS
Two sets of experiments were performed. First, we attempt to validate the PCC-H method. Then, we apply the PCC-H method to compute the PRV of each answer list, in order to conclude which version of the system outperforms the others.

In order to make an initial validation of the workers' judgments, we compare the judgments of individual workers with those of an expert. For each worker, the number of fragments for which the answer is the same as the expert's answer is counted, and the total is divided by the number of fragments to compute an accuracy. We then compare this value with r_w, which is estimated as the reliability measure of each worker's judgments. The percentage of agreement e_w of each worker with the expert, and the r_w of each worker, are shown in Table 1 for one of the batches; the two values are in overall agreement for each worker. In other words, workers who agree more with our expert also have higher inter-rater agreement with the other workers. Since in the general case there is no ground truth (expert) to verify the workers' judgments, we rely on inter-rater agreement for the other experiments.

Table 1: Percentage of agreement between a single worker and the expert (e_w), and between a single worker and the other workers (r_w), for the KW system and 4-choice HITs

  Worker #   e_w    r_w
     1       0.66   0.81
     2       0.54   0.65
     3       0.54   0.64
     4       0.50   0.71
     5       0.50   0.60
     6       0.50   0.35
     7       0.41   0.24
     8       0.39   0.33
     9       0.36   0.34
    10       0.31   0.12
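For this validation step, the agreement with the expert can be computed directly from the judgment array; a small sketch, assuming the expert's choices are available as a one-hot array expert[q, a] with the same option layout as a single worker's judgments.

    import numpy as np

    def expert_agreement(X, expert):
        """e_w: fraction of fragments on which each worker picked the same
        option as the expert. X has shape (W, Q, A); expert has shape (Q, A)."""
        worker_choice = X.argmax(axis=2)        # option chosen per worker and fragment
        expert_choice = expert.argmax(axis=1)   # option chosen by the expert
        return (worker_choice == expert_choice).mean(axis=1)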
Firstly, equal weights for all worker evaluations and all fragments are used to compute the PRVs of the two answer lists in each experiment; the results are shown in Table 2.

Table 2: PRVs for the AW-vs-UI and KW-vs-UI pairs, with all workers and fragments weighted equally

                        2-choice HITs   4-choice HITs
  AW-vs-UI   PRV_AW          30%             26%
             PRV_UI          70%             74%
  KW-vs-UI   PRV_KW          45%             35%
             PRV_UI          55%             65%

In this approach, it is assumed that all workers are reliable and that all fragments share the same difficulty. To handle workers' reliability, we may consider workers with lower r_w as outliers. One approach is to remove all the outliers: for instance, the four workers with the lowest r_w are considered outliers and are deleted, and the same weight is given to the remaining six workers. The result of the comparative evaluation based on removing outliers is shown in Table 3.

Table 3: PRVs for the AW-vs-UI and KW-vs-UI pairs, with the six most reliable workers and all fragments weighted equally

                        2-choice HITs   4-choice HITs
  AW-vs-UI   PRV_AW          24%             13%
             PRV_UI          76%             86%
  KW-vs-UI   PRV_KW          46%             33%
             PRV_UI          54%             67%

In the computation above, an arbitrary border between outliers and the other workers was defined as a decision boundary for removing outliers. However, instead of deleting workers with lower r_w, who might still have potentially useful insights on relevance, it is more rational to weight all workers' judgments by a confidence value. The PRVs of each answer list for the four experiments, obtained by assigning weight r_w to each worker's evaluation and equal weights to all meeting fragments, are shown in Table 4.

Table 4: PRVs for the AW-vs-UI and KW-vs-UI pairs, with workers weighted by r_w and fragments weighted equally

                        2-choice HITs   4-choice HITs
  AW-vs-UI   PRV_AW          24%             18%
             PRV_UI          76%             82%
  KW-vs-UI   PRV_KW          33%             34%
             PRV_UI          67%             66%




In order to show that our method is stable across different HIT designs, we used two different HIT designs for each pair, as mentioned in Section 4, and we examine whether the PRV converges to the same value for each pair regardless of the design. As observed in Table 4, the PRVs of the AW-vs-UI pair are not quite similar across the two HIT designs, although the answer lists are the same. In fact, we observed that, in several cases, there was no strong agreement among workers on which answer list was more relevant to a given meeting fragment, and we consider these to be "difficult" fragments. Since the source of the uncertainty is undefined, we can reduce the effect of such a fragment on the comparison by weighting each fragment in proportion to the difficulty of assigning RV_ql. The PRV values thus obtained for all experiments are shown in Table 5. As shown there, the PRVs of the AW-vs-UI pair are now very similar for the 2-choice and 4-choice HITs. Moreover, the difference between the system versions is emphasized, which indicates that the sensitivity of the comparison method has increased.

Table 5: PRVs for the AW-vs-UI and KW-vs-UI pairs, with workers weighted by r_w and fragments weighted by W_q (the PCC-H method)

                        2-choice HITs   4-choice HITs
  AW-vs-UI   PRV_AW          19%             15%
             PRV_UI          81%             85%
  KW-vs-UI   PRV_KW          23%             26%
             PRV_UI          77%             74%

Moreover, we compare the PCC-H method with the majority voting method and with the GLAD method (Generative model of Labels, Abilities, and Difficulties [19]), which also estimates comparative relevance values by considering task difficulty and worker reliability parameters. We ran the GLAD algorithm with the same initial values for all four experiments. The PRVs computed by majority voting, GLAD and PCC-H are shown in Table 6.

Table 6: PRVs computed by the majority voting, GLAD, and PCC-H methods (in that order)

                        2-choice HITs        4-choice HITs
  AW-vs-UI   PRV_AW     30%, 23%, 19%        26%, 13%, 15%
             PRV_UI     70%, 77%, 81%        74%, 87%, 85%
  KW-vs-UI   PRV_KW     45%, 47%, 23%        35%, 23%, 26%
             PRV_UI     55%, 53%, 77%        65%, 77%, 74%

As shown in Table 6, the PRVs computed by the PCC-H method for both HIT designs are very close to those of GLAD for the 4-choice HIT design. Moreover, the PRV values obtained by the PCC-H method for the two HIT designs are very similar to each other, which is less the case for majority voting and GLAD. This means that the PCC-H method is able to compute the PRVs independently of the exact HIT design. Furthermore, the PRV values calculated using PCC-H are more robust because the proposed method does not depend on initialization values, as GLAD does. Therefore, using PCC-H for measuring the reliability of workers' judgments is also an appropriate method for the qualification control of workers on crowdsourcing platforms.

The proposed method is also applied to the comparative evaluation of the SS-vs-KW search results (semantic search vs. keyword-based search). The PRVs are calculated by three different methods, as shown in Table 7: majority voting, which considers all workers and fragments with the same weight; GLAD; and PCC-H. The SS version outperforms the KW version according to all three scores.

Table 7: PRVs for SS-vs-KW computed by the majority voting, GLAD, and PCC-H methods (in that order)

                        4-choice HITs
  SS-vs-KW   PRV_SS     88%, 88%, 93%
             PRV_KW     12%, 12%,  7%

7.   CONCLUSION AND PERSPECTIVES
In all the evaluation steps, the UI system appeared to produce more relevant recommendations than AW or KW, and using KW instead of AW improved the PRV by 10 percent. This means that UI, i.e. letting users ask explicit queries in the conversation, improves over the AW or KW versions, i.e. over spontaneous recommendations. Nevertheless, KW can still be used alongside the UI version, as an assistant which suggests documents based on the context of the meeting; that is, spontaneous recommendations can be made when no user initiates a search. Moreover, the SS version works better than the KW version, which shows the advantage of semantic search.

As for the evaluation method, PCC-H outperformed the GLAD method proposed earlier for estimating task difficulty and worker reliability in the absence of ground truth. Based on the evaluation results, the PCC-H method is suitable for the qualification control of AMT workers or judgments, because it provides a more stable PRV score across different HIT designs. Moreover, PCC-H does not require any initialization.

The comparative nature of PCC-H imposes some restrictions on the evaluations that can be carried out. For instance, if N versions must be compared, this calls in theory for N(N-1)/2 comparisons (e.g., 45 comparisons for N = 10), which is clearly impractical as N grows. This can be mitigated if a priori knowledge about the quality of the systems is available, to avoid redundant comparisons. Moreover, an approach proposed in [14] to reduce the number of pairwise comparisons required from human raters could be ported to our context.




For progress evaluation, a new version must be compared with the best-performing previous version, looking for a measurable improvement, in which case PCC-H fully answers the evaluation needs.

There are instances in which the search results of both versions are irrelevant. The goal of future work will be to reduce the number of such uncertain instances, to deal with ambiguous questions, and to improve the processing of user-directed queries by recognizing the context of the conversation. Another experiment should improve the design of the simulated user queries, in order to make them more realistic.

8.   ACKNOWLEDGMENTS
The authors are grateful to the Swiss National Science Foundation for its financial support under the IM2 NCCR on Interactive Multimodal Information Management (see www.im2.ch).

9.   REFERENCES
[1] J. Allan, B. Croft, A. Moffat, and M. Sanderson. Frontiers, challenges and opportunities for information retrieval: Report from SWIRL 2012. SIGIR Forum, 46(1):2–32, 2012.
[2] O. Alonso and R. A. Baeza-Yates. Design and implementation of relevance assessments using crowdsourcing. In Proceedings of the European Conference on Information Retrieval (ECIR), pages 153–164, 2011.
[3] O. Alonso and M. Lease. Crowdsourcing 101: Putting the "wisdom of the crowd" to work for you. WSDM Tutorial, 2011.
[4] O. Alonso, D. Rose, and B. Stewart. Crowdsourcing for relevance evaluation. SIGIR Forum, 42:9–15, 2008.
[5] J. Carletta. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22:249–254, 1996.
[6] J. Carletta. Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus. Language Resources and Evaluation Journal, 41(2):181–190, 2007.
[7] G. Chittaranjan, O. Aran, and D. Gatica-Perez. Exploiting observers' judgments for nonverbal group interaction analysis. In Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition (FG), 2011.
[8] P. N. Garner, J. Dines, T. Hain, A. El Hannani, M. Karafiat, D. Korchagin, M. Lincoln, V. Wan, and L. Zhang. Real-time ASR from meetings. In Proceedings of Interspeech, pages 2119–2122, 2009.
[9] C. Grady and M. Lease. Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of the NAACL-HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 172–179, 2010.
[10] D. R. Karger, S. Oh, and D. Shah. Budget-optimal crowdsourcing using low-rank matrix approximations. In Proceedings of the Allerton Conference on Communication, Control and Computing, 2011.
[11] G. Kazai. In search of quality in crowdsourcing for search engine evaluation. In Proceedings of the European Conference on Information Retrieval (ECIR), pages 165–176, 2011.
[12] F. K. Khattak and A. Salleb-Aouissi. Quality control of crowd labeling through expert evaluation. In Proceedings of the NIPS 2nd Workshop on Computational Social Science and the Wisdom of Crowds, 2011.
[13] J. Le, A. Edmonds, V. Hester, and L. Biewald. Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution. In Proceedings of the SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation, pages 17–20, 2010.
[14] X. Llorà, K. Sastry, D. E. Goldberg, A. Gupta, and L. Lakshmi. Combating user fatigue in iGAs: Partial ordering, support vector machines, and synthetic fitness. In Proceedings of the Conference on Genetic and Evolutionary Computation (GECCO '05), pages 1363–1370, 2005.
[15] A. Popescu-Belis, E. Boertjes, J. Kilgour, P. Poller, S. Castronovo, T. Wilson, A. Jaimes, and J. Carletta. The AMIDA automatic content linking device: Just-in-time document retrieval in meetings. In Proceedings of Machine Learning for Multimodal Interaction (MLMI), pages 272–283, 2008.
[16] A. Popescu-Belis, M. Yazdani, A. Nanchen, and P. Garner. A speech-based just-in-time retrieval system using semantic search. In Proceedings of the 49th Annual Meeting of the ACL, pages 80–85, 2011.
[17] P. Smyth, U. M. Fayyad, M. C. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labeling of Venus images. In Advances in Neural Information Processing Systems (NIPS), pages 1085–1092, 1994.
[18] P. Thomas and D. Hawking. Evaluation by comparing result sets in context. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM), pages 94–101, 2006.
[19] J. Whitehill, P. Ruvolo, T.-F. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems (NIPS), pages 2035–2043, 2009.



