<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Crowdsourcing to Compare Document Recommendation Strategies for Conversations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maryam Habibi</string-name>
          <email>maryam.habibi@idiap.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrei Popescu-Belis</string-name>
          <email>andrei.popescu-belis@idiap.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Idiap Research Institute and EPFL</institution>
          ,
          <addr-line>Rue Marconi 19, CP 592, 1920 Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Idiap Research Institute</institution>
          ,
          <addr-line>Rue Marconi 19, CP 592, 1920 Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <fpage>15</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>This paper explores a crowdsourcing approach to the evaluation of a document recommender system intended for use in meetings. The system uses words from the conversation to perform just-in-time document retrieval. We compare several versions of the system, including the use of keywords, retrieval using semantic similarity, and the possibility for user initiative. The system's results are submitted for comparative evaluations to workers recruited via a crowdsourcing platform, Amazon's Mechanical Turk. We introduce a new method, Pearson Correlation Coefficient-Information Entropy (PCC-H), to abstract over the quality of the workers' judgments and produce system-level scores. We measure the workers' reliability by the inter-rater agreement of each of them against the others, and use entropy to weight the difficulty of each comparison task. The proposed evaluation method is shown to be reliable, and the results show that adding user initiative improves the relevance of recommendations.</p>
      </abstract>
      <kwd-group>
        <kwd>Document recommender system</kwd>
        <kwd>user initiative</kwd>
        <kwd>crowdsourcing</kwd>
        <kwd>Amazon Mechanical Turk</kwd>
        <kwd>comparative evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval – Query formulation, Retrieval models;
H.3.4 [Information Storage and Retrieval]: Systems
and Software – Performance evaluation</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Copyright is held by the author/owner(s). Workshop on Recommendation
Utility Evaluation: Beyond RMSE (RUE 2012), held in conjunction with
ACM RecSys 2012, September 9, 2012, Dublin, Ireland.</p>
      <p>
        A document recommender system for conversations
provides suggestions for potentially relevant documents within
a conversation, such as a business meeting. Used as a
virtual secretary, the system constantly retrieves documents
that are related to the words of the conversation, using
automatic speech recognition, but users could also be allowed
to make explicit queries. Such a system builds upon
previous approaches known as implicit queries, just-in-time
retrieval, or zero query terms, which were recently confirmed
as a promising research avenue [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Evaluating the relevance of recommendations produced by
such a system is a challenging task. Evaluation in use
requires the full deployment of the system and the setup of
numerous evaluation sessions with realistic meetings. That
is why alternative solutions based on simulations are
important to find. In this paper, we propose to run the document
recommender system over a corpus of conversations and to
use crowdsourcing to compare the relevance of results in
various configurations of the system.</p>
      <p>A crowdsourcing platform, here Amazon's Mechanical Turk,
is helpful for several reasons. First, we can evaluate a large
amount of data in a fast and inexpensive manner. Second,
workers are sampled from the general public, which might
represent a more realistic user model than the system
developers, and they have no contact with each other. However, in
order to use workers' judgments for relevance evaluation, we
have to circumvent the difficulties of measuring the quality
of their evaluations, and factor out the biases of individual
contributions.</p>
      <p>We will define an evaluation protocol using crowdsourcing,
which estimates the quality of the workers' judgments by
predicting task difficulty and workers' reliability, even if no
ground truth to validate the judgments is available. This
approach, named Pearson Correlation Coefficient-Information
Entropy (PCC-H), is inspired by previous studies of
inter-rater agreement as well as by information theory.</p>
      <p>This paper is organized as follows. Section 2 describes
the document recommender system and the different
versions which will be compared. Section 3 reviews previous
research on measuring the quality of workers' judgments for
relevance evaluation and labeling tasks using crowdsourcing.
Section 4 presents our design of the evaluation micro-tasks
("Human Intelligence Tasks") for Amazon's Mechanical
Turk. In Section 5, the proposed PCC-H method for
measuring the quality of judgments is explained. Section 6 presents
the results of our evaluation experiments, which on the one
hand validate the proposed method, and on the other hand
indicate the comparative relevance of the different versions
of the recommender system.</p>
    </sec>
    <sec id="sec-4">
      <title>2. OUTLINE OF THE DOCUMENT RECOMMENDER SYSTEM</title>
      <p>
        The document recommender system under study is the
Automatic Content Linking Device (ACLD [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ]), which
uses real-time automatic speech recognition [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to extract
words from a conversation in a group meeting. The ACLD
filters and aggregates the words to prepare queries at
regular time intervals. The queries can be addressed to a
local database of meeting-related documents, including
transcripts of past meetings if available, but also to a web
search engine. The results are then displayed in an
unobtrusive manner to the meeting participants, who can consult
them if they find them relevant and purposeful.
      </p>
      <p>Since it is difficult to assess the utility of recommended
documents from an absolute perspective, we aim instead at
comparing variants of the ACLD, in order to assess the
improvement (or lack thereof) due to various designs. Here, we
will compare four different approaches to the
recommendation problem, which is in all cases a cold-start problem, as
we do not assume knowledge about participants. Rather, in a
pure content-based manner, the ACLD simply aims to find
the closest documents to a given stretch of conversation.</p>
      <p>
        The four compared versions are the following ones. Two
"standard" versions as in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] differ by the filtering procedure
for the conversation words. One of them (noted AW) uses
all the words (except stop words) spoken by users during a
specific period (typically, 15 s) to retrieve related documents.
The other one (noted KW) filters the words, keeping only
keywords from a pre-defined list related to the topic of the
meeting.
      </p>
      <p>
        Two other methods depart from the initial system. One
of them implements semantic search (noted SS [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]), which
uses a graph-based semantic relatedness measure to
perform retrieval. The most recent version allows user initiative
(noted UI), that is, it can answer explicit queries addressed
by users to the system, with results replacing spontaneous
recommendations for one time period. These queries are processed
by the same ASR component, with participants using a
specific name for the system ("John") to solve the addressing
problem.
      </p>
      <p>
        In the evaluation experiments presented here, we only use
human transcriptions of meetings, to focus on the
evaluation of the retrieval strategy itself. We use one meeting
(ES2008b) from the AMI Meeting Corpus [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in which the
design of a new remote control for a TV set is discussed.
The explicit users' requests for the UI version are simulated
by modifying the transcript at 24 different locations where
we believe that users are likely to ask explicit queries; a
more principled approach for this simulation is currently
under study. We restrict the search to the Wikipedia website,
mainly because the semantic search system is adapted to
this data, using a local copy of it (WEX) that is
semantically indexed. Wikipedia is one of the most popular general
reference works on the Internet, and recommendations over
it are clearly of high potential interest. But alternatively,
all our systems (except the semantic one) could also be run
with non-restricted web searches via Google, or limited to
other web domains or websites.
      </p>
      <p>The 24 fragments of the meeting containing the explicit
queries are submitted for comparison. That is, we want to
know which of the results displayed by the various versions
at the moment following the explicit query are considered
most relevant by external judges. Because the method allows
only binary comparisons, as described below, we
compare UI with the AW and KW versions, and then SS
with KW.</p>
    </sec>
    <sec id="sec-5">
      <title>3. RELATED WORK</title>
      <p>
        Relevance evaluation is a difficult task because it is
subjective and expensive to perform. Two well-known
methods for relevance evaluation are the use of a click-data
corpus, or the use of human experts [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. However, in our case,
producing click data or hiring professional workers for
relevance evaluation would both be overly expensive. Moreover,
it is not clear that evaluation results provided by a narrow
range of experts would be generalizable to a broader range
of end users. In contrast, crowdsourcing, or peer
collaborative annotation, is relatively easy to prototype and to test
experimentally, and provides a cheap and fast approach to
explicit evaluation. However, it is necessary to consider some
problems which are associated with this approach, mainly the
reliability of the workers' judgments (including spammers)
and the intrinsic knowledge of the workers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Recently, many studies have considered the effect of the
task design on relevance evaluation, and proposed design
solutions to decrease time and cost of evaluation and to
increase the accuracy of results. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], several human factors
are considered: query design, terminology and pay, with
their impact on cost, time and accuracy of annotations.
To collect proper results, the effect of user interface
guidelines, inter-rater agreement metrics and justification analysis
were examined [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], showing e.g. that asking workers to write
a short explanation in exchange for a bonus is an efficient
method for detecting spammers. In addition, in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
different batches of tasks were designed to measure the effect
of pay, required effort and worker qualifications on the
accuracy of resulting labels. Another paper [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] has studied
how the distribution of correct answers in the training data
affects worker responses, and suggested using a uniform
distribution to avoid biases from unethical workers.
      </p>
      <p>
        The Technique for Evaluating Relevance by
Crowdsourcing (TERC, see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) emphasizes the importance of qualification
control, e.g. by creating qualification tests that must be
passed before performing the actual task. However, another
study [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] showed that workers may still perform tasks
randomly even after passing qualification tests. Therefore, it
is important to perform partial validation of each worker's
tasks, and weight the judgments of several workers to
produce aggregate scores [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Several other studies have focused on Amazon's
Mechanical Turk crowdsourcing platform and have proposed
techniques to measure the quality of workers' judgments when
there is no ground truth to verify them directly [
        <xref ref-type="bibr" rid="ref10 ref12 ref17 ref19 ref7">17, 19, 7,
10, 12</xref>
        ]. For instance, in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the quality of judgments for
a labeling task is measured using the inter-rater agreement
and majority voting. Expectation maximization (EM) has
sometimes been used to estimate true labels in the absence
of ground truth, e.g. in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for an image labeling task. In
order to improve EM-based estimation of the reliability of
workers, the confidence of workers in each of their
judgments has been used in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as an additional feature, the
task being dominance level estimation for participants in a
conversation. As the performance of the EM algorithm is
not guaranteed, a new method [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] was introduced to
estimate reliability based on low-rank matrix approximation.
      </p>
    </sec>
    <sec id="sec-6">
      <title>5. THE PCC-H METHOD</title>
      <p>Majority voting is frequently used to aggregate multiple
sources of comparative relevance evaluation. However, this
assumes that all HITs share the same difficulty and that all
workers are equally reliable. Here we take into account
the task difficulty Wq and the workers' reliability rw, as both
have been shown to have a significant impact on the
quality of the aggregated judgments. We thus introduce a new
computation method called PCC-H, for Pearson Correlation
Coefficient-Information Entropy.</p>
    </sec>
    <sec id="sec-7">
      <title>5.1 Estimating Worker Reliability</title>
      <p>The PCC-H method computes the Wq and rw values in
two steps. In the first step, PCC-H estimates the reliability
rw of each worker based on the Pearson correlation of that
worker's judgments with the average of all the other workers'
judgments (see Eq. 1).</p>
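      <p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>r_w = \sum_{a=1}^{A} \frac{\sum_{q=1}^{Q} (X_{wqa} - \bar{X}_{wa})(Y_{qa} - \bar{Y}_{a})}{(Q-1)\, S_{X_{wa}} S_{Y_{a}}}</tex-math>
        </disp-formula>
      </p>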
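      <p>In Equation 1, Q is the number of meeting fragments and A is
the number of options. <inline-formula><tex-math>X_{wqa}</tex-math></inline-formula>
is the value that worker w assigned to option a of fragment
q: it is 1 if option a is selected by worker
w, and 0 otherwise. <inline-formula><tex-math>\bar{X}_{wa}</tex-math></inline-formula> and <inline-formula><tex-math>S_{X_{wa}}</tex-math></inline-formula> are the expected value
and standard deviation of variable <inline-formula><tex-math>X_{wqa}</tex-math></inline-formula>, respectively. <inline-formula><tex-math>Y_{qa}</tex-math></inline-formula>
is the average value which all other workers assign to
option a of fragment q. <inline-formula><tex-math>\bar{Y}_{a}</tex-math></inline-formula> and <inline-formula><tex-math>S_{Y_{a}}</tex-math></inline-formula> are the expected value
and standard deviation of variable <inline-formula><tex-math>Y_{qa}</tex-math></inline-formula>.</p>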
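      <p>The reliability estimate of Eq. 1 can be sketched in a few lines of Python with NumPy. This is a minimal illustration rather than the authors' implementation: the function names <monospace>one_hot</monospace> and <monospace>worker_reliability</monospace>, the array layout X[w, q, a], and the decision to skip options whose standard deviation is zero (where the correlation is undefined) are all assumptions made for this example.</p>
      <preformat>
```python
import numpy as np

def one_hot(choices, n_options):
    """Binary coding: choices[w][q] is the index of the option picked by worker w."""
    return np.eye(n_options)[np.asarray(choices)]

def worker_reliability(X):
    """Sum over options of the Pearson correlation between each worker's
    votes and the average votes of all the other workers (one reading of Eq. 1).

    X[w, q, a] is 1 if worker w selected option a for fragment q, else 0.
    """
    X = np.asarray(X, dtype=float)
    W, Q, A = X.shape
    r = np.zeros(W)
    for w in range(W):
        # Y[q, a]: average value assigned by all workers except w
        Y = (X.sum(axis=0) - X[w]) / (W - 1)
        for a in range(A):
            x, y = X[w, :, a], Y[:, a]
            sx, sy = x.std(ddof=1), y.std(ddof=1)
            if sx == 0 or sy == 0:
                continue  # assumption: undefined correlations contribute 0
            cov = ((x - x.mean()) * (y - y.mean())).sum()
            r[w] += cov / ((Q - 1) * sx * sy)
    return r

# Hypothetical toy data: three workers agree, the fourth always disagrees.
X = one_hot([[0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 1, 0]], 2)
print(worker_reliability(X))
```
      </preformat>
      <p>In this toy example the three agreeing workers each receive one perfect correlation per option, and the dissenting worker receives the opposite value. Under this reading of Eq. 1, the score of a worker always lies between -A and A.</p>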
      <p>The value of rw computed above is used as a weight for
computing RVqa, the relevance value of option a of each
fragment q, according to Eq. 2.</p>
      <p>
        All of the above-mentioned studies assume that tasks share
the same level of difficulty. To model both task difficulty
and user reliability, an EM-based method named GLAD was
proposed by [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] for an image labeling task. However, this
method is sensitive to the initialization value, hence a good
estimation of labels requires a small amount of data with
ground truth annotation [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>4. SETUP OF THE EXPERIMENT</title>
      <p>
        Amazon's Mechanical Turk (AMT) is a crowdsourcing
platform which gives access to a vast pool of online
workers paid by requesters to complete human intelligence tasks
(HITs). Once HITs are designed and published, registered workers
who fulfill the requesters' selection criteria are invited by
the AMT service to work on HITs in exchange for a small amount
of money per HIT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>As it is difficult to find an absolute relevance score for
each version of the ACLD recommender system, we only
aim for comparative relevance evaluation between versions.
For each pair of versions, a batch of HITs was designed with
their results. Each HIT (see example in Fig. 1) contains a
fragment of conversation transcript with the two lists of
document recommendations to be compared. Only the first six
recommendations are kept for each version. The lists from
the two compared versions are placed in random positions
(first or second) across HITs, to avoid biases from a constant
position.</p>
      <p>We experimented with two different HIT designs. The
first one offers evaluators a binary choice: either the first list
is considered more relevant than the second, or vice-versa.
In other words, workers are obliged to express a preference
for one of the two recommendation sets. This encourages
decisions, but of course may be inappropriate when the two
answers are of comparable quality, though this may be evened
out when averaging over workers. The second design gives
workers four choices (as in Figure 1): in addition to the
previous two options, they can indicate either that both lists
seem equally relevant, or equally irrelevant. In both designs,
workers must select exactly one option.</p>
      <p>To assign a value to each worker's judgment, a binary
coding scheme is used in the computations below, assigning
a value of 1 to the selected option and 0 to all others. The
relevance value RV of each recommendation list for a
meeting fragment is computed by giving a weight to each worker
judgment and averaging them. The Percentage of Relevance
Value, noted PRV, shows the relevance value of each
compared system, and is computed by assigning a weight to each
meeting fragment and averaging the relevance values RV
over all meeting fragments.</p>
      <p>There are 24 meeting fragments, hence 24 HITs in each
batch for comparing pairs of systems, for UI vs. AW and
UI vs. KW. As user queries are not needed for comparing
SS vs. KW, we designed 36 HITs, with 30-second fragments
for each. There are 10 workers per HIT, so there are 240
total assignments for UI-vs-KW and for UI-vs-AW (with a
2-choice and 4-choice design for each), and 360 for SS-vs-KW.
As workers are paid 0.02 USD per HIT, the cost for the five
separate experiments was 33 USD, with an apparent average
hourly rate of 1.60 USD. The average time per assignment
is almost 50 seconds. All five tasks took only 17 hours to be
performed by workers via AMT. For qualification control we
allow workers with greater than 95% approval rate or with
more than 1000 approved HITs.</p>
      <p>
        <disp-formula>
          <label>(2)</label>
          <tex-math>RV_{qa} = \frac{\sum_{w=1}^{W} r_w X_{wqa}}{\sum_{w=1}^{W} r_w}</tex-math>
        </disp-formula>
      </p>
      <p>For HIT designs with two options, RVqa shows the
relevance value of each answer list a. However, for the
four-option HIT designs, RVql for each answer list l is
formulated as in Eq. 3 below:</p>
      <p>
        <disp-formula>
          <label>(3)</label>
          <tex-math>RV_{ql} = RV_{ql} + \frac{RV_{qb}}{2} - \frac{RV_{qn}}{2}</tex-math>
        </disp-formula>
      </p>
      <p>In this equation, half of the relevance value of the case
in which both lists are relevant, RVqb, is added as a reward,
and half of the relevance value of the case in which both
lists are irrelevant, RVqn, is subtracted as a penalty from the
relevance value of each answer list RVql.</p>
    </sec>
    <sec id="sec-9">
      <title>5.2 Estimating Task Difficulty</title>
      <p>In a second step, PCC-H considers the task difficulty for
each fragment of the meeting. The goal is to reduce the
effect of some fragments of the meeting, in which there is
uncertainty in the workers' judgments, e.g. because there are
no relevant search results in Wikipedia for the current
fragment. To lessen the effect of uncertainty in our judgments,
the entropy of answers for each fragment of the meeting is
computed and a function of it is used as a weight for each
fragment. This weight is used for computing the percentage
of relevance value PRV. Entropy, weight and PRV are
defined in Eqs. 4–6, where A is the number of options, and Hq
and Wq are the entropy and weight of fragment q.</p>
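      <p>The entropy-based weighting described above can be illustrated with a short Python sketch. Since Eqs. 4–6 are not reproduced in this text, the exact formulas used here are assumptions: the entropy is taken over the relevance values of the A options of a fragment (assumed to be non-negative and to sum to 1), it is normalized by log A, the weight is assumed to be one minus the normalized entropy, and PRV is assumed to be the weighted average of the relevance values. The names <monospace>fragment_weights</monospace> and <monospace>prv</monospace> are hypothetical.</p>
      <preformat>
```python
import numpy as np

def fragment_weights(RV):
    """RV[q, a]: relevance value of option a for fragment q (each row sums to 1).

    Returns one weight per fragment: 1 for unanimous answers,
    0 for answers spread evenly over all options (assumed weight form).
    """
    RV = np.asarray(RV, dtype=float)
    A = RV.shape[1]
    # Convention 0 * log(0) = 0: the log is only evaluated where RV is positive.
    logs = np.zeros_like(RV)
    np.log(RV, out=logs, where=RV > 0)
    H = -(RV * logs).sum(axis=1) / np.log(A)  # normalized entropy in [0, 1]
    return 1.0 - H

def prv(RV, weights):
    """Weighted average of relevance values over fragments (assumed form of PRV)."""
    RV = np.asarray(RV, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * RV).sum(axis=0) / weights.sum()

# Hypothetical relevance values for three fragments and two answer lists.
RV = [[1.0, 0.0], [0.5, 0.5], [0.75, 0.25]]
Wq = fragment_weights(RV)
print(Wq)
print(prv(RV, Wq))
```
      </preformat>
      <p>With these assumptions, the evenly split second fragment receives a weight of zero and no longer pulls the score toward the middle, which is the intended effect of discounting fragments on which workers disagree.</p>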
    </sec>
    <sec id="sec-10">
      <title>6. RESULTS OF THE EXPERIMENTS</title>
      <p>Two sets of experiments were performed. First, we
attempt to validate the PCC-H method. Then, we apply the
PCC-H method to compute PRV for each answer list to
conclude which version of the system outperforms the others.</p>
      <p>In order to make an initial validation of the workers'
judgments, we compare the judgments of individual workers with
those of an expert. For each worker, the number of
fragments for which the answer is the same as the expert's
answer is counted, and the total is divided by the number
of fragments to compute accuracy. Then we compare this
value with rw, which is estimated as the reliability
measurement for each worker's judgment. The percentage of
agreement between each worker and the expert (ew) and the
rw for each worker, for one of the batches, are shown in Table 1,
with an overall agreement between these two values for each
worker. In other words, workers who have more similarity
with our expert also have more inter-rater agreement with
other workers. Since in the general case there is no ground
truth (expert) to verify workers' judgments, we rely on the
inter-rater agreement for the other experiments.</p>
      <p>Firstly, equal weights are assigned to all the worker evaluations
and fragments to compute the PRV values for the two answer lists of
our experiments, which are shown in Table 2.</p>
      <p>In this approach, it is assumed that all the workers are
reliable and all the fragments share the same difficulty. To
handle workers' reliability, we consider workers with lower
rw as outliers. One approach is to remove all the outliers.
For instance, the four workers with lowest rw are considered
outliers and are deleted, and the same weight is given to the
remaining six workers. The result of comparative evaluation
based on removing outliers is shown in Table 3.</p>
      <p>In the computation above, an arbitrary border was defined
between outliers and other workers as a decision boundary
for removing outliers. However, instead of deleting
workers with lower rw, who might still have potentially useful
insights on relevance, it is rational to give a weight to all
workers' judgments based on a confidence value. The PRV
values for each answer list of the four experiments, based on assigning
weight rw to each worker's evaluation and equal weights to
all meeting fragments, are shown in Table 4.</p>
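      <p>The three aggregation schemes discussed so far (equal weights, removal of outliers, and weighting by rw) differ only in the worker weights that enter Eq. 2. The following Python sketch makes this explicit and also applies the adjustment of Eq. 3 for the four-choice design. It is an illustration under stated assumptions: the function names, the array layout X[w, q, a], and the column order of the four options are hypothetical, and worker weights are assumed to have a positive sum.</p>
      <preformat>
```python
import numpy as np

def relevance_values(X, r=None):
    """Eq. 2: RV[q, a] = sum_w r[w] * X[w, q, a] / sum_w r[w].

    With r=None every worker gets weight 1 (equal-weight baseline);
    setting some entries of r to 0 mimics the removal of outliers.
    """
    X = np.asarray(X, dtype=float)
    r = np.ones(X.shape[0]) if r is None else np.asarray(r, dtype=float)
    return np.tensordot(r, X, axes=1) / r.sum()

def adjust_four_choice(RV4):
    """Eq. 3, assuming columns ordered as
    [list 1, list 2, both relevant, both irrelevant]."""
    RV4 = np.asarray(RV4, dtype=float)
    reward = RV4[:, 2:3] / 2
    penalty = RV4[:, 3:4] / 2
    return RV4[:, :2] + reward - penalty

# Hypothetical votes: one fragment, two options, three workers.
X = np.array([[[1, 0]], [[1, 0]], [[0, 1]]])
print(relevance_values(X))                   # equal weights
print(relevance_values(X, [1, 1, 0]))        # third worker removed as an outlier
print(relevance_values(X, [0.5, 0.5, 1.0]))  # reliability-weighted
print(adjust_four_choice([[0.4, 0.2, 0.3, 0.1]]))
```
      </preformat>
      <p>Removing outliers is thus the special case in which the weights are restricted to 0 or 1, whereas weighting by rw keeps every judgment but lets reliable workers count more.</p>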
      <p>In order to show that our method is stable across different
HIT designs, we used two different HIT designs for each
pair, as mentioned in Section 4. We show that PRV
converges to the same value for each pair with different HIT
designs. As observed in Table 4, the PRV values of the AW-vs-UI pair
are not quite similar for the two different HIT designs, although
the answer lists are the same. In fact, we observed that, in
several cases, there was no strong agreement among workers
on which answer list is more relevant to a meeting
fragment, and we consider these to be "difficult"
fragments. Since the source of uncertainty is undefined, we can
reduce the effect of such a fragment on the comparison by
giving a weight to each fragment in proportion to the difficulty
of assigning RVql. The PRV values thus obtained for all
experiments are represented in Table 5. As shown there, the
PRV values of the AW-vs-UI pair are now very similar for the 2-choice and
4-choice HIT designs. Moreover, the difference between the system
versions is emphasized, which indicates that the sensitivity
of the comparison method has increased.</p>
      <p>
        Moreover, we compare the PCC-H method with the
majority voting method and the GLAD method (Generative
model of Labels, Abilities, and Difficulties [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]) for
estimating comparative relevance value through considering task
difficulty and worker reliability parameters. We run the
GLAD algorithm with the same initial values for all four
experiments. The PRV s which are computed by majority
voting, GLAD and PCC-H are shown in Table 6.
      </p>
      <p>As shown in Table 6, the PRV values computed by the
PCC-H method for both HIT designs are very close to those
of GLAD for the 4-choice HIT design. Moreover, the PRV
values obtained by the PCC-H method for the two different
HIT designs are very similar, which is less the case for
majority voting and GLAD. This means that the PCC-H method
is able to calculate the PRV values independently of the exact HIT
design. Moreover, the PRV values calculated using PCC-H
are more robust since the proposed method is not dependent
on initialization values, as GLAD is. Therefore, using
PCC-H for measuring the reliability of workers' judgments is also
an appropriate method for qualification control of workers
from crowdsourcing platforms.</p>
      <p>The proposed method is also applied for comparative
evaluation of SS-vs-KW search results (semantic search vs.
keyword-based search). The PRV values are calculated by three
different methods, as shown in Table 7. The first method is the
majority voting method, which considers all the workers and
fragments with the same weight. The second method assigns
weights computed by the PCC-H method to measure PRV, and the
third one is the GLAD method. According to all three scores,
the SS version outperforms the KW version.</p>
    </sec>
    <sec id="sec-11">
      <title>7. CONCLUSION AND PERSPECTIVES</title>
      <p>In all the evaluation steps, the UI system appeared to
produce more relevant recommendations than AW or KW.
Using KW instead of AW improved PRV by 10 percent. This
means that user initiative, i.e. letting users ask explicit queries in
conversation, improves over the AW and KW versions, which only
make spontaneous recommendations. Nevertheless, KW can be
used alongside the UI version as an assistant which suggests
documents based on the context of the meeting, that is,
spontaneous recommendations can be made when no user
initiates a search. Moreover, the SS version works better
than the KW version, which shows the advantage of
semantic search.</p>
      <p>As for the evaluation method, PCC-H outperformed the
GLAD method proposed earlier for estimating task difficulty
and reliability of workers in the absence of ground truth.
Based on the evaluation results, the PCC-H method is
acceptable for qualification control of AMT workers or
judgments, because it provides a more stable PRV score across
different HIT designs. Moreover, PCC-H does not require
any initialization.</p>
      <p>
        The comparative nature of PCC-H imposes some
restrictions on the evaluations that can be carried out. For
instance, if N versions must be compared, this calls in theory
for N(N − 1)/2 comparisons, which is clearly
impractical when N grows. This can be solved if a priori
knowledge about the quality of the systems is available, to avoid
redundant comparisons. Moreover, an approach to reduce
the number of pairwise comparisons required from human
raters proposed in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] could be ported to our context. For
progress evaluation, a new version must be compared with
the best performing previous version, looking for
measurable improvement, in which case PCC-H fully answers the
evaluation needs.
      </p>
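      <p>The growth in the number of pairwise comparisons can be made concrete with a short Python example using the standard library. The helper name <monospace>pairwise_comparisons</monospace> is hypothetical and serves only to enumerate the candidate pairs for the four versions studied here.</p>
      <preformat>
```python
from itertools import combinations

def pairwise_comparisons(versions):
    """All unordered pairs of system versions, N * (N - 1) / 2 in total."""
    return list(combinations(versions, 2))

pairs = pairwise_comparisons(["AW", "KW", "SS", "UI"])
print(len(pairs))  # 4 * 3 / 2 = 6 candidate pairs
print(pairs)
```
      </preformat>
      <p>The experiments reported in this paper cover three of these six pairs (UI vs. AW, UI vs. KW, and SS vs. KW); with ten versions, the full set would already contain 45 pairs.</p>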
      <p>There are instances in which the search results of both
versions are irrelevant. The goal of future work will be to
reduce the number of such uncertain instances, to deal with
ambiguous questions, and to improve the processing of
user-directed queries by recognizing the context of the
conversation. Another experiment should improve the design of
simulated user queries, in order to make them more
realistic.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGMENTS</title>
      <p>The authors are grateful to the Swiss National Science
Foundation for its financial support under the IM2 NCCR
on Interactive Multimodal Information Management (see
www.im2.ch).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          .
          <article-title>Frontiers, challenges and opportunities for information retrieval: Report from SWIRL 2012</article-title>
          .
          <source>SIGIR Forum</source>
          ,
          <volume>46</volume>
          (
          <issue>1</issue>
          ):2–
          <fpage>32</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Alonso</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          .
          <article-title>Design and implementation of relevance assessments using crowdsourcing</article-title>
          .
          <source>In Proceedings of the European Conference on Information Retrieval (ECIR)</source>
          , pages
          <fpage>153</fpage>
          –
          <fpage>164</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Alonso</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lease</surname>
          </string-name>
          .
          <article-title>Crowdsourcing 101: Putting the "wisdom of the crowd" to work for you</article-title>
          .
          <source>WSDM Tutorial</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Alonso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rose</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Stewart</surname>
          </string-name>
          .
          <article-title>Crowdsourcing for relevance evaluation</article-title>
          .
          <source>SIGIR Forum</source>
          ,
          <volume>42</volume>
          :
          <fpage>9</fpage>
          –
          <lpage>15</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carletta</surname>
          </string-name>
          .
          <article-title>Assessing agreement on classification tasks: The kappa statistic</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>22</volume>
          :
          <fpage>249</fpage>
          –
          <lpage>254</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carletta</surname>
          </string-name>
          .
          <article-title>Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus</article-title>
          .
          <source>Language Resources and Evaluation Journal</source>
          ,
          <volume>41</volume>
          (
          <issue>2</issue>
          ):
          <fpage>181</fpage>
          –
          <lpage>190</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Chittaranjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Aran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Gatica-Perez</surname>
          </string-name>
          .
          <article-title>Exploiting observers' judgments for nonverbal group interaction analysis</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition (FG)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Garner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El Hannani</surname>
          </string-name>
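          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karafiat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korchagin</surname>
          </string-name>
          ,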
          <string-name>
            <given-names>M.</given-names>
            <surname>Lincoln</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Wan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Real-time ASR from meetings</article-title>
          .
          <source>In Proceedings of Interspeech</source>
          , pages
          <fpage>2119</fpage>
          –
          <lpage>2122</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Grady</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lease</surname>
          </string-name>
          .
          <article-title>Crowdsourcing document relevance assessment with Mechanical Turk</article-title>
          .
          <source>In Proceedings of the NAACL-HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk</source>
          , pages
          <fpage>172</fpage>
          –
          <lpage>179</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Karger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <article-title>Budget-optimal crowdsourcing using low-rank matrix approximations</article-title>
          .
          <source>In Proceedings of the Allerton Conference on Communication, Control and Computing</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kazai</surname>
          </string-name>
          .
          <article-title>In search of quality in crowdsourcing for search engine evaluation</article-title>
          .
          <source>In Proceedings of the European Conference on Information Retrieval (ECIR)</source>
          , pages
          <fpage>165</fpage>
          –
          <lpage>176</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F. K.</given-names>
            <surname>Khattak</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Salleb-Aouissi</surname>
          </string-name>
          .
          <article-title>Quality control of crowd labeling through expert evaluation</article-title>
          .
          <source>In Proceedings of the NIPS 2nd Workshop on Computational Social Science and the Wisdom of Crowds</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Edmonds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hester</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Biewald</surname>
          </string-name>
          .
          <article-title>Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution</article-title>
          .
          <source>In Proceedings of the SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation</source>
          , pages
          <fpage>17</fpage>
          –
          <lpage>20</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Llora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.E.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Lakshmi</surname>
          </string-name>
          .
          <article-title>Combating user fatigue in iGAs: Partial ordering, support vector machines, and synthetic fitness</article-title>
          .
          <source>In Proceedings of the Conference on Genetic and Evolutionary Computation (GECCO '05)</source>
          , pages
          <fpage>1363</fpage>
          –
          <lpage>1370</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu-Belis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Boertjes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kilgour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Poller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Castronovo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaimes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Carletta</surname>
          </string-name>
          .
          <article-title>The AMIDA automatic content linking device: Just-in-time document retrieval in meetings</article-title>
          .
          <source>In Proceedings of Machine Learning for Multimodal Interaction (MLMI)</source>
          , pages
          <fpage>272</fpage>
          –
          <lpage>283</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu-Belis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yazdani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nanchen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Garner</surname>
          </string-name>
          .
          <article-title>A speech-based just-in-time retrieval system using semantic search</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the ACL</source>
          , pages
          <fpage>80</fpage>
          –
          <lpage>85</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Smyth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. M.</given-names>
            <surname>Fayyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Burl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Baldi</surname>
          </string-name>
          .
          <article-title>Inferring ground truth from subjective labeling of Venus images</article-title>
          .
          <source>In Advances in Neural Information Processing Systems (NIPS)</source>
          , pages
          <fpage>1085</fpage>
          –
          <lpage>1092</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Thomas</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Hawking</surname>
          </string-name>
          .
          <article-title>Evaluation by comparing result sets in context</article-title>
          .
          <source>In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM)</source>
          , pages
          <fpage>94</fpage>
          –
          <lpage>101</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Whitehill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ruvolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bergsma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Movellan</surname>
          </string-name>
          .
          <article-title>Whose vote should count more: Optimal integration of labels from labelers of unknown expertise</article-title>
          .
          <source>In Advances in Neural Information Processing Systems (NIPS)</source>
          , pages
          <fpage>2035</fpage>
          –
          <lpage>2043</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>