=Paper=
{{Paper
|id=None
|storemode=property
|title=Using Crowdsourcing to Compare Document Recommendation Strategies for Conversations
|pdfUrl=https://ceur-ws.org/Vol-910/paper3.pdf
|volume=Vol-910
|dblpUrl=https://dblp.org/rec/conf/recsys/HabibiP12
}}
==Using Crowdsourcing to Compare Document Recommendation Strategies for Conversations==
Maryam Habibi (Idiap Research Institute and EPFL) and Andrei Popescu-Belis (Idiap Research Institute), Rue Marconi 19, CP 592, 1920 Martigny, Switzerland. Contact: maryam.habibi@idiap.ch, andrei.popescu-belis@idiap.ch

ABSTRACT

This paper explores a crowdsourcing approach to the evaluation of a document recommender system intended for use in meetings. The system uses words from the conversation to perform just-in-time document retrieval. We compare several versions of the system, including the use of keywords, retrieval using semantic similarity, and the possibility for user initiative. The system's results are submitted for comparative evaluation to workers recruited via a crowdsourcing platform, Amazon's Mechanical Turk. We introduce a new method, Pearson Correlation Coefficient-Information Entropy (PCC-H), to abstract over the quality of the workers' judgments and produce system-level scores. We measure the workers' reliability by the inter-rater agreement of each of them against the others, and use entropy to weight the difficulty of each comparison task. The proposed evaluation method is shown to be reliable, and the results show that adding user initiative improves the relevance of recommendations.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval (Query formulation, Retrieval models); H.3.4 [Information Storage and Retrieval]: Systems and Software (Performance evaluation)

General Terms

Evaluation, Uncertainty, Reliability, Metric

Keywords

Document recommender system, user initiative, crowdsourcing, Amazon Mechanical Turk, comparative evaluation

1. INTRODUCTION

A document recommender system for conversations provides suggestions for potentially relevant documents within a conversation, such as a business meeting. Used as a virtual secretary, the system constantly retrieves documents that are related to the words of the conversation, using automatic speech recognition, but users could also be allowed to make explicit queries. Such a system builds upon previous approaches known as implicit queries, just-in-time retrieval, or zero query terms, which were recently confirmed as a promising research avenue [1].

Evaluating the relevance of recommendations produced by such a system is a challenging task. Evaluation in use requires the full deployment of the system and the setup of numerous evaluation sessions with realistic meetings. That is why alternative solutions based on simulations are important to find. In this paper, we propose to run the document recommender system over a corpus of conversations and to use crowdsourcing to compare the relevance of its results in various configurations of the system.

A crowdsourcing platform, here Amazon's Mechanical Turk, is helpful for several reasons. First, we can evaluate a large amount of data in a fast and inexpensive manner. Second, workers are sampled from the general public, which might represent a more realistic user model than the system developers, and they have no contact with each other. However, in order to use workers' judgments for relevance evaluation, we have to circumvent the difficulties of measuring the quality of their evaluations, and factor out the biases of individual contributions.

We will define an evaluation protocol using crowdsourcing which estimates the quality of the workers' judgments by predicting task difficulty and worker reliability, even if no ground truth to validate the judgments is available. This approach, named Pearson Correlation Coefficient-Information Entropy (PCC-H), is inspired by previous studies of inter-rater agreement as well as by information theory.

This paper is organized as follows. Section 2 describes the document recommender system and the different versions which will be compared. Section 3 reviews previous research on measuring the quality of workers' judgments for relevance evaluation and labeling tasks using crowdsourcing. Section 4 presents our design of the evaluation micro-tasks ("Human Intelligence Tasks") for Amazon's Mechanical Turk. In Section 5, the proposed PCC-H method for measuring the quality of judgments is explained. Section 6 presents the results of our evaluation experiments, which on the one hand validate the proposed method, and on the other hand indicate the comparative relevance of the different versions of the recommender system.
2. OUTLINE OF THE DOCUMENT RECOMMENDER SYSTEM

The document recommender system under study is the Automatic Content Linking Device (ACLD [15, 16]), which uses real-time automatic speech recognition [8] to extract words from a conversation in a group meeting. The ACLD filters and aggregates the words to prepare queries at regular time intervals. The queries can be addressed to a local database of meeting-related documents, including transcripts of past meetings if available, but also to a web search engine. The results are then displayed in an unobtrusive manner to the meeting participants, who can consult them if they find them relevant and useful.

Since it is difficult to assess the utility of recommended documents from an absolute perspective, we aim instead at comparing variants of the ACLD, in order to assess the improvement (or lack thereof) due to various designs. Here, we will compare four different approaches to the recommendation problem, which is in all cases a cold-start problem, as we do not assume any knowledge about the participants. Rather, in a pure content-based manner, the ACLD simply aims to find the documents closest to a given stretch of conversation.

The four compared versions are the following. Two "standard" versions, as in [15], differ by the filtering procedure applied to the conversation words. The first one (noted AW) uses all the words (except stop words) spoken by the users during a specific period (typically 15 s) to retrieve related documents. The other one (noted KW) filters the words, keeping only keywords from a pre-defined list related to the topic of the meeting.

Two other methods depart from the initial system. One of them implements semantic search (noted SS [16]), which uses a graph-based semantic relatedness measure to perform retrieval. The most recent version allows user initiative (noted UI), that is, it can answer explicit queries addressed by users to the system, with the results replacing spontaneous recommendations for one time period. These queries are processed by the same ASR component, with participants using a specific name for the system ("John") to solve the addressing problem.

In the evaluation experiments presented here, we only use human transcriptions of the meetings, in order to focus on the evaluation of the retrieval strategy itself. We use one meeting (ES2008b) from the AMI Meeting Corpus [6], in which the design of a new remote control for a TV set is discussed. The explicit user requests for the UI version are simulated by modifying the transcript at 24 different locations where we believe that users are likely to ask explicit queries; a more principled approach to this simulation is currently under study. We restrict the search to the Wikipedia website, mainly because the semantic search system is adapted to this data, using a local copy of it (WEX) that is semantically indexed. Wikipedia is one of the most popular general reference works on the Internet, and recommendations over it are clearly of high potential interest. Alternatively, all our systems (except the semantic one) could also be run with non-restricted web searches via Google, or limited to other web domains or websites.

The 24 fragments of the meeting containing the explicit queries are submitted for comparison. That is, we want to know which of the results displayed by the various versions at the moment following the explicit query are considered most relevant by external judges. As the method allows only binary comparisons, as we will now describe, we compare UI with the AW and KW versions, and then SS with KW.
3. RELATED WORK

Relevance evaluation is a difficult task because it is subjective and expensive to perform. Two well-known methods for relevance evaluation are the use of a click-data corpus and the use of human experts [18]. However, in our case, producing click data or hiring professional workers for relevance evaluation would both be overly expensive. Moreover, it is not clear that evaluation results provided by a narrow range of experts would generalize to a broader range of end users. In contrast, crowdsourcing, or peer collaborative annotation, is relatively easy to prototype and to test experimentally, and provides a cheap and fast approach to explicit evaluation. However, it is necessary to consider some problems associated with this approach, mainly the reliability of the workers' judgments (including spammers) and the intrinsic knowledge of the workers [3].

Recently, many studies have considered the effect of task design on relevance evaluation, and proposed design solutions to decrease the time and cost of evaluation and to increase the accuracy of results. In [9], several human factors are considered (query design, terminology and pay), together with their impact on the cost, time and accuracy of annotations. The effects of user interface guidelines, inter-rater agreement metrics and justification analysis on collecting proper results were examined in [2], showing e.g. that asking workers to write a short explanation in exchange for a bonus is an efficient method for detecting spammers. In addition, in [11], different batches of tasks were designed to measure the effect of pay, required effort and worker qualifications on the accuracy of the resulting labels. Another paper [13] studied how the distribution of correct answers in the training data affects worker responses, and suggested using a uniform distribution to avoid biases from unethical workers.

The Technique for Evaluating Relevance by Crowdsourcing (TERC, see [4]) emphasizes the importance of qualification control, e.g. by creating qualification tests that must be passed before performing the actual task. However, another study [2] showed that workers may still perform tasks randomly even after passing qualification tests. Therefore, it is important to perform partial validation of each worker's tasks, and to weight the judgments of several workers to produce aggregate scores [4].

Several other studies have focused on the Amazon Mechanical Turk crowdsourcing platform and have proposed techniques to measure the quality of workers' judgments when there is no ground truth to verify them directly [17, 19, 7, 10, 12]. For instance, in [5], the quality of judgments for a labeling task is measured using inter-rater agreement and majority voting. Expectation maximization (EM) has sometimes been used to estimate true labels in the absence of ground truth, e.g. in [17] for an image labeling task. In order to improve the EM-based estimation of worker reliability, the confidence of workers in each of their judgments was used in [7] as an additional feature, the task being dominance level estimation for participants in a conversation. As the performance of the EM algorithm is not guaranteed, a new method [10] was introduced to estimate reliability based on low-rank matrix approximation.

All of the above-mentioned studies assume that tasks share the same level of difficulty. To model both task difficulty and worker reliability, an EM-based method named GLAD was proposed in [19] for an image labeling task. However, this method is sensitive to its initialization value, hence a good estimation of labels requires a small amount of data with ground truth annotation [12].
4. SETUP OF THE EXPERIMENT

Amazon's Mechanical Turk (AMT) is a crowdsourcing platform which gives access to a vast pool of online workers paid by requesters to complete Human Intelligence Tasks (HITs). Once a HIT is designed and published, registered workers who fulfill the requester's selection criteria are invited by the AMT service to work on it in exchange for a small amount of money per HIT [3].

As it is difficult to find an absolute relevance score for each version of the ACLD recommender system, we only aim for a comparative relevance evaluation between versions. For each pair of versions, a batch of HITs was designed with their results. Each HIT (see the example in Fig. 1) contains a fragment of the conversation transcript with the two lists of document recommendations to be compared. Only the first six recommendations are kept for each version. The lists from the two compared versions are placed in random positions (first or second) across HITs, to avoid biases from a constant position.

We experimented with two different HIT designs. The first one offers evaluators a binary choice: either the first list is considered more relevant than the second, or vice-versa. In other words, workers are obliged to express a preference for one of the two recommendation sets. This encourages decisions, but may of course be inappropriate when the two answers are of comparable quality, though this may be evened out when averaging over workers. The second design gives workers four choices (as in Figure 1): in addition to the previous two options, they can indicate either that both lists seem equally relevant, or equally irrelevant. In both designs, workers must select exactly one option.

[Figure 1: Snapshot of a 4-choice HIT: workers read the conversation transcript, examine the two answer lists (with recommended documents for the respective conversation fragment) and select one of the four comparative choices (#1 better than #2, #2 better than #1, both equally good, both equally poor). A short comment can be added.]
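As an illustration of the batch design described above, the following Python sketch shows how one HIT per conversation fragment could be assembled, keeping only the top six recommendations of each version and randomizing which list appears first. This is a hypothetical illustration, not the authors' actual tooling; all names (build_hit_batch, results_a, results_b) are made up.

```python
import random

def build_hit_batch(fragments, results_a, results_b, top_k=6, seed=0):
    """Assemble one pairwise-comparison HIT per conversation fragment.

    `fragments` maps fragment ids to transcript excerpts; `results_a` and
    `results_b` map fragment ids to the ranked recommendation lists of the
    two system versions being compared (hypothetical data structures)."""
    rng = random.Random(seed)
    hits = []
    for frag_id, transcript in fragments.items():
        lists = [("version_a", results_a[frag_id][:top_k]),
                 ("version_b", results_b[frag_id][:top_k])]
        rng.shuffle(lists)  # random first/second position, to avoid a constant-position bias
        hits.append({"fragment": frag_id,
                     "transcript": transcript,
                     "list_1": lists[0],
                     "list_2": lists[1]})
    return hits
```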
To assign a value to each worker's judgment, a binary coding scheme is used in the computations below, assigning a value of 1 to the selected option and 0 to all others. The relevance value RV of each recommendation list for a meeting fragment is computed by giving a weight to each worker's judgment and averaging them. The Percentage of Relevance Value, noted PRV, indicates the relevance value of each compared system, and is computed by assigning a weight to each part of the meeting and averaging the relevance values RV over all meeting fragments.

There are 24 meeting fragments, hence 24 HITs in each batch for comparing pairs of systems, for UI vs. AW and UI vs. KW. As user queries are not needed for comparing SS vs. KW, we designed 36 HITs for this pair, with 30-second fragments for each. There are 10 workers per HIT, so there are 240 assignments in total for UI-vs-KW and for UI-vs-AW (with a 2-choice and a 4-choice design for each), and 360 for SS-vs-KW. As workers are paid 0.02 USD per HIT, the cost of the five separate experiments was 33 USD, with an apparent average hourly rate of 1.60 USD. The average time per assignment is almost 50 seconds, and all five tasks took only 17 hours to be performed by workers via AMT. For qualification control, we allow only workers with an approval rate greater than 95% or with more than 1000 approved HITs.

5. THE PCC-H METHOD

Majority voting is frequently used to aggregate multiple sources of comparative relevance evaluation. However, it assumes that all HITs share the same difficulty and that all workers are equally reliable. We take here into account the task difficulty W_q and the workers' reliability r_w, as it was shown that they have a significant impact on the quality of the aggregated judgments. We thus introduce a new computation method called PCC-H, for Pearson Correlation Coefficient-Information Entropy.

5.1 Estimating Worker Reliability

The PCC-H method computes the W_q and r_w values in two steps. In a first step, PCC-H estimates the reliability r_w of each worker based on the Pearson correlation of the worker's judgments with the average of all the other workers' judgments, as in Eq. 1:

    r_w = \frac{\sum_{a=1}^{A} \sum_{q=1}^{Q} (X_{wqa} - \bar{X}_{wa})(Y_{qa} - \bar{Y}_{a})}{(Q - 1)\, S_{X_{wa}}\, S_{Y_{a}}}    (1)

In Equation 1, Q is the number of meeting fragments and X_{wqa} is the value that worker w assigned to option a of fragment q: X_{wqa} is 1 if worker w selected that option, and 0 otherwise. \bar{X}_{wa} and S_{X_{wa}} are the expected value and standard deviation of the variable X_{wqa}, respectively. Y_{qa} is the average value that all the other workers assign to option a of fragment q, and \bar{Y}_{a} and S_{Y_{a}} are the expected value and standard deviation of the variable Y_{qa}.

The value of r_w computed above is used as a weight for computing RV_{qa}, the relevance value of option a of each fragment q, according to Eq. 2:

    RV_{qa} = \frac{\sum_{w=1}^{W} r_w X_{wqa}}{\sum_{w=1}^{W} r_w}    (2)

For HIT designs with two options, RV_{qa} is the relevance value of each answer list a. However, for the four-option HIT designs, RV_{ql} for each answer list l is reformulated as in Eq. 3:

    RV_{ql} = RV_{ql} + \frac{RV_{qb}}{2} - \frac{RV_{qn}}{2}    (3)

In this equation, half of the relevance value RV_{qb} of the case in which both lists are relevant is added as a reward, and half of the relevance value RV_{qn} of the case in which both lists are irrelevant is subtracted as a penalty from the relevance value RV_{ql} of each answer list.
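The two steps above map directly onto a few lines of NumPy. The sketch below is not the authors' implementation: it assumes the judgments are stored as a binary array X of shape (W, Q, A) (workers, fragments, options), computes Eq. 1 with the standard sample Pearson correlation (np.corrcoef), and clips negative correlations to zero before using them as weights in Eq. 2, which is an extra assumption not stated in the paper.

```python
import numpy as np

def worker_reliability(X):
    """Eq. 1: correlate each worker's binary judgments with the average of
    the judgments of all the other workers (leave-one-out)."""
    W = X.shape[0]
    r = np.empty(W)
    for w in range(W):
        others = np.delete(X, w, axis=0).mean(axis=0)      # Y_qa, shape (Q, A)
        r[w] = np.corrcoef(X[w].ravel(), others.ravel())[0, 1]
    return np.clip(r, 0.0, None)   # assumption: negative correlations get zero weight

def relevance_values(X, r):
    """Eq. 2: reliability-weighted average of the workers' judgments, shape (Q, A)."""
    return np.tensordot(r, X, axes=1) / r.sum()

def adjust_four_options(RV, idx_list1=0, idx_list2=1, idx_both=2, idx_none=3):
    """Eq. 3: for 4-choice HITs, reward 'both relevant' and penalize 'both
    irrelevant' by half of their relevance value. Returns shape (Q, 2)."""
    bonus = RV[:, idx_both] / 2 - RV[:, idx_none] / 2
    return np.stack([RV[:, idx_list1] + bonus,
                     RV[:, idx_list2] + bonus], axis=1)
```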
5.2 Estimating Task Difficulty

In a second step, PCC-H considers the task difficulty of each fragment of the meeting. The goal is to reduce the effect of fragments for which there is uncertainty in the workers' judgments, e.g. because there are no relevant search results in Wikipedia for the current fragment. To lessen the effect of this uncertainty, the entropy of the answers for each fragment of the meeting is computed, and a function of it is used as a weight for the fragment when computing the percentage of relevance value PRV. Entropy, weight and PRV are defined in Eqs. 4-6, where A is the number of options, and H_q and W_q are the entropy and weight of fragment q:

    H_q = - \sum_{a=1}^{A} RV_{qa} \log(RV_{qa})    (4)

    W_q = 1 - H_q    (5)

    PRV_a = \frac{\sum_{q=1}^{Q} W_q\, RV_{qa}}{\sum_{q=1}^{Q} W_q}    (6)
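Continuing the sketch above, the entropy weighting of Eqs. 4-6 can be written as follows. The paper does not specify the base of the logarithm, so this sketch divides by log A to keep H_q, and hence W_q, within [0, 1]; treat that as an assumption.

```python
import numpy as np

def fragment_weights(RV_options, eps=1e-12):
    """Eqs. 4-5: W_q = 1 - H_q, where H_q is the entropy of the answer
    distribution of fragment q (the rows of the Eq. 2 output sum to 1)."""
    A = RV_options.shape[1]
    H = -(RV_options * np.log(RV_options + eps)).sum(axis=1) / np.log(A)
    return 1.0 - H

def prv(RV_lists, Wq):
    """Eq. 6: difficulty-weighted average of the relevance values of each
    answer list; multiply by 100 to obtain percentage scores."""
    return (Wq[:, None] * RV_lists).sum(axis=0) / Wq.sum()
```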
6. RESULTS OF THE EXPERIMENTS

Two sets of experiments were performed. First, we attempt to validate the PCC-H method. Then, we apply the PCC-H method to compute the PRV of each answer list, in order to conclude which version of the system outperforms the others.

To make an initial validation of the workers' judgments, we compare the judgments of individual workers with those of an expert. For each worker, the number of fragments for which the answer is the same as the expert's answer is counted, and the total is divided by the number of fragments to compute accuracy. We then compare this value with r_w, which is estimated as the reliability measurement of each worker's judgments. The percentage of agreement of each worker with the expert, e_w, and the r_w of each worker for one of the batches are shown in Table 1, with an overall agreement between these two values for each worker. In other words, workers who agree more with our expert also have higher inter-rater agreement with the other workers. Since in the general case there is no ground truth (expert) to verify the workers' judgments, we rely on inter-rater agreement for the other experiments.

Table 1: Percentage of agreement between a single worker and the expert (e_w), and between a single worker and the other workers (r_w), for the KW system and 4-choice HITs

    Worker #   e_w    r_w
    1          0.66   0.81
    2          0.54   0.65
    3          0.54   0.64
    4          0.50   0.71
    5          0.50   0.60
    6          0.50   0.35
    7          0.41   0.24
    8          0.39   0.33
    9          0.36   0.34
    10         0.31   0.12

Firstly, equal weights for all the worker evaluations and all the fragments are used to compute the PRVs of the two answer lists in our experiments, which are shown in Table 2. In this approach, it is assumed that all the workers are reliable and all the fragments share the same difficulty.

Table 2: PRVs for the AW-vs-UI and KW-vs-UI pairs (all workers and fragments with equal weights)

    Pair        Score     2-choice HITs   4-choice HITs
    AW-vs-UI    PRV_AW    30%             26%
                PRV_UI    70%             74%
    KW-vs-UI    PRV_KW    45%             35%
                PRV_UI    55%             65%

To handle worker reliability, we consider workers with lower r_w as outliers. One approach is to remove all the outliers: for instance, the four workers with the lowest r_w are considered outliers and are deleted, and the same weight is given to the remaining six workers. The result of the comparative evaluation based on removing outliers is shown in Table 3.

Table 3: PRVs for the AW-vs-UI and KW-vs-UI pairs (six workers and fragments with equal weights)

    Pair        Score     2-choice HITs   4-choice HITs
    AW-vs-UI    PRV_AW    24%             13%
                PRV_UI    76%             86%
    KW-vs-UI    PRV_KW    46%             33%
                PRV_UI    54%             67%

In the computation above, an arbitrary border between outliers and the other workers was defined as a decision boundary for removing outliers. However, instead of deleting workers with lower r_w, who might still have potentially useful insights on relevance, it is more rational to weight all the workers' judgments by a confidence value. The PRVs of each answer list in the four experiments, obtained by assigning the weight r_w to each worker's evaluation and equal weights to all meeting fragments, are shown in Table 4.

Table 4: PRVs for the AW-vs-UI and KW-vs-UI pairs (all workers with different weights and fragments with equal weights)

    Pair        Score     2-choice HITs   4-choice HITs
    AW-vs-UI    PRV_AW    24%             18%
                PRV_UI    76%             82%
    KW-vs-UI    PRV_KW    33%             34%
                PRV_UI    67%             66%

To show that our method is stable across HIT designs, we used two different HIT designs for each pair, as mentioned in Section 4, and we expect PRV to converge to the same value for each pair regardless of the design. As observed in Table 4, the PRVs of the AW-vs-UI pair are not quite similar for the two HIT designs, although the answer lists are the same. In fact, we observed that in several cases there was no strong agreement among workers on which answer list is more relevant to the meeting fragment, and we consider these to be "difficult" fragments. Since the source of this uncertainty is undefined, we can reduce the effect of such a fragment on the comparison by weighting each fragment in proportion to the difficulty of assigning RV_{ql}. The PRV values thus obtained for all experiments are reported in Table 5. As shown there, the PRVs of the AW-vs-UI pair are now very similar for the 2-choice and 4-choice tasks. Moreover, the difference between the system versions is emphasized, which indicates that the sensitivity of the comparison method has increased.

Table 5: PRVs for the AW-vs-UI and KW-vs-UI pairs (all workers and fragments with different weights, i.e. the PCC-H method)

    Pair        Score     2-choice HITs   4-choice HITs
    AW-vs-UI    PRV_AW    19%             15%
                PRV_UI    81%             85%
    KW-vs-UI    PRV_KW    23%             26%
                PRV_UI    77%             74%
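To make the progression from Table 2 to Table 5 concrete, the toy run below contrasts the equal-weights aggregation with the full PCC-H weighting, reusing the helper functions sketched in Section 5. The judgments are synthetic (a latent preferred list plus 30% random noise) and the printed numbers are illustrative only, not the paper's results.

```python
import numpy as np

rng = np.random.default_rng(0)
W, Q, A = 10, 24, 4                                   # workers, fragments, options
true_best = rng.integers(0, 2, size=Q)                # latent better list per fragment
noisy = rng.random((W, Q)) < 0.3                      # 30% of judgments are random
choices = np.where(noisy, rng.integers(0, A, size=(W, Q)), true_best)
X = np.eye(A)[choices]                                # one-hot judgments, shape (W, Q, A)

# Equal weights for workers and fragments (as in Table 2).
RV_eq = X.mean(axis=0)
lists_eq = adjust_four_options(RV_eq)
prv_eq = lists_eq.mean(axis=0)

# Full PCC-H: reliability weights (Eqs. 1-2) and entropy-based fragment weights (Eqs. 4-6).
r = worker_reliability(X)
RV = relevance_values(X, r)
Wq = fragment_weights(RV)
prv_pcch = prv(adjust_four_options(RV), Wq)

print("equal weights:", np.round(100 * prv_eq / prv_eq.sum(), 1))
print("PCC-H        :", np.round(100 * prv_pcch / prv_pcch.sum(), 1))
```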
We also compare the PCC-H method with the majority voting method and with the GLAD method (Generative model of Labels, Abilities, and Difficulties [19]), which estimates comparative relevance values by considering task difficulty and worker reliability parameters. We run the GLAD algorithm with the same initial values for all four experiments. The PRVs computed by majority voting, GLAD and PCC-H are shown in Table 6.

Table 6: PRVs computed by the majority voting, GLAD, and PCC-H methods (values listed as majority voting / GLAD / PCC-H)

    Pair        Score     2-choice HITs      4-choice HITs
    AW-vs-UI    PRV_AW    30% / 23% / 19%    26% / 13% / 15%
                PRV_UI    70% / 77% / 81%    74% / 87% / 85%
    KW-vs-UI    PRV_KW    45% / 47% / 23%    35% / 23% / 26%
                PRV_UI    55% / 53% / 77%    65% / 77% / 74%

As shown in Table 6, the PRVs computed by the PCC-H method for both HIT designs are very close to those of GLAD for the 4-choice HIT design. Moreover, the PRV values obtained by the PCC-H method for the two different HIT designs are very similar, which is less the case for majority voting and GLAD. This means that the PCC-H method is able to compute the PRVs independently of the exact HIT design. The PRV values calculated using PCC-H are also more robust, since the proposed method does not depend on initialization values, as GLAD does. Therefore, using PCC-H for measuring the reliability of the workers' judgments is also an appropriate method for qualification control of workers on crowdsourcing platforms.

The proposed method is also applied to the comparative evaluation of the SS-vs-KW search results (semantic search vs. keyword-based search). The PRVs calculated by the three methods are shown in Table 7: majority voting (which considers all workers and fragments with the same weight), GLAD, and PCC-H. According to all three scores, the SS version outperforms the KW version.

Table 7: PRVs for SS-vs-KW, 4-choice HITs (values listed as majority voting / GLAD / PCC-H)

    Pair        Score     4-choice HITs
    SS-vs-KW    PRV_SS    88% / 88% / 93%
                PRV_KW    12% / 12% / 7%

7. CONCLUSION AND PERSPECTIVES

In all the evaluation steps, the UI system appeared to produce more relevant recommendations than AW or KW. This means that using UI, i.e. letting users ask explicit queries in the conversation, improves over the AW and KW versions, i.e. over spontaneous recommendations. Using KW instead of AW improved PRV by 10 percent. Nevertheless, KW can still be used alongside the UI version as an assistant that suggests documents based on the context of the meeting, that is, spontaneous recommendations can be made when no user initiates a search. Moreover, the SS version works better than the KW version, which shows the advantage of semantic search.

As for the evaluation method, PCC-H outperformed the GLAD method proposed earlier for estimating task difficulty and worker reliability in the absence of ground truth. Based on the evaluation results, the PCC-H method is suitable for qualification control of AMT workers or judgments, because it provides a more stable PRV score across different HIT designs. Moreover, PCC-H does not require any initialization.

The comparative nature of PCC-H imposes some restrictions on the evaluations that can be carried out. For instance, if N versions must be compared, this calls in theory for N(N-1)/2 comparisons, which is clearly impractical when N grows. This can be mitigated if a priori knowledge about the quality of the systems is available, to avoid redundant comparisons. Moreover, the approach proposed in [14] to reduce the number of pairwise comparisons required from human raters could be ported to our context.
For progress evaluation, a new version must be compared with the best performing previous version, looking for a measurable improvement, in which case PCC-H fully answers the evaluation needs.

There are instances in which the search results of both versions are irrelevant. The goal of future work will be to reduce the number of such uncertain instances, to deal with ambiguous questions, and to improve the processing of user-directed queries by recognizing the context of the conversation. Another experiment should improve the design of the simulated user queries, in order to make them more realistic.

8. ACKNOWLEDGMENTS

The authors are grateful to the Swiss National Science Foundation for its financial support under the IM2 NCCR on Interactive Multimodal Information Management (see www.im2.ch).

9. REFERENCES

[1] J. Allan, B. Croft, A. Moffat, and M. Sanderson. Frontiers, challenges and opportunities for information retrieval: Report from SWIRL 2012. SIGIR Forum, 46(1):2–32, 2012.
[2] O. Alonso and R. A. Baeza-Yates. Design and implementation of relevance assessments using crowdsourcing. In Proceedings of the European Conference on Information Retrieval (ECIR), pages 153–164, 2011.
[3] O. Alonso and M. Lease. Crowdsourcing 101: Putting the "wisdom of the crowd" to work for you. WSDM Tutorial, 2011.
[4] O. Alonso, D. Rose, and B. Stewart. Crowdsourcing for relevance evaluation. SIGIR Forum, 42:9–15, 2008.
[5] J. Carletta. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22:249–254, 1996.
[6] J. Carletta. Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus. Language Resources and Evaluation Journal, 41(2):181–190, 2007.
[7] G. Chittaranjan, O. Aran, and D. Gatica-Perez. Exploiting observers' judgments for nonverbal group interaction analysis. In Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition (FG), 2011.
[8] P. N. Garner, J. Dines, T. Hain, A. El Hannani, M. Karafiat, D. Korchagin, M. Lincoln, V. Wan, and L. Zhang. Real-time ASR from meetings. In Proceedings of Interspeech, pages 2119–2122, 2009.
[9] C. Grady and M. Lease. Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of the NAACL-HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 172–179, 2010.
[10] D. R. Karger, S. Oh, and D. Shah. Budget-optimal crowdsourcing using low-rank matrix approximations. In Proceedings of the Allerton Conference on Communication, Control and Computing, 2011.
[11] G. Kazai. In search of quality in crowdsourcing for search engine evaluation. In Proceedings of the European Conference on Information Retrieval (ECIR), pages 165–176, 2011.
[12] F. K. Khattak and A. Salleb-Aouissi. Quality control of crowd labeling through expert evaluation. In Proceedings of the NIPS 2nd Workshop on Computational Social Science and the Wisdom of Crowds, 2011.
[13] J. Le, A. Edmonds, V. Hester, and L. Biewald. Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution. In Proceedings of the SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation, pages 17–20, 2010.
[14] X. Llorà, K. Sastry, D. E. Goldberg, A. Gupta, and L. Lakshmi. Combating user fatigue in iGAs: Partial ordering, support vector machines, and synthetic fitness. In Proceedings of the Conference on Genetic and Evolutionary Computation (GECCO '05), pages 1363–1370, 2005.
[15] A. Popescu-Belis, E. Boertjes, J. Kilgour, P. Poller, S. Castronovo, T. Wilson, A. Jaimes, and J. Carletta. The AMIDA automatic content linking device: Just-in-time document retrieval in meetings. In Proceedings of Machine Learning for Multimodal Interaction (MLMI), pages 272–283, 2008.
[16] A. Popescu-Belis, M. Yazdani, A. Nanchen, and P. Garner. A speech-based just-in-time retrieval system using semantic search. In Proceedings of the 49th Annual Meeting of the ACL, pages 80–85, 2011.
[17] P. Smyth, U. M. Fayyad, M. C. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labeling of Venus images. In Advances in Neural Information Processing Systems (NIPS), pages 1085–1092, 1994.
[18] P. Thomas and D. Hawking. Evaluation by comparing result sets in context. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM), pages 94–101, 2006.
[19] J. Whitehill, P. Ruvolo, T.-F. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems (NIPS), pages 2035–2043, 2009.