A human-ML collaboration framework for improving video content reviews

Meghana Deodhar, Xiao Ma, Yixin Cai, Alex Koes, Alex Beutel and Jilin Chen
Google, USA

Abstract
We deal with the problem of localized in-video taxonomic human annotation in the video content moderation domain, where the goal is to identify video segments that violate granular policies, e.g., community guidelines on an online video platform. High quality human labeling is critical for enforcement in content moderation. This is challenging due to the problem of information overload - raters need to apply a large taxonomy of granular policy violations with ambiguous definitions to relatively long videos within a limited review duration. Our key contribution is a novel human-machine learning (ML) collaboration framework aimed at maximizing the quality and efficiency of human decisions in this setting - human labels are used to train segment-level models, the predictions of which are displayed as "hints" to human raters, indicating probable regions of the video with specific policy violations. The human verified/corrected segment labels can help refine the model further, hence creating a human-ML positive feedback loop. Experiments show improved human video moderation decision quality, and improved efficiency through more granular annotations submitted within a similar review duration, which enable a 5-8% AUC improvement in the hint generation models.

Keywords
human computation, machine learning, video content moderation, ranking

CIKM'22: Human-in-the-Loop Data Curation Workshop at the 31st ACM International Conference on Information and Knowledge Management, October 17–22, 2022, Atlanta, GA
email: mdeodhar@google.com (M. Deodhar); xmaa@google.com (X. Ma); yixincai@google.com (Y. Cai); koes@google.com (A. Koes); alexbeutel@google.com (A. Beutel); jilinc@google.com (J. Chen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The importance of content moderation on online video platforms such as TikTok, YouTube or Instagram is growing [1, 2]. These platforms strive to accurately detect the presence of policy violations within the video, which drive enforcement actions, e.g., the video can be taken down. Given the complexity of this problem, content moderation relies heavily on human judgement and employs large teams of content moderators to perform reviews. Since human annotations directly lead to high stakes decisions, such as content take downs, the quality of the annotations is critical.

For content moderation decisions there is a growing need for transparency in detected policy violations to provide feedback to content creators [3]. This motivates an in-video taxonomic annotation task, where the goal is to provide localized and fine-grained policy-specific annotations, i.e., both the time regions (video segments) and the exact policies violated, which inform downstream video content moderation decisions. To cover the spectrum of potential violations, policy playbooks typically contain hundreds of fine-grained policies. This large space of policy violations can be organized as a taxonomy of broad categories such as Profanity, Violence, Nudity, etc., each of which contains several granular violations. For instance, Violence could include a range of granular classes such as animal abuse or graphic violence in video games. The class definitions are complex, ambiguous and often require nuanced judgment to apply, e.g., graphic violence. New policy classes may be added over time as well, e.g., Covid anti-vaccination. Moreover, there is a class imbalance issue - some egregious violations may be very rare.

Our goal is to maximize the quality and efficiency of this complex, granular, localized policy annotation task, hence leading to the correct video level enforcement decision. We achieve this by tackling the key issue of "information overload" faced by raters in providing high quality annotations, where 1) the sheer volume of videos on large online video platforms means raters only have limited review time per video; and 2) the large taxonomy of policies makes it hard for raters to recall the complete set of all granular violations for every video region they watch in the limited review duration.

We propose a human-ML collaboration framework to maximize human ratings quality and efficiency by addressing "information overload". We train models on granular rater annotations to predict policy violations, which are then combined with innovative front-end elements in the rating tool to provide "hints" to assist raters. We borrow from the information retrieval literature and use ranking mechanisms for identifying the most useful and succinct set of hints. In experiments, we show that this enables raters to efficiently label policy violations more correctly and comprehensively. The human interactions with model hints pave the way for leveraging human feedback to improve the underlying ML models.
2. Related Work

The crowdsourcing literature is very rich in the application of human annotations to perform a variety of tasks such as text processing [4, 5, 6], audio transcription [7], taxonomy creation [8], and social media analysis [9, 10]. Although there is existing literature on video annotation, it is primarily focused on identifying actions or labeling entities easily distinguishable by humans using visual information only, e.g., high jump, thunderstorms. The primary goal of these tasks is to create large datasets for facilitating Machine Learning/Perception applications [11, 12], e.g., the YouTube-8m dataset [13]. This is very different from our set up, where raters annotate granular, ambiguously defined policies using multi-modal signals - video, audio and text, from the transcript, and the video title and description. The recent emergence of crowdsourcing literature on content moderation primarily covers textual content such as user comments [14, 15]; however, there has been little focus on video moderation tasks.

Human-ML collaboration is an emerging area of research with two main categories of work:

ML-Assisted Human Labeling: ML-assistance through predictions and explanations has been used to improve the quality of human decisions in several domains [16, 17, 18], including content moderation [15, 14]. Crossmod [15], for instance, uses a model trained on historic cross-community moderation decisions to enable Reddit human moderators to find more violations. Interactive ML-assistance, which we leverage in Section 3.1.2, is used by Bartolo et al. [19] to assist human annotators in developing adversarial examples for improving a natural language question answering model. ML-assistance has also been shown to improve the efficiency of the human labeling task [20, 15, 21]. Our work aims to exploit both the human labeling quality and efficiency benefits.

Improving ML-Models Through Human Annotations: Human annotations are useful in constructing hybrid human-ML systems that leverage the complementary strengths of both to improve the performance of ML models [22, 23]. Existing work on active learning [24, 25] shows that strategically sampling data points can reduce human workload, but the purpose is to improve machine learning models instead of assisting raters. Recent work on explainable active learning (XAL) [26] has called for better designing for the human experience in the human-AI interface.

Our work shows that it is possible to achieve model improvements and assist human raters, bridging the gap between ML-assisted human labeling and active learning. The novelty of our approach is that the models are re-trained continuously on the output of the human annotation task, which they provide assistance for, constructing a positive feedback loop between humans and models. In this collaborative framework, we have the opportunity to improve both modeling and human rater performance.

Information overload, which we encounter in our content moderation setting, is a well studied problem that reduces the effectiveness of a human's decision making ability [27, 28]. To address this, we build on the intuition that humans find it easier to verify or correct suggestions rather than produce new annotations from scratch [29, 5]. Our ML-assistance proposal strives to select the most informative but succinct ML-based "hints" to surface to raters by drawing on the information retrieval and ranking literature. ML-based ranking has been shown to reduce information overload effectively in electronic messaging [30] and social media [31, 32]. We draw inspiration from the learning to rank idea [33] to reduce information overload for raters.
3. Proposed Human-ML Collaboration Framework

Figure 1: Human-ML collaboration set up.

The main contribution of this paper is the human-ML collaboration framework visualized in Figure 1. We use the predictions of ML models to provide assistance to human raters and evaluate the effectiveness of different user interfaces for the ML-assistance. Since the policy violation prediction task is hard for ML models, the feedback from human raters is useful to improve the models. ML-assistance enables raters to provide segment level annotations more efficiently leading to more ground truth to train/update the ML models. Additionally, we can enable raters to interact with the ML hints (accept/reject), providing direct feedback to refine the model, establishing a positive human-ML feedback loop.

3.1. ML-Assisted Human Reviews

As discussed in the introduction, raters face an information overload problem due to a combination of granular, complex policy definitions and limited time per video. Often, raters need to use their judgement to decide what parts of the video to watch to identify violating segments, resulting in fleeting violations being missed, or inconsistencies across raters watching different sections of the same video.

It is intuitive that raters would benefit from pointers to likely unsafe regions within a video, labeled with the exact policies being violated. Even if not completely precise, this will enable them to optimize their review bandwidth by focusing on potentially more relevant regions, making them less likely to miss violations. We achieve this by training per policy ML models and transforming their predictions into "hints" provided to raters, described in more detail in Section 3.1.2. We tune the models to be high recall to minimize uncaught violations, while relying on human judgement to improve the precision of the labeled violations.

Figure 2: V1 Continuous Line Graph for Specific Policy Hints

Figure 3: V2 Pre-Populated Segments from ML Models

3.1.1. Segment Level Model Training

To train segment level policy violation models we frame the following modeling problem - given multi-modal features for a fixed length video segment, predict whether the segment contains specific policy violations. We generate training datasets by extracting per-frame visual and audio features from the human labeled violation region. The visual and audio features are the dense embedding layers of standard convolutional network image semantic similarity [34] and audio classification models [35] respectively. Based on empirical evidence, we select flat concatenation to aggregate the frame features over a segment, versus average/max pooling. The final model we train is a multi-label DNN model with the aggregated frame-level visual and audio features as input, where each label corresponds to a fine grained policy violation. We use MultiModal Versatile Networks [36] during model training to learn a better representation for audio and visual features for our classification task, which further improves model performance. We use a sliding window approach to utilize the trained model to generate prediction scores per policy violation for a fixed length segment starting at each frame of the video. Using a window of frame size n and stride of 1 frame, we produce model scores for segments with start and end frames [0 to n-1], [1 to n], and so on until the end of the video, padding with empty features to fit the segment length for the last n frames.
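To make the sliding-window scoring concrete, here is a minimal sketch of how per-frame embeddings could be windowed, zero-padded at the tail, flat-concatenated, and scored by a multi-label segment model; the array shapes, the model callable, and the function names are illustrative assumptions, not the production implementation.

```python
import numpy as np

def window_features(frame_feats: np.ndarray, n: int) -> np.ndarray:
    """Flat-concatenate features for every length-n window (stride of 1 frame).

    frame_feats: (num_frames, feat_dim) per-frame visual+audio embeddings.
    Returns: (num_frames, n * feat_dim), one row per window start frame;
    windows that run past the end of the video are padded with zeros.
    """
    num_frames, feat_dim = frame_feats.shape
    padded = np.vstack([frame_feats, np.zeros((n - 1, feat_dim))])
    return np.stack([padded[i:i + n].reshape(-1) for i in range(num_frames)])

def score_video(frame_feats: np.ndarray, model, n: int) -> np.ndarray:
    """Per-policy score curves of shape (num_frames, num_policies): row i holds
    the multi-label model's scores for the segment starting at frame i."""
    return model(window_features(frame_feats, n))
```

The resulting per-policy score curves are the raw material for both V1 (plotted directly as line graphs) and V2 (binarized into candidate segments), as described next.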
3.1.2. Techniques to Provide ML-Assistance

We proceed to develop ways to use model predictions to most effectively assist human reviews, and provide details on two different designs (V1 and V2).

V1 Hints: Continuous Line Graphs. For video annotations, it is standard to display the video itself with playback controls and additional information in the form of a timeline [37]. For V1, we display the ML predictions as a line graph across the entire timeline of the video. The user interface is demonstrated in Figure 2. Raters can examine the line graphs and jump to the point in the video where a peak (policy violation) occurs. While we have model predictions for hundreds of granular policies, due to visual clutter, we only display plots for a small subset of the most frequent policies. Raters also don't have the ability to provide feedback to improve model predictions.

V2 Hints: Towards a Scalable and Interactive ML-Assistance UI. In V2, we borrow elements from recommender systems [38, 39] to develop a more scalable and interactive interface (see Figure 3), where we pre-populate video segments that may contain policy violations detected by machine learning models.

To generate video segments, as shown in Figure 3, we introduce an algorithm to binarize continuous model scores per policy into discrete segments and use a ranking algorithm to recommend the most useful segments to raters. For simplicity, we used a threshold-based algorithm. The threshold selection constitutes a tradeoff between precision and recall, where precision captures the utility of the predicted segments to raters and recall captures the comprehensive coverage of all video violations. The algorithm chooses a threshold maximizing recall, while maintaining a minimum precision (40% based on user studies). The regions of the video where the ML scores are above this chosen threshold are displayed as predicted policy violating segments to raters. Several heuristics are then applied to maximize segment quality, e.g., we merge segments that are close together (<3% of the whole video length apart) to avoid visual clutter.
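The sketch below illustrates the threshold selection and segment extraction just described, using the 40% minimum precision floor and the <3%-of-video merge heuristic from the text; the function names, the labeled validation set, and the data layout are our assumptions rather than the paper's implementation.

```python
import numpy as np

def pick_threshold(val_scores: np.ndarray, val_labels: np.ndarray,
                   min_precision: float = 0.40) -> float:
    """Pick the score threshold maximizing recall subject to a precision floor,
    estimated on a labeled validation set of segment scores."""
    val_labels = val_labels.astype(bool)
    best_t, best_recall = np.inf, -1.0  # np.inf => emit no segments if nothing qualifies
    for t in np.unique(val_scores):
        pred = val_scores >= t
        tp = np.sum(pred & val_labels)
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(val_labels.sum(), 1)
        if precision >= min_precision and recall > best_recall:
            best_t, best_recall = t, recall
    return best_t

def extract_segments(score_curve: np.ndarray, threshold: float,
                     merge_gap_frac: float = 0.03) -> list:
    """Binarize one policy's per-frame score curve into [start, end) frame spans
    above the threshold, merging spans closer than merge_gap_frac of the video."""
    above = score_curve >= threshold
    spans, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append([start, i])
            start = None
    if start is not None:
        spans.append([start, len(above)])
    merged, max_gap = [], merge_gap_frac * len(score_curve)
    for span in spans:
        if merged and span[0] - merged[-1][1] < max_gap:
            merged[-1][1] = span[1]  # merge nearby spans to reduce visual clutter
        else:
            merged.append(span)
    return merged
```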
Finally, to reduce information overload, we borrow from the learning to rank concept [33] to rank candidate segments and limit the number of displayed segments. The ranking algorithm prioritizes segments based on the max score across segment frames, and egregiousness of the predicted policies. We then select the top N segments to display to raters, with N selected through user studies. We pre-populate each ML suggested segment in the video timeline UI as seen in Figure 3. Raters can choose to accept or reject the suggested segments. These logged interactions can be used to provide feedback to improve the ML models.
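A minimal sketch of this ranking step follows, using one plausible ordering of the two criteria above (egregiousness first, then max score); each candidate segment is assumed to carry its policy, frame span, and maximum score, and the egregiousness weights and default top_n are placeholders standing in for the values tuned through user studies.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class CandidateSegment:
    policy: str
    start_frame: int
    end_frame: int
    max_score: float  # maximum model score across the segment's frames

def rank_segments(candidates: List[CandidateSegment],
                  egregiousness: Dict[str, float],
                  top_n: int = 5) -> List[CandidateSegment]:
    """Order candidates by policy egregiousness, then by max score, and keep
    only the top_n segments to limit the number of hints shown to a rater."""
    def key(seg: CandidateSegment) -> Tuple[float, float]:
        return (egregiousness.get(seg.policy, 0.0), seg.max_score)
    return sorted(candidates, key=key, reverse=True)[:top_n]
```

Sorting on the (egregiousness, max score) pair keeps the most severe probable violations at the top of the pre-populated timeline even when a milder policy happens to score higher.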
3.2. Human Feedback to ML-Models

The ambiguous and fluid policy definitions, along with a changing distribution of videos on online platforms, pose a challenge for building robust models to accurately predict policy violations for providing ML-assistance. We hence continuously need more data and human feedback to improve the models. We show that ML-assistance increases the total number of segment labels submitted vs. no assistance. The labels in turn can serve as new ground truth for continuously re-training the models and improving performance. Additionally, with the reject button shown in V2 we collect clean negative labels; earlier, we had only "weak" negatives from videos where no violations were annotated¹. Further exploiting the human feedback in combination with active learning strategies is an area of future work.

¹ Even if no violations were annotated, they could still be present in video segments the rater did not watch, hence the negative labels inferred are weak/noisy.

4. Experiments

Our proposed methodology is evaluated using raters from the authors' organization that regularly perform video content moderation reviews on live queues. Raters are separated into two pools: experts and generalists, with 150 and 400 raters respectively. The expert pool is a group of more experienced quality assurance (QA) raters with demonstrably better decision quality over a long period of time. Since their focus is QA, they don't have fixed productivity targets as generalists do. They can hence spend more time reviewing each video comprehensively, leading to higher decision quality. In our experiment setup, each video in an evaluation dataset is independently reviewed by 1 expert and 2 generalist raters. We use the labels from expert raters on the dataset as the ground truth to evaluate the 2 sets of generalist rater decisions.

4.1. Experimental Setup

4.1.1. Datasets

Our two evaluation datasets contain videos viewed on a large online video platform: (1) "Live traffic" dataset: sampled from live traffic, hence containing a very low proportion of policy violating videos; (2) "Affected slice" dataset: sampled from live traffic and filtered to only videos with ML-hints present, containing 13-20% policy violating videos.

4.1.2. Metrics

Our human ratings quality metric is calculated at the video-level and conveys the correctness of the final, binary content moderation decision, e.g., take down or not. We compute the precision (P), recall (R), and disagreement rate for each of the 2 sets of generalist video-level decisions. We consider the expert decision as ground truth and report the averaged values across the 2 sets. On live traffic datasets, we use P/R over standard inter-rater disagreement metrics due to the high class imbalance.

For rater efficiency, we measure: (1) the percentage of policy violating videos where raters provide segment annotations; (2) the number of segment annotations submitted by raters per video; and (3) the average review duration per video.
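As a concrete reading of these metric definitions, the sketch below computes video-level precision, recall, and disagreement rate for one set of generalist decisions against the expert decisions, then averages over the two generalist sets; the dictionary-of-booleans data layout and helper names are assumptions made for illustration.

```python
from typing import Dict

def rater_quality(expert: Dict[str, bool], generalist: Dict[str, bool]) -> Dict[str, float]:
    """Video-level precision, recall, and disagreement rate of one generalist
    decision set, treating the expert's binary decision as ground truth."""
    tp = sum(1 for vid, y in expert.items() if y and generalist.get(vid, False))
    fp = sum(1 for vid, y in expert.items() if not y and generalist.get(vid, False))
    fn = sum(1 for vid, y in expert.items() if y and not generalist.get(vid, False))
    disagree = sum(1 for vid, y in expert.items() if generalist.get(vid, False) != y)
    return {
        "precision": tp / max(tp + fp, 1),
        "recall": tp / max(tp + fn, 1),
        "disagreement": disagree / max(len(expert), 1),
    }

def averaged_quality(expert: Dict[str, bool],
                     gen_a: Dict[str, bool],
                     gen_b: Dict[str, bool]) -> Dict[str, float]:
    """Average the metrics across the two generalist rater sets, as reported."""
    a, b = rater_quality(expert, gen_a), rater_quality(expert, gen_b)
    return {k: (a[k] + b[k]) / 2 for k in a}
```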
4.2. Results

4.2.1. Rater Quality Improvements

We conduct experiments on both "live traffic" and "affected slice" datasets, with the baseline as a review process without ML-hints. Tables 1 and 2 compare the ratings quality metrics of our proposed V1 (line plot of model scores) ML-assistance treatment relative to the baseline, and evaluate the incremental benefit of the V2 (pre-populated segments) treatment over V1, in the V1 + V2 vs. V1 row.

Table 1: Relative impact on live traffic (Dataset 1)

Treatment         Precision   Recall    # Videos
V1 vs. Baseline   +9.82%      +1.37%    3456
V1 + V2 vs. V1    +9.97%      +5.64%    2914

Table 2: Relative impact on affected slice (Dataset 2)

Treatment         Precision   Recall    Disagreement%   # Videos
V1 vs. Baseline   +4.24%      +3.64%    -15.71%         3319
V1 + V2 vs. V1    +7.02%      +14.30%   -32.27%         682

From the live traffic results in Table 1, we see that V1 shows an improvement in precision and recall over the baseline, driven by the improvement on the affected slice as seen in Table 2. We also see large rater quality gains of V2 over V1 on both live traffic and affected slice datasets. The segmentation and ranking algorithms in V2 allow us to overcome the scalability limitation of V1 and expand the number of granular policies covered by model hints from 7 to 18. Specifically for violence related violations, we see a 35% relative recall gain over V1 by expanding the policies with ML-hints from 2 to 9. The V2 design can be scaled to cover hundreds of policies in future versions by dynamically surfacing the most relevant violating segments, further improving recall.

4.2.2. Rater Efficiency Improvements

We observe reduced review duration on policy violating videos with V1 hints vs. without, with more efficiency benefits on longer videos as expected: -14% on videos longer than 10 minutes and -20% on videos longer than 30 minutes. With V2, relative to V1, we see a 3% increase in review duration, but it is offset by a 9% increase in the percentage of policy violating videos with exact segments annotated, and a 24% increase in the number of segment annotations submitted per video.

4.2.3. Interactive ML-Assistance Metrics

Isolating the precision of the ML-assisted segments, we see raters accepting 35% of ML generated hints, which is in line with the 40% precision constraint we chose when converting model scores into discrete segments.

4.2.4. Model Quality Improvements

Since the introduction of V1 hints, we see significant model performance improvements on specific policy areas, driven by the additional human labels collected within a 3 month period (Table 3).

Table 3: Model Quality Improvements

Policy Area           AUCPR    # Positive Labels
Sexually Suggestive   +5.9%    +12.7%
Nudity                +4.9%    +12.7%
Illegal Acts          +8.6%    +12.1%

5. Discussion

One of the potential risks of our proposed human-ML collaboration framework is automation bias [40], where a rater's over-reliance on ML-assistance can result in (i) blind-spots due to humans missing violations and (ii) raters accepting model hints without verification. Our video-level ratings quality evaluation metrics are robust to this since the ground truth comes from expert (QA) raters who review videos comprehensively, looking beyond ML-hints. In practice, we observe little evidence of either (i) or (ii): 56% of the violation segments submitted by raters in the V2 setup are organically created, i.e., they don't overlap with pre-populated hint segments, and the segment acceptance rate is 35%, aligned with our segmentation model precision tuning point of 40%, indicating that raters are verifying and rejecting false positive hints at the expected rate. We could mitigate the risk of (ii) further by enforcing that at least some percentage of hint segments per video is actually watched, or by surfacing the model's confidence in the predicted hint to raters. To ensure robust evaluation of model quality, the AUC improvements in Section 4.2.4 are evaluated on a set of labels collected without model generated segments.

This paper used content moderation as the test bed for our human-ML collaboration proposal. However, it is a more generalized framework that applies to the problem of granular, localized video annotation encountered in various other industry applications, such as identifying products/brands in videos to inform the placement of relevant ads, which we would like to explore further.

6. Future work

For content moderation to scale to the size of online platforms, it is necessary to take model-based enforcement action. We would like to explore the relation between improved ground truth and the improvement of automated, model based enforcement. Leveraging active learning strategies in combination with rater feedback on model generated segments to show further quality improvements in the models is another open area of research. Finally, we will explore multi-armed bandits to balance active learning based exploration for model improvement with model exploitation for providing high quality ML-assistance [41].
References

[1] K. Gogarty, Hate speech and misinformation proliferate on meta products, with 13,500 policy violations documented in the past year alone, https://www.mediamatters.org/facebook/hate-speech-and-misinformation-proliferate-meta-products-13500-policy-violations, 2022.
[2] S. Mellor, Tiktok slammed for videos sharing false information about russia's war on ukraine, https://fortune.com/2022/03/21/tiktok-misinformation-ukraine/, 2022.
[3] S. Jhaver, A. Bruckman, E. Gilbert, Does transparency in moderation really matter? User behavior after content removal explanations on reddit, Proc. ACM Hum.-Comput. Interact. 3 (2019). doi:10.1145/3359252.
[4] A. Kittur, E. H. Chi, B. Suh, Crowdsourcing user studies with mechanical turk, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 453–456. doi:10.1145/1357054.1357127.
[5] M. S. Bernstein, G. Little, R. C. Miller, B. Hartmann, M. S. Ackerman, D. R. Karger, D. Crowell, K. Panovich, Soylent: A word processor with a crowd inside, in: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, UIST '10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 313–322. doi:10.1145/1866029.1866078.
[6] C. Hu, B. B. Bederson, P. Resnik, Y. Kronrod, Monotrans2: A new human computation system to support monolingual translation, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 1133–1136. doi:10.1145/1978942.1979111.
[7] W. Lasecki, C. Miller, A. Sadilek, A. Abumoussa, D. Borrello, R. Kushalnagar, J. Bigham, Real-time captioning by groups of non-experts, Association for Computing Machinery, New York, NY, USA, 2012, pp. 23–34. doi:10.1145/2380116.2380122.
[8] L. B. Chilton, G. Little, D. Edge, D. S. Weld, J. A. Landay, Cascade: Crowdsourcing taxonomy creation, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2013, pp. 1999–2008.
[9] A. Zubiaga, M. Liakata, R. Procter, K. Bontcheva, P. Tolmie, Crowdsourcing the annotation of rumourous conversations in social media, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 347–353.
[10] A. M. Founta, C. Djouvas, D. Chatzakou, I. Leontiadis, J. Blackburn, G. Stringhini, A. Vakali, M. Sirivianos, N. Kourtellis, Large scale crowdsourcing and characterization of twitter abusive behavior, in: Twelfth International AAAI Conference on Web and Social Media, 2018.
[11] H. Zhao, A. Torralba, L. Torresani, Z. Yan, HACS: Human action clips and segments dataset for recognition and temporal localization, 2017. URL: https://arxiv.org/abs/1712.09374.
[12] M. Z. Trujillo, M. Gruppi, C. Buntain, B. D. Horne, The MeLa BitChute dataset, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 16, 2022, pp. 1342–1351.
[13] S. Abu-El-Haija, N. Kothari, J. Lee, A. P. Natsev, G. Toderici, B. Varadarajan, S. Vijayanarasimhan, YouTube-8M: A large-scale video classification benchmark, arXiv:1609.08675, 2016. URL: https://arxiv.org/pdf/1609.08675v1.pdf.
[14] V. Lai, S. Carton, R. Bhatnagar, Q. V. Liao, Y. Zhang, C. Tan, Human-AI collaboration via conditional delegation: A case study of content moderation, in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, CHI '22, Association for Computing Machinery, New York, NY, USA, 2022. doi:10.1145/3491102.3501999.
[15] E. Chandrasekharan, C. Gandhi, M. W. Mustelier, E. Gilbert, Crossmod: A cross-community learning-based system to assist reddit moderators, Proc. ACM Hum.-Comput. Interact. 3 (2019). doi:10.1145/3359276.
[16] E. Beede, E. Baylor, F. Hersch, A. Iurchenko, L. Wilcox, P. Ruamviboonsuk, L. M. Vardoulakis, A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1–12. URL: https://doi.org/10.1145/3313831.3376718.
[17] V. Lai, C. Tan, On human predictions with explanations and predictions of machine learning models: A case study on deception detection, in: Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 29–38. doi:10.1145/3287560.3287590.
[18] J. Park, R. Krishna, P. Khadpe, L. Fei-Fei, M. Bernstein, AI-based request augmentation to increase crowdsourcing participation, Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 7 (2019) 115–124. URL: https://ojs.aaai.org/index.php/HCOMP/article/view/5282.
[19] M. Bartolo, T. Thrush, S. Riedel, P. Stenetorp, R. Jia, D. Kiela, Models in the loop: Aiding crowdworkers with generative annotation assistants, CoRR abs/2112.09062 (2021). URL: https://arxiv.org/abs/2112.09062.
[20] Z. Ashktorab, M. Desmond, J. Andres, M. Muller, N. N. Joshi, M. Brachman, A. Sharma, K. Brimijoin, Q. Pan, C. T. Wolf, E. Duesterwald, C. Dugan, W. Geyer, D. Reimer, AI-assisted human labeling: Batching for efficiency without overreliance, Proc. ACM Hum.-Comput. Interact. 5 (2021). doi:10.1145/3449163.
[21] S. Anjum, A. Verma, B. Dang, D. Gurari, Exploring the use of deep learning with crowdsourcing to annotate images, Human Computation 8 (2021) 76–106. doi:10.15346/hc.v8i2.121.
[22] I. Arous, J. Yang, M. Khayati, P. Cudré-Mauroux, OpenCrowd: A human-AI collaborative approach for finding social influencers via open-ended answers aggregation, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1851–1862. URL: https://doi.org/10.1145/3366423.3380254.
[23] J. W. Vaughan, Making better use of the crowd: How crowdsourcing can advance machine learning research, Journal of Machine Learning Research 18 (2018) 1–46. URL: http://jmlr.org/papers/v18/17-234.html.
[24] B. Settles, Active Learning Literature Survey, Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
[25] Y. Yang, Z. Ma, F. Nie, X. Chang, A. G. Hauptmann, Multi-class active learning by uncertainty sampling with diversity maximization, International Journal of Computer Vision 113 (2015) 113–127.
[26] B. Ghai, Q. V. Liao, Y. Zhang, R. Bellamy, K. Mueller, Explainable active learning (XAL): An empirical study of how local explanations impact annotator experience, 2020. URL: https://arxiv.org/abs/2001.09219.
[27] M. J. Eppler, J. Mengis, The concept of information overload: A review of literature from organization science, accounting, marketing, MIS, and related disciplines (2004), Kommunikationsmanagement im Wandel (2008) 271–305.
[28] Y.-C. Chen, R.-A. Shang, C.-Y. Kao, The effects of information overload on consumers' subjective state towards buying decision in the internet shopping environment, Electronic Commerce Research and Applications 8 (2009) 48–58.
[29] H. Hu, L. Xie, Z. Du, R. Hong, Q. Tian, One-bit supervision for image classification, 2020. URL: https://arxiv.org/abs/2009.06168.
[30] R. M. Losee Jr, Minimizing information overload: The ranking of electronic messages, Journal of Information Science 15 (1989) 179–189.
[31] K. Koroleva, A. J. Bolufé Röhler, Reducing information overload: Design and evaluation of filtering & ranking algorithms for social networking sites, 2012.
[32] J. Chen, R. Nairn, L. Nelson, M. Bernstein, E. Chi, Short and tweet: Experiments on recommending content from information streams, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2010, pp. 1185–1194.
[33] A. Karatzoglou, L. Baltrunas, Y. Shi, Learning to rank for recommender systems, in: Proceedings of the 7th ACM Conference on Recommender Systems, 2013, pp. 493–494.
[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, 2015. URL: https://arxiv.org/abs/1512.00567.
[35] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, K. W. Wilson, CNN architectures for large-scale audio classification, CoRR abs/1609.09430 (2016). URL: http://arxiv.org/abs/1609.09430.
[36] J.-B. Alayrac, A. Recasens, R. Schneider, R. Arandjelović, J. Ramapuram, J. De Fauw, L. Smaira, S. Dieleman, A. Zisserman, Self-supervised multimodal versatile networks, 2020. URL: https://arxiv.org/abs/2006.16228.
[37] C. Vondrick, D. Ramanan, D. Patterson, Efficiently scaling up video annotation with crowdsourced marketplaces, in: European Conference on Computer Vision, Springer, 2010, pp. 610–623.
[38] I. Portugal, P. Alencar, D. Cowan, The use of machine learning algorithms in recommender systems: A systematic review, Expert Systems with Applications 97 (2018) 205–227.
[39] K. Swearingen, R. Sinha, Interaction design for recommender systems, in: Designing Interactive Systems, volume 6, Citeseer, 2002, pp. 312–334.
[40] L. J. Skitka, K. L. Mosier, M. Burdick, Does automation bias decision-making?, International Journal of Human-Computer Studies 51 (1999) 991–1006.
[41] J. McInerney, B. Lacker, S. Hansen, K. Higley, H. Bouchard, A. Gruson, R. Mehrotra, Explore, exploit, and explain: Personalizing explainable recommendations with bandits, in: Proceedings of the 12th ACM Conference on Recommender Systems, 2018, pp. 31–39.