A human-ML collaboration framework for improving video content reviews

Meghana Deodhar, Xiao Ma, Yixin Cai, Alex Koes, Alex Beutel and Jilin Chen
Google, USA

Abstract
We deal with the problem of localized in-video taxonomic human annotation in the video content moderation domain, where the goal is to identify video segments that violate granular policies, e.g., community guidelines on an online video platform. High quality human labeling is critical for enforcement in content moderation. This is challenging due to the problem of information overload - raters need to apply a large taxonomy of granular policy violations with ambiguous definitions to relatively long videos within a limited review duration. Our key contribution is a novel human-machine learning (ML) collaboration framework aimed at maximizing the quality and efficiency of human decisions in this setting - human labels are used to train segment-level models, the predictions of which are displayed as "hints" to human raters, indicating probable regions of the video with specific policy violations. The human verified/corrected segment labels can help refine the model further, hence creating a human-ML positive feedback loop. Experiments show improved human video moderation decision quality, and improved efficiency through more granular annotations submitted within a similar review duration, which enable a 5-8% AUC improvement in the hint generation models.

Keywords
human computation, machine learning, video content moderation, ranking

CIKM'22: Human-in-the-Loop Data Curation Workshop at the 31st ACM International Conference on Information and Knowledge Management, October 17–22, 2022, Atlanta, GA
email: mdeodhar@google.com (M. Deodhar); xmaa@google.com (X. Ma); yixincai@google.com (Y. Cai); koes@google.com (A. Koes); alexbeutel@google.com (A. Beutel); jilinc@google.com (J. Chen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The importance of content moderation on online video platforms such as TikTok, YouTube or Instagram is growing [1, 2]. These platforms strive to accurately detect the presence of policy violations within the video, which drive enforcement actions, e.g., the video can be taken down. Given the complexity of this problem, content moderation relies heavily on human judgement and employs large teams of content moderators to perform reviews. Since human annotations directly lead to high stakes decisions, such as content take downs, the quality of the annotations is critical.

For content moderation decisions there is a growing need for transparency in detected policy violations to provide feedback to content creators [3]. This motivates an in-video taxonomic annotation task, where the goal is to provide localized and fine-grained policy-specific annotations, i.e., both the time regions (video segments) and the exact policies violated, which inform downstream video content moderation decisions. To cover the spectrum of potential violations, policy playbooks typically contain hundreds of fine-grained policies. This large space of policy violations can be organized as a taxonomy of broad categories such as Profanity, Violence, Nudity, etc., each of which contains several granular violations. For instance, Violence could include a range of granular classes such as animal abuse or graphic violence in video games. The class definitions are complex, ambiguous and often require nuanced judgment to apply, e.g., graphic violence. New policy classes may be added over time as well, e.g., Covid anti-vaccination. Moreover, there is a class imbalance issue - some egregious violations may be very rare.

Our goal is to maximize the quality and efficiency of this complex, granular, localized policy annotation task, hence leading to the correct video level enforcement decision. We achieve this by tackling the key issue of "information overload" faced by raters in providing high quality annotations, where 1) the sheer volume of videos on large online video platforms means raters only have limited review time per video; and 2) the large taxonomy of policies makes it hard for raters to recall the complete set of all granular violations for every video region they watch in the limited review duration.

We propose a human-ML collaboration framework to maximize human ratings quality and efficiency by addressing "information overload". We train models on granular rater annotations to predict policy violations, which are then combined with innovative front-end elements in the rating tool to provide "hints" to assist raters. We borrow from the information retrieval literature and use ranking mechanisms for identifying the most useful and succinct set of hints. In experiments, we show that this enables raters to efficiently label policy violations more correctly and comprehensively. The human interactions with model hints pave the way for leveraging human feedback to improve the underlying ML models.
2. Related Work

The crowdsourcing literature is very rich in the application of human annotations to perform a variety of tasks such as text processing [4, 5, 6], audio transcription [7], taxonomy creation [8], and social media analysis [9, 10]. Although there is existing literature on video annotation, it is primarily focused on identifying actions or labeling entities easily distinguishable by humans using visual information only, e.g., high jump, thunderstorms. The primary goal of these tasks is to create large datasets for facilitating Machine Learning/Perception applications [11, 12], e.g., the YouTube-8m dataset [13]. This is very different from our set up, where raters annotate granular, ambiguously defined policies using multi-modal signals - video, audio and text, from the transcript, and the video title and description. The recent emergence of crowdsourcing literature on content moderation primarily covers textual content such as user comments [14, 15]; however, there has been little focus on video moderation tasks.

Human-ML collaboration is an emerging area of research with two main categories of work:

ML-Assisted Human Labeling: ML-assistance through predictions and explanations has been used to improve the quality of human decisions in several domains [16, 17, 18], including content moderation [15, 14]. Crossmod [15], for instance, uses a model trained on historic cross-community moderation decisions to enable Reddit human moderators to find more violations. Interactive ML-assistance, which we leverage in Section 3.1.2, is used by Bartolo et al. [19] to assist human annotators in developing adversarial examples for improving a natural language question answering model. ML-assistance has also been shown to improve the efficiency of the human labeling task [20, 15, 21]. Our work aims to exploit both the human labeling quality and efficiency benefits.

Improving ML-Models Through Human Annotations: Human annotations are useful in constructing hybrid human-ML systems that leverage the complementary strengths of both to improve the performance of ML models [22, 23]. Existing work on active learning [24, 25] shows that strategically sampling data points can reduce human workload, but the purpose is to improve machine learning models instead of assisting raters. Recent work on explainable active learning (XAL) [26] has called for better designing for the human experience in the human-AI interface.

Our work shows that it is possible to achieve model improvements and assist human raters, bridging the gap between ML-assisted human labeling and active learning. The novelty of our approach is that the models are re-trained continuously on the output of the human annotation task, which they provide assistance for, constructing a positive feedback loop between humans and models. In this collaborative framework, we have the opportunity to improve both modeling and human rater performance.

Information overload, which we encounter in our content moderation setting, is a well studied problem that reduces the effectiveness of a human's decision making ability [27, 28]. To address this, we build on the intuition that humans find it easier to verify or correct suggestions rather than produce new annotations from scratch [29, 5]. Our ML-assistance proposal strives to select the most informative but succinct ML-based "hints" to surface to raters by drawing on the information retrieval and ranking literature. ML-based ranking has been shown to reduce information overload effectively in electronic messaging [30] and social media [31, 32]. We draw inspiration from the learning to rank idea [33] to reduce information overload for raters.
3. Proposed Human-ML Collaboration Framework

Figure 1: Human-ML collaboration set up.

The main contribution of this paper is the human-ML collaboration framework visualized in Figure 1. We use the predictions of ML models to provide assistance to human raters and evaluate the effectiveness of different user interfaces for the ML-assistance. Since the policy violation prediction task is hard for ML models, the feedback from human raters is useful to improve the models. ML-assistance enables raters to provide segment level annotations more efficiently leading to more ground truth to train/update the ML models. Additionally, we can enable raters to interact with the ML hints (accept/reject), providing direct feedback to refine the model, establishing a positive human-ML feedback loop.

3.1. ML-Assisted Human Reviews

As discussed in the introduction, raters face an information overload problem due to a combination of granular, complex policy definitions and limited time per video. Often, raters need to use their judgement to decide what parts of the video to watch to identify violating segments, resulting in fleeting violations being missed, or inconsistencies across raters watching different sections of the same video.

It is intuitive that raters would benefit from pointers to likely unsafe regions within a video, labeled with the exact policies being violated. Even if not completely precise, this will enable them to optimize their review bandwidth by focusing on potentially more relevant regions, making them less likely to miss violations. We achieve this by training per policy ML models and transforming their predictions into "hints" provided to raters, described in more detail in Section 3.1.2. We tune the models to be high recall to minimize uncaught violations, while relying on human judgement to improve the precision of the labeled violations.

Figure 2: V1 Continuous Line Graph for Specific Policy Hints

Figure 3: V2 Pre-Populated Segments from ML Models

3.1.1. Segment Level Model Training

To train segment level policy violation models we frame the following modeling problem - given multi-modal features for a fixed length video segment, predict whether the segment contains specific policy violations. We generate training datasets by extracting per-frame visual and audio features from the human labeled violation region. The visual and audio features are the dense embedding layers of standard convolutional network image semantic similarity [34] and audio classification models [35] respectively. Based on empirical evidence, we select flat concatenation to aggregate the frame features over a segment, versus average/max pooling. The final model we train is a multi-label DNN model with the aggregated frame-level visual and audio features as input, where each label corresponds to a fine grained policy violation. We use MultiModal Versatile Networks [36] during model training to learn a better representation for audio and visual features for our classification task, which further improves model performance. We use a sliding window approach to utilize the trained model to generate prediction scores per policy violation for a fixed length segment starting at each frame of the video. Using a window of frame size n and stride of 1 frame, we produce model scores for segments with start and end frames [0 to n-1], [1 to n], and so on until the end of the video, padding with empty features to fit the segment length for the last n frames.
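To make the sliding-window scoring concrete, here is a minimal sketch of how per-frame embeddings could be windowed, zero-padded at the tail, flat-concatenated, and scored by a multi-label segment model; the array shapes, the model callable, and the function names are illustrative assumptions, not the production implementation.

```python
import numpy as np

def window_features(frame_feats: np.ndarray, n: int) -> np.ndarray:
    """Flat-concatenate features for every length-n window (stride of 1 frame).

    frame_feats: (num_frames, feat_dim) per-frame visual+audio embeddings.
    Returns: (num_frames, n * feat_dim), one row per window start frame;
    windows that run past the end of the video are padded with zeros.
    """
    num_frames, feat_dim = frame_feats.shape
    padded = np.vstack([frame_feats, np.zeros((n - 1, feat_dim))])
    return np.stack([padded[i:i + n].reshape(-1) for i in range(num_frames)])

def score_video(frame_feats: np.ndarray, model, n: int) -> np.ndarray:
    """Per-policy score curves of shape (num_frames, num_policies): row i holds
    the multi-label model's scores for the segment starting at frame i."""
    return model(window_features(frame_feats, n))
```

The resulting per-policy score curves are the raw material for both V1 (plotted directly as line graphs) and V2 (binarized into candidate segments), as described next.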
3.1.2. Techniques to Provide ML-Assistance

We proceed to develop ways to use model predictions to most effectively assist human reviews, and provide details on two different designs (V1 and V2).

V1 Hints: Continuous Line Graphs. For video annotations, it is standard to display the video itself with playback controls and additional information in the form of a timeline [37]. For V1, we display the ML predictions as a line graph across the entire timeline of the video. The user interface is demonstrated in Figure 2. Raters can examine the line graphs and jump to the point in the video where a peak (policy violation) occurs. While we have model predictions for hundreds of granular policies, due to visual clutter, we only display plots for a small subset of the most frequent policies. Raters also don't have the ability to provide feedback to improve model predictions.

V2 Hints: Towards a Scalable and Interactive ML-Assistance UI. In V2, we borrow elements from recommender systems [38, 39] to develop a more scalable and interactive interface (see Figure 3), where we pre-populate video segments that may contain policy violations detected by machine learning models.

To generate video segments, as shown in Figure 3, we introduce an algorithm to binarize continuous model scores per policy into discrete segments and use a ranking algorithm to recommend the most useful segments to raters. For simplicity, we used a threshold-based algorithm. The threshold selection constitutes a tradeoff between precision and recall, where precision captures the utility of the predicted segments to raters and recall captures the comprehensive coverage of all video violations. The algorithm chooses a threshold maximizing recall, while maintaining a minimum precision (40% based on user studies). The regions of the video where the ML scores are above this chosen threshold are displayed as predicted policy violating segments to raters. Several heuristics are then applied to maximize segment quality, e.g., we merge segments that are close together (<3% of the whole video length apart) to avoid visual clutter.
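The sketch below illustrates the threshold selection and segment extraction just described, using the 40% minimum precision floor and the <3%-of-video merge heuristic from the text; the function names, the labeled validation set, and the data layout are our assumptions rather than the paper's implementation.

```python
import numpy as np

def pick_threshold(val_scores: np.ndarray, val_labels: np.ndarray,
                   min_precision: float = 0.40) -> float:
    """Pick the score threshold maximizing recall subject to a precision floor,
    estimated on a labeled validation set of segment scores."""
    val_labels = val_labels.astype(bool)
    best_t, best_recall = np.inf, -1.0  # np.inf => emit no segments if nothing qualifies
    for t in np.unique(val_scores):
        pred = val_scores >= t
        tp = np.sum(pred & val_labels)
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(val_labels.sum(), 1)
        if precision >= min_precision and recall > best_recall:
            best_t, best_recall = t, recall
    return best_t

def extract_segments(score_curve: np.ndarray, threshold: float,
                     merge_gap_frac: float = 0.03) -> list:
    """Binarize one policy's per-frame score curve into [start, end) frame spans
    above the threshold, merging spans closer than merge_gap_frac of the video."""
    above = score_curve >= threshold
    spans, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append([start, i])
            start = None
    if start is not None:
        spans.append([start, len(above)])
    merged, max_gap = [], merge_gap_frac * len(score_curve)
    for span in spans:
        if merged and span[0] - merged[-1][1] < max_gap:
            merged[-1][1] = span[1]  # merge nearby spans to reduce visual clutter
        else:
            merged.append(span)
    return merged
```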
Finally, to reduce information overload, we borrow from the learning to rank concept [33] to rank candidate segments and limit the number of displayed segments. The ranking algorithm prioritizes segments based on the max score across segment frames, and egregiousness of the predicted policies. We then select the top N segments to display to raters, with N selected through user studies. We pre-populate each ML suggested segment in the video timeline UI as seen in Figure 3. Raters can choose to accept or reject the suggested segments. These logged interactions can be used to provide feedback to improve the ML models.
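A minimal sketch of this ranking step follows, using one plausible ordering of the two criteria above (egregiousness first, then max score); each candidate segment is assumed to carry its policy, frame span, and maximum score, and the egregiousness weights and default top_n are placeholders standing in for the values tuned through user studies.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class CandidateSegment:
    policy: str
    start_frame: int
    end_frame: int
    max_score: float  # maximum model score across the segment's frames

def rank_segments(candidates: List[CandidateSegment],
                  egregiousness: Dict[str, float],
                  top_n: int = 5) -> List[CandidateSegment]:
    """Order candidates by policy egregiousness, then by max score, and keep
    only the top_n segments to limit the number of hints shown to a rater."""
    def key(seg: CandidateSegment) -> Tuple[float, float]:
        return (egregiousness.get(seg.policy, 0.0), seg.max_score)
    return sorted(candidates, key=key, reverse=True)[:top_n]
```

Sorting on the (egregiousness, max score) pair keeps the most severe probable violations at the top of the pre-populated timeline even when a milder policy happens to score higher.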
3.2. Human Feedback to ML-Models

The ambiguous and fluid policy definitions, along with a changing distribution of videos on online platforms, pose a challenge for building robust models to accurately predict policy violations for providing ML-assistance. We hence continuously need more data and human feedback to improve the models. We show that ML-assistance increases the total number of segment labels submitted vs. no assistance. The labels in turn can serve as new ground truth for continuously re-training the models and improving performance. Additionally, with the reject button shown in V2 we collect clean negative labels; earlier, we had only "weak" negatives from videos where no violations were annotated¹. Further exploiting the human feedback in combination with active learning strategies is an area of future work.

¹ Even if no violations were annotated, they could still be present in video segments the rater did not watch, hence the negative labels inferred are weak/noisy.

4. Experiments

Our proposed methodology is evaluated using raters from the authors' organization that regularly perform video content moderation reviews on live queues. Raters are separated into two pools: experts and generalists, with 150 and 400 raters respectively. The expert pool is a group of more experienced quality assurance (QA) raters with demonstrably better decision quality over a long period of time. Since their focus is QA, they don't have fixed productivity targets as generalists do. They can hence spend more time reviewing each video comprehensively, leading to higher decision quality. In our experiment setup, each video in an evaluation dataset is independently reviewed by 1 expert and 2 generalist raters. We use the labels from expert raters on the dataset as the ground truth to evaluate the 2 sets of generalist rater decisions.

4.1. Experimental Setup

4.1.1. Datasets

Our two evaluation datasets contain videos viewed on a large online video platform: (1) "Live traffic" dataset: sampled from live traffic, hence containing a very low proportion of policy violating videos; (2) "Affected slice" dataset: sampled from live traffic and filtered to only videos with ML-hints present, containing 13-20% policy violating videos.

4.1.2. Metrics

Our human ratings quality metric is calculated at the video-level and conveys the correctness of the final, binary content moderation decision, e.g., take down or not. We compute the precision (P), recall (R), and disagreement rate for each of the 2 sets of generalist video-level decisions. We consider the expert decision as ground truth and report the averaged values across the 2 sets. On live traffic datasets, we use P/R over standard inter-rater disagreement metrics due to the high class imbalance.

For rater efficiency, we measure: (1) the percentage of policy violating videos where raters provide segment annotations; (2) the number of segment annotations submitted by raters per video; and (3) the average review duration per video.
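As a concrete reading of these metric definitions, the sketch below computes video-level precision, recall, and disagreement rate for one set of generalist decisions against the expert decisions, then averages over the two generalist sets; the dictionary-of-booleans data layout and helper names are assumptions made for illustration.

```python
from typing import Dict

def rater_quality(expert: Dict[str, bool], generalist: Dict[str, bool]) -> Dict[str, float]:
    """Video-level precision, recall, and disagreement rate of one generalist
    decision set, treating the expert's binary decision as ground truth."""
    tp = sum(1 for vid, y in expert.items() if y and generalist.get(vid, False))
    fp = sum(1 for vid, y in expert.items() if not y and generalist.get(vid, False))
    fn = sum(1 for vid, y in expert.items() if y and not generalist.get(vid, False))
    disagree = sum(1 for vid, y in expert.items() if generalist.get(vid, False) != y)
    return {
        "precision": tp / max(tp + fp, 1),
        "recall": tp / max(tp + fn, 1),
        "disagreement": disagree / max(len(expert), 1),
    }

def averaged_quality(expert: Dict[str, bool],
                     gen_a: Dict[str, bool],
                     gen_b: Dict[str, bool]) -> Dict[str, float]:
    """Average the metrics across the two generalist rater sets, as reported."""
    a, b = rater_quality(expert, gen_a), rater_quality(expert, gen_b)
    return {k: (a[k] + b[k]) / 2 for k in a}
```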
4.2. Results

4.2.1. Rater Quality Improvements

We conduct experiments on both "live traffic" and "affected slice" datasets, with the baseline as a review process without ML-hints. Tables 1 and 2 compare the ratings quality metrics of our proposed V1 (line plot of model scores) ML-assistance treatment relative to the baseline, and evaluate the incremental benefit of the V2 (pre-populated segments) treatment over V1, in the V1 + V2 vs. V1 row.

Table 1: Relative impact on live traffic (Dataset 1)

Treatment         Precision   Recall    # Videos
V1 vs. Baseline   +9.82%      +1.37%    3456
V1 + V2 vs. V1    +9.97%      +5.64%    2914

Table 2: Relative impact on affected slice (Dataset 2)

Treatment         Precision   Recall    Disagreement%   # Videos
V1 vs. Baseline   +4.24%      +3.64%    -15.71%         3319
V1 + V2 vs. V1    +7.02%      +14.30%   -32.27%         682

From the live traffic results in Table 1, we see that V1 shows an improvement in precision and recall over the baseline, driven by the improvement on the affected slice as seen in Table 2. We also see large rater quality gains of V2 over V1 on both live traffic and affected slice datasets. The segmentation and ranking algorithms in V2 allow us to overcome the scalability limitation of V1 and expand the number of granular policies covered by model hints from 7 to 18. Specifically for violence related violations, we see a 35% relative recall gain over V1 by expanding the policies with ML-hints from 2 to 9. The V2 design can be scaled to cover hundreds of policies in future versions by dynamically surfacing the most relevant violating segments, further improving recall.

4.2.2. Rater Efficiency Improvements

We observe reduced review duration on policy violating videos with V1 hints vs. without, with more efficiency benefits on longer videos as expected: -14% on videos longer than 10 minutes and -20% on videos longer than 30 minutes. With V2, relative to V1, we see a 3% increase in review duration, but it is offset by a 9% increase in the percentage of policy violating videos with exact segments annotated, and a 24% increase in the number of segment annotations submitted per video.

4.2.3. Interactive ML-Assistance Metrics

Isolating the precision of the ML-assisted segments, we see raters accepting 35% of ML generated hints, which is in line with the 40% precision constraint we chose when converting model scores into discrete segments.

4.2.4. Model Quality Improvements

Since the introduction of V1 hints, we see significant model performance improvements on specific policy areas, driven by the additional human labels collected within a 3 month period (Table 3).

Table 3: Model Quality Improvements

Policy Area           AUCPR    # Positive Labels
Sexually Suggestive   +5.9%    +12.7%
Nudity                +4.9%    +12.7%
Illegal Acts          +8.6%    +12.1%

5. Discussion

One of the potential risks of our proposed human-ML collaboration framework is automation bias [40], where a rater's over-reliance on ML-assistance can result in (i) blind-spots due to humans missing violations and (ii) raters accepting model hints without verification. Our video-level ratings quality evaluation metrics are robust to this since the ground truth comes from expert (QA) raters who review videos comprehensively, looking beyond ML-hints. In practice, we observe little evidence of either (i) or (ii): 56% of the violation segments submitted by raters in the V2 setup are organically created, i.e., they don't overlap with pre-populated hint segments, and the segment acceptance rate is 35%, aligned with our segmentation model precision tuning point of 40%, indicating that raters are verifying and rejecting false positive hints at the expected rate. We could mitigate the risk of (ii) further by enforcing that at least some percentage of hint segments per video is actually watched, or by surfacing the model's confidence in the predicted hint to raters. To ensure robust evaluation of model quality, the AUC improvements in Section 4.2.4 are evaluated on a set of labels collected without model generated segments.

This paper used content moderation as the test bed for our human-ML collaboration proposal. However, it is a more generalized framework that applies to the problem of granular, localized video annotation encountered in various other industry applications, such as identifying products/brands in videos to inform the placement of relevant ads, which we would like to explore further.

6. Future work

For content moderation to scale to the size of online platforms, it is necessary to take model-based enforcement action. We would like to explore the relation between improved ground truth and the improvement of automated, model based enforcement. Leveraging active learning strategies in combination with rater feedback on model generated segments to show further quality improvements in the models is another open area of research. Finally, we will explore multi-armed bandits to balance active learning based exploration for model improvement with model exploitation for providing high quality ML-assistance [41].
References

[1] K. Gogarty, Hate speech and misinformation proliferate on meta products, with 13,500 policy violations documented in the past year alone, https://www.mediamatters.org/facebook/hate-speech-and-misinformation-proliferate-meta-products-13500-policy-violations, 2022.
[2] S. Mellor, Tiktok slammed for videos sharing false information about russia's war on ukraine, https://fortune.com/2022/03/21/tiktok-misinformation-ukraine/, 2022.
[3] S. Jhaver, A. Bruckman, E. Gilbert, Does transparency in moderation really matter? User behavior after content removal explanations on reddit, Proc. ACM Hum.-Comput. Interact. 3 (2019). doi:10.1145/3359252.
[4] A. Kittur, E. H. Chi, B. Suh, Crowdsourcing user studies with mechanical turk, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 453–456. doi:10.1145/1357054.1357127.
[5] M. S. Bernstein, G. Little, R. C. Miller, B. Hartmann, M. S. Ackerman, D. R. Karger, D. Crowell, K. Panovich, Soylent: A word processor with a crowd inside, in: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, UIST '10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 313–322. doi:10.1145/1866029.1866078.
[6] C. Hu, B. B. Bederson, P. Resnik, Y. Kronrod, Monotrans2: A new human computation system to support monolingual translation, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 1133–1136. doi:10.1145/1978942.1979111.
[7] W. Lasecki, C. Miller, A. Sadilek, A. Abumoussa, D. Borrello, R. Kushalnagar, J. Bigham, Real-time captioning by groups of non-experts, Association for Computing Machinery, New York, NY, USA, 2012, pp. 23–34. doi:10.1145/2380116.2380122.
[8] L. B. Chilton, G. Little, D. Edge, D. S. Weld, J. A. Landay, Cascade: Crowdsourcing taxonomy creation, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2013, pp. 1999–2008.
[9] A. Zubiaga, M. Liakata, R. Procter, K. Bontcheva, P. Tolmie, Crowdsourcing the annotation of rumourous conversations in social media, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 347–353.
[10] A. M. Founta, C. Djouvas, D. Chatzakou, I. Leontiadis, J. Blackburn, G. Stringhini, A. Vakali, M. Sirivianos, N. Kourtellis, Large scale crowdsourcing and characterization of twitter abusive behavior, in: Twelfth International AAAI Conference on Web and Social Media, 2018.
[11] H. Zhao, A. Torralba, L. Torresani, Z. Yan, HACS: Human action clips and segments dataset for recognition and temporal localization, 2017. URL: https://arxiv.org/abs/1712.09374.
[12] M. Z. Trujillo, M. Gruppi, C. Buntain, B. D. Horne, The MeLa BitChute dataset, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 16, 2022, pp. 1342–1351.
[13] S. Abu-El-Haija, N. Kothari, J. Lee, A. P. Natsev, G. Toderici, B. Varadarajan, S. Vijayanarasimhan, YouTube-8M: A large-scale video classification benchmark, arXiv:1609.08675, 2016. URL: https://arxiv.org/pdf/1609.08675v1.pdf.
[14] V. Lai, S. Carton, R. Bhatnagar, Q. V. Liao, Y. Zhang, C. Tan, Human-AI collaboration via conditional delegation: A case study of content moderation, in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, CHI '22, Association for Computing Machinery, New York, NY, USA, 2022. doi:10.1145/3491102.3501999.
[15] E. Chandrasekharan, C. Gandhi, M. W. Mustelier, E. Gilbert, Crossmod: A cross-community learning-based system to assist reddit moderators, Proc. ACM Hum.-Comput. Interact. 3 (2019). doi:10.1145/3359276.
[16] E. Beede, E. Baylor, F. Hersch, A. Iurchenko, L. Wilcox, P. Ruamviboonsuk, L. M. Vardoulakis, A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1–12. URL: https://doi.org/10.1145/3313831.3376718.
[17] V. Lai, C. Tan, On human predictions with explanations and predictions of machine learning models: A case study on deception detection, in: Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 29–38. doi:10.1145/3287560.3287590.
[18] J. Park, R. Krishna, P. Khadpe, L. Fei-Fei, M. Bernstein, AI-based request augmentation to increase crowdsourcing participation, Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 7 (2019) 115–124. URL: https://ojs.aaai.org/index.php/HCOMP/article/view/5282.
[19] M. Bartolo, T. Thrush, S. Riedel, P. Stenetorp, R. Jia, D. Kiela, Models in the loop: Aiding crowdworkers with generative annotation assistants, CoRR abs/2112.09062 (2021). URL: https://arxiv.org/abs/2112.09062.
[20] Z. Ashktorab, M. Desmond, J. Andres, M. Muller, N. N. Joshi, M. Brachman, A. Sharma, K. Brimijoin, Q. Pan, C. T. Wolf, E. Duesterwald, C. Dugan, W. Geyer, D. Reimer, AI-assisted human labeling: Batching for efficiency without overreliance, Proc. ACM Hum.-Comput. Interact. 5 (2021). doi:10.1145/3449163.
[21] S. Anjum, A. Verma, B. Dang, D. Gurari, Exploring the use of deep learning with crowdsourcing to annotate images, Human Computation 8 (2021) 76–106. doi:10.15346/hc.v8i2.121.
[22] I. Arous, J. Yang, M. Khayati, P. Cudré-Mauroux, OpenCrowd: A human-AI collaborative approach for finding social influencers via open-ended answers aggregation, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1851–1862. URL: https://doi.org/10.1145/3366423.3380254.
[23] J. W. Vaughan, Making better use of the crowd: How crowdsourcing can advance machine learning research, Journal of Machine Learning Research 18 (2018) 1–46. URL: http://jmlr.org/papers/v18/17-234.html.
[24] B. Settles, Active Learning Literature Survey, Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
[25] Y. Yang, Z. Ma, F. Nie, X. Chang, A. G. Hauptmann, Multi-class active learning by uncertainty sampling with diversity maximization, International Journal of Computer Vision 113 (2015) 113–127.
[26] B. Ghai, Q. V. Liao, Y. Zhang, R. Bellamy, K. Mueller, Explainable active learning (XAL): An empirical study of how local explanations impact annotator experience, 2020. URL: https://arxiv.org/abs/2001.09219.
[27] M. J. Eppler, J. Mengis, The concept of information overload: A review of literature from organization science, accounting, marketing, MIS, and related disciplines (2004), Kommunikationsmanagement im Wandel (2008) 271–305.
[28] Y.-C. Chen, R.-A. Shang, C.-Y. Kao, The effects of information overload on consumers' subjective state towards buying decision in the internet shopping environment, Electronic Commerce Research and Applications 8 (2009) 48–58.
[29] H. Hu, L. Xie, Z. Du, R. Hong, Q. Tian, One-bit supervision for image classification, 2020. URL: https://arxiv.org/abs/2009.06168.
[30] R. M. Losee Jr, Minimizing information overload: The ranking of electronic messages, Journal of Information Science 15 (1989) 179–189.
[31] K. Koroleva, A. J. Bolufé Röhler, Reducing information overload: Design and evaluation of filtering & ranking algorithms for social networking sites, 2012.
[32] J. Chen, R. Nairn, L. Nelson, M. Bernstein, E. Chi, Short and tweet: Experiments on recommending content from information streams, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2010, pp. 1185–1194.
[33] A. Karatzoglou, L. Baltrunas, Y. Shi, Learning to rank for recommender systems, in: Proceedings of the 7th ACM Conference on Recommender Systems, 2013, pp. 493–494.
[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, 2015. URL: https://arxiv.org/abs/1512.00567.
[35] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, K. W. Wilson, CNN architectures for large-scale audio classification, CoRR abs/1609.09430 (2016). URL: http://arxiv.org/abs/1609.09430.
[36] J.-B. Alayrac, A. Recasens, R. Schneider, R. Arandjelović, J. Ramapuram, J. De Fauw, L. Smaira, S. Dieleman, A. Zisserman, Self-supervised multimodal versatile networks, 2020. URL: https://arxiv.org/abs/2006.16228.
[37] C. Vondrick, D. Ramanan, D. Patterson, Efficiently scaling up video annotation with crowdsourced marketplaces, in: European Conference on Computer Vision, Springer, 2010, pp. 610–623.
[38] I. Portugal, P. Alencar, D. Cowan, The use of machine learning algorithms in recommender systems: A systematic review, Expert Systems with Applications 97 (2018) 205–227.
[39] K. Swearingen, R. Sinha, Interaction design for recommender systems, in: Designing Interactive Systems, volume 6, Citeseer, 2002, pp. 312–334.
[40] L. J. Skitka, K. L. Mosier, M. Burdick, Does automation bias decision-making?, International Journal of Human-Computer Studies 51 (1999) 991–1006.
[41] J. McInerney, B. Lacker, S. Hansen, K. Higley, H. Bouchard, A. Gruson, R. Mehrotra, Explore, exploit, and explain: Personalizing explainable recommendations with bandits, in: Proceedings of the 12th ACM Conference on Recommender Systems, 2018, pp. 31–39.