A human-ML collaboration framework for improving
video content reviews
Meghana Deodhar, Xiao Ma, Yixin Cai, Alex Koes, Alex Beutel and Jilin Chen
Google, USA


Abstract
We deal with the problem of localized in-video taxonomic human annotation in the video content moderation domain, where the goal is to identify video segments that violate granular policies, e.g., community guidelines on an online video platform. High quality human labeling is critical for enforcement in content moderation. This is challenging due to the problem of information overload: raters need to apply a large taxonomy of granular policy violations with ambiguous definitions, within a limited review duration, to relatively long videos. Our key contribution is a novel human-machine learning (ML) collaboration framework aimed at maximizing the quality and efficiency of human decisions in this setting: human labels are used to train segment-level models, the predictions of which are displayed as “hints” to human raters, indicating probable regions of the video with specific policy violations. The human verified/corrected segment labels can help refine the model further, hence creating a human-ML positive feedback loop. Experiments show improved human video moderation decision quality, as well as improved efficiency through more granular annotations submitted within a similar review duration, which enable a 5-8% AUC improvement in the hint generation models.

Keywords
human computation, machine learning, video content moderation, ranking



CIKM’22: Human-in-the-Loop Data Curation Workshop at the 31st ACM International Conference on Information and Knowledge Management, October 17–22, 2022, Atlanta, GA
email: mdeodhar@google.com (M. Deodhar); xmaa@google.com (X. Ma); yixincai@google.com (Y. Cai); koes@google.com (A. Koes); alexbeutel@google.com (A. Beutel); jilinc@google.com (J. Chen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

The importance of content moderation on online video platforms such as TikTok, YouTube or Instagram is growing [1, 2]. These platforms strive to accurately detect the presence of policy violations within the video, which drive enforcement actions, e.g., the video can be taken down. Given the complexity of this problem, content moderation relies heavily on human judgement and employs large teams of content moderators to perform reviews. Since human annotations directly lead to high stakes decisions, such as content take downs, the quality of the annotations is critical.

For content moderation decisions there is a growing need for transparency in detected policy violations to provide feedback to content creators [3]. This motivates an in-video taxonomic annotation task, where the goal is to provide localized and fine-grained policy-specific annotations, i.e., both the time regions (video segments) and the exact policies violated, which inform downstream video content moderation decisions. To cover the spectrum of potential violations, policy playbooks typically contain hundreds of fine-grained policies. This large space of policy violations can be organized as a taxonomy of broad categories such as Profanity, Violence, Nudity, etc., each of which contains several granular violations. For instance, Violence could include a range of granular classes such as animal abuse or graphic violence in video games. The class definitions are complex, ambiguous and often require nuanced judgment to apply, e.g., graphic violence. New policy classes may be added over time as well, e.g., Covid anti-vaccination. Moreover, there is a class imbalance issue: some egregious violations may be very rare.

Our goal is to maximize the quality and efficiency of the complex, granular, localized policy annotation task, hence leading to the correct video level enforcement decision. We achieve this by tackling the key issue of "information overload" faced by raters in providing high quality annotations, where 1) the sheer volume of videos on large online video platforms means raters only have limited review time per video; and 2) the large taxonomy of policies makes it hard for raters to recall the complete set of all granular violations for every video region they watch in the limited review duration.

We propose a human-ML collaboration framework to maximize human ratings quality and efficiency by addressing "information overload". We train models on granular rater annotations to predict policy violations, which are then combined with innovative front-end elements in the rating tool to provide "hints" to assist raters. We borrow from the information retrieval literature and use ranking mechanisms to identify the most useful and succinct set of hints. In experiments, we show that this enables raters to label policy violations more correctly, comprehensively, and efficiently. The human interactions with model hints pave the way for leveraging human feedback to improve the underlying ML models.


2. Related Work

The crowd sourcing literature is very rich in the application of human annotations to perform a variety of tasks such as text processing [4, 5, 6], audio transcription [7], taxonomy creation [8], and social media analysis [9, 10]. Although there is existing literature on video annotation, it is primarily focused on identifying actions or labeling entities easily distinguishable by humans using visual information only, e.g., high jump, thunderstorms. The primary goal of these tasks is to create large datasets for facilitating Machine Learning/Perception applications [11, 12], e.g., the YouTube-8m dataset [13]. This is very different from our setup, where raters annotate granular, ambiguously defined policies using multi-modal signals: video, audio, and text from the transcript, the video title, and the description. The recent emergence of crowd sourcing literature on content moderation primarily covers textual content such as user comments [14, 15]; however, there has been little focus on video moderation tasks.

Human-ML collaboration is an emerging area of research with two main categories of work:

ML-Assisted Human Labeling: ML-assistance through predictions and explanations has been used to improve the quality of human decisions in several domains [16, 17, 18], including content moderation [15, 14]. Crossmod [15], for instance, uses a model trained on historic cross-community moderation decisions to enable Reddit human moderators to find more violations. Interactive ML-assistance, which we leverage in Section 3.1.2, is used by Bartolo et al. [19] to assist human annotators to develop adversarial examples for improving a natural language question answering model. ML-assistance has also been shown to improve the efficiency of the human labeling task [20, 15, 21]. Our work aims to exploit both the human labeling quality and efficiency benefits.

Improving ML Models Through Human Annotations: Human annotations are useful in constructing hybrid human-ML systems that leverage the complementary strengths of both to improve the performance of ML models [22, 23]. Existing work on active learning [24, 25] shows that strategically sampling data points can reduce human workload, but the purpose is to improve machine learning models instead of assisting raters. Recent work on explainable active learning (XAL) [26] has called for better designing for the human experience in the human-AI interface.

Our work shows that it is possible to achieve model improvements and assist human raters, bridging the gap between ML-assisted human labeling and active learning. The novelty of our approach is that the models are re-trained continuously on the output of the human annotation task, which they provide assistance for, constructing a positive feedback loop between humans and models. In this collaborative framework, we have the opportunity to improve both modeling and human rater performance.

Information overload, which we encounter in our content moderation setting, is a well studied problem that reduces the effectiveness of a human’s decision making ability [27, 28]. To address this, we build on the intuition that humans find it easier to verify or correct suggestions rather than produce new annotations from scratch [29, 5]. Our ML-assistance proposal strives to select the most informative but succinct ML-based "hints" to surface to raters by drawing on the information retrieval and ranking literature. ML-based ranking has been shown to reduce information overload effectively in electronic messaging [30] and social media [31, 32]. We draw inspiration from the learning to rank idea [33] to reduce information overload for raters.

Figure 1: Human-ML collaboration set up.

3. Proposed Human-ML Collaboration Framework

The main contribution of this paper is the human-ML collaboration framework visualized in Figure 1. We use the predictions of ML models to provide assistance to human raters and evaluate the effectiveness of different user interfaces for the ML-assistance. Since the policy violation prediction task is hard for ML models, the feedback from human raters is useful to improve the models. ML-assistance enables raters to provide segment level annotations more efficiently, leading to more ground truth to train/update the ML models. Additionally, we can enable raters to interact with the ML hints (accept/reject), providing direct feedback to refine the model, establishing a positive human-ML feedback loop.
3.1. ML-Assisted Human Reviews

As discussed in the introduction, raters face an information overload problem due to a combination of granular, complex policy definitions and limited time per video. Often, raters need to use their judgement to decide what parts of the video to watch to identify violating segments, resulting in fleeting violations being missed, or inconsistencies across raters watching different sections of the same video.

It is intuitive that raters would benefit from pointers to likely unsafe regions within a video, labeled with the exact policies being violated. Even if not completely precise, such pointers enable them to optimize their review bandwidth by focusing on potentially more relevant regions, making them less likely to miss violations. We achieve this by training per-policy ML models and transforming their predictions into "hints" provided to raters, described in more detail in Section 3.1.2. We tune the models for high recall to minimize uncaught violations, while relying on human judgement to improve the precision of the labeled violations.

3.1.1. Segment Level Model Training

To train segment level policy violation models, we frame the following modeling problem: given multi-modal features for a fixed length video segment, predict whether the segment contains specific policy violations. We generate training datasets by extracting per-frame visual and audio features from the human labeled violation region. The visual and audio features are the dense embedding layers of standard convolutional image semantic similarity [34] and audio classification [35] models, respectively. Based on empirical evidence, we select flat concatenation to aggregate the frame features over a segment, versus average/max pooling. The final model we train is a multi-label DNN with the aggregated frame-level visual and audio features as input, where each label corresponds to a fine grained policy violation. We use MultiModal Versatile Networks [36] during model training to learn a better representation of the audio and visual features for our classification task, which further improves model performance. We use a sliding window approach to apply the trained model and generate prediction scores per policy violation for a fixed length segment starting at each frame of the video. Using a window of n frames and a stride of 1 frame, we produce model scores for segments with start and end frames [0 to n-1], [1 to n], and so on until the end of the video, padding with empty features to fit the segment length for the last n frames.
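To make the inference step concrete, the sketch below illustrates the sliding-window scoring described above. The embedding size, window length, policy count, and the toy stand-in model are illustrative assumptions, not details of our production system.

```python
import numpy as np

# Illustrative sizes (not the production configuration): per-frame
# visual+audio embedding of size D, window of n frames, P granular policies.
N_FRAMES, D, WINDOW_N, N_POLICIES = 300, 128, 16, 18

def score_segment(model, window_feats):
    """Flat-concatenate the n frame embeddings (no pooling) and score one segment."""
    x = window_feats.reshape(1, -1)   # shape: (1, n * D)
    return model(x)[0]                # shape: (N_POLICIES,) of per-policy scores

def sliding_window_scores(model, frame_feats, n=WINDOW_N):
    """Score a fixed-length segment starting at every frame (stride of 1 frame),
    padding the tail with empty (zero) features for the last n frames."""
    num_frames, dim = frame_feats.shape
    padded = np.vstack([frame_feats, np.zeros((n - 1, dim))])
    scores = np.empty((num_frames, N_POLICIES))
    for start in range(num_frames):
        scores[start] = score_segment(model, padded[start:start + n])
    return scores  # scores[t, p] ~ score for policy p in segment [t, t + n)

# Toy stand-in for the trained multi-label DNN, only to make the sketch runnable.
rng = np.random.default_rng(0)
weights = rng.normal(size=(WINDOW_N * D, N_POLICIES))
toy_model = lambda x: 1.0 / (1.0 + np.exp(-x @ weights))

frame_features = rng.normal(size=(N_FRAMES, D))  # per-frame visual + audio embeddings
per_policy_scores = sliding_window_scores(toy_model, frame_features)
print(per_policy_scores.shape)  # (300, 18)
```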
3.1.2. Techniques to Provide ML-Assistance

We proceed to develop ways to use model predictions to most effectively assist human reviews, and provide details on two different designs (V1 and V2).

V1 Hints: Continuous Line Graphs. For video annotations, it is standard to display the video itself with playback controls and additional information in the form of a timeline [37]. For V1, we display the ML predictions as a line graph across the entire timeline of the video. The user interface is demonstrated in Figure 2. Raters can examine the line graphs and jump to the point in the video where a peak (policy violation) occurs.

Figure 2: V1 Continuous Line Graph for Specific Policy Hints

While we have model predictions for hundreds of granular policies, due to visual clutter we only display plots for a small subset of the most frequent policies. Raters also don’t have the ability to provide feedback to improve model predictions.

V2 Hints: Towards a Scalable and Interactive ML-Assistance UI. In V2, we borrow elements from recommender systems [38, 39] to develop a more scalable and interactive interface (see Figure 3), where we pre-populate video segments that may contain policy violations detected by machine learning models.

Figure 3: V2 Pre-Populated Segments from ML Models

To generate video segments, as shown in Figure 3, we introduce an algorithm to binarize continuous model scores per policy into discrete segments and use a ranking algorithm to recommend the most useful segments to raters. For simplicity, we used a threshold-based algorithm.
The threshold selection constitutes a tradeoff between precision and recall, where precision captures the utility of the predicted segments to raters and recall captures the comprehensive coverage of all video violations. The algorithm chooses a threshold maximizing recall, while maintaining a minimum precision (40% based on user studies). The regions of the video where the ML scores are above this chosen threshold are displayed as predicted policy violating segments to raters. Several heuristics are then applied to maximize segment quality, e.g., we merge segments that are close together (<3% of the whole video length apart) to avoid visual clutter.

Finally, to reduce information overload, we borrow from the learning to rank concept [33] to rank candidate segments and limit the number of displayed segments. The ranking algorithm prioritizes segments based on the max score across segment frames and the egregiousness of the predicted policies. We then select the top 𝑁 segments to display to raters, with 𝑁 selected through user studies. We pre-populate each ML suggested segment in the video timeline UI as seen in Figure 3. Raters can choose to accept or reject the suggested segments. These logged interactions can be used to provide feedback to improve the ML models.
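As a rough illustration of the V2 segment-generation pipeline just described, the sketch below binarizes per-frame scores with a fixed threshold, merges nearby segments, and ranks the candidates. The Segment structure, the egregiousness weights, and the specific threshold and gap values are hypothetical; in practice the threshold is tuned per policy to keep precision above the 40% point mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

@dataclass
class Segment:
    policy: str
    start: int        # start frame
    end: int          # end frame (exclusive)
    max_score: float  # max model score across the segment's frames

def binarize_scores(scores: Sequence[float], policy: str,
                    threshold: float, min_gap: int) -> List[Segment]:
    """Turn per-frame scores into contiguous above-threshold segments, then
    merge segments separated by fewer than min_gap frames (e.g. ~3% of the
    video length) to reduce visual clutter."""
    raw, start = [], None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t
        elif s < threshold and start is not None:
            raw.append(Segment(policy, start, t, max(scores[start:t])))
            start = None
    if start is not None:
        raw.append(Segment(policy, start, len(scores), max(scores[start:])))

    merged: List[Segment] = []
    for seg in raw:
        if merged and seg.start - merged[-1].end < min_gap:
            prev = merged[-1]
            merged[-1] = Segment(policy, prev.start, seg.end,
                                 max(prev.max_score, seg.max_score))
        else:
            merged.append(seg)
    return merged

def rank_hints(candidates: List[Segment],
               egregiousness: Dict[str, float], top_n: int) -> List[Segment]:
    """Prioritize candidate segments by policy egregiousness and max score,
    and keep only the top N to limit information overload."""
    ranked = sorted(candidates,
                    key=lambda seg: (egregiousness.get(seg.policy, 0.0),
                                     seg.max_score),
                    reverse=True)
    return ranked[:top_n]

# Example usage with made-up scores for two policies.
violence_scores = [0.1, 0.2, 0.7, 0.8, 0.3, 0.75, 0.9, 0.2, 0.1, 0.1]
nudity_scores   = [0.05, 0.1, 0.1, 0.6, 0.65, 0.1, 0.1, 0.1, 0.1, 0.1]
candidates = (binarize_scores(violence_scores, "violence", threshold=0.6, min_gap=2)
              + binarize_scores(nudity_scores, "nudity", threshold=0.6, min_gap=2))
hints = rank_hints(candidates, egregiousness={"violence": 2.0, "nudity": 1.0}, top_n=3)
print(hints)
```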
3.2. Human Feedback to ML-Models

The ambiguous and fluid policy definitions, along with a changing distribution of videos on online platforms, pose a challenge for building robust models that accurately predict policy violations for ML-assistance. We hence continuously need more data and human feedback to improve the models. We show that ML-assistance increases the total number of segment labels submitted vs. no assistance. The labels in turn can serve as new ground truth for continuously re-training the models and improving performance. Additionally, with the reject button shown in V2 we collect clean negative labels; earlier, we had only "weak" negatives from videos where no violations were annotated¹. Further exploiting the human feedback in combination with active learning strategies is an area of future work.

¹ Even if no violations were annotated, they could still be present in video segments the rater did not watch, hence the negative labels inferred are weak/noisy.

4. Experiments

Our proposed methodology is evaluated using raters from the authors' organization who regularly perform video content moderation reviews on live queues. Raters are separated into two pools: experts and generalists, with 150 and 400 raters respectively. The expert pool is a group of more experienced quality assurance (QA) raters with demonstrably better decision quality over a long period of time. Since their focus is QA, they don't have fixed productivity targets as generalists do. They can hence spend more time reviewing each video comprehensively, leading to higher decision quality. In our experiment setup, each video in an evaluation dataset is independently reviewed by 1 expert and 2 generalist raters. We use the labels from expert raters on the dataset as the ground truth to evaluate the 2 sets of generalist rater decisions.

4.1. Experimental Setup

4.1.1. Datasets

Our two evaluation datasets contain videos viewed on a large online video platform: (1) "Live traffic" dataset: sampled from live traffic, hence containing a very low proportion of policy violating videos; (2) "Affected slice" dataset: sampled from live traffic and filtered to only videos with ML-hints present, containing 13-20% policy violating videos.

4.1.2. Metrics

Our human ratings quality metric is calculated at the video level and conveys the correctness of the final, binary content moderation decision, e.g., take down or not. We compute the precision (P), recall (R), and disagreement rate for each of the 2 sets of generalists' video-level decisions. We consider the expert decision as ground truth and report the averaged values across the 2 sets. On live traffic datasets, we use P/R over standard inter-rater disagreement metrics due to the high class imbalance.

For rater efficiency, we measure: (1) the percentage of policy violating videos where raters provide segment annotations; (2) the number of segment annotations submitted by raters per video; and (3) the average review duration per video.
4.2. Results

4.2.1. Rater Quality Improvements

We conduct experiments on both "live traffic" and "affected slice" datasets, with the baseline as a review process without ML-hints. Tables 1 and 2 compare the ratings quality metrics of our proposed V1 (line plot of model scores) ML-assistance treatment relative to the baseline, and evaluate the incremental benefit of the V2 (pre-populated segments) treatment over V1, in the V1 + V2 vs. V1 row.

  Treatment          Precision   Recall    # Videos
  V1 vs. Baseline     +9.82%     +1.37%      3456
  V1 + V2 vs. V1      +9.97%     +5.64%      2914

Table 1: Relative impact on live traffic (Dataset 1)
  Treatment          Precision   Recall    Disagreement%   # Videos
  V1 vs. Baseline     +4.24%     +3.64%       -15.71%        3319
  V1 + V2 vs. V1      +7.02%     +14.30%      -32.27%         682

Table 2: Relative impact on affected slice (Dataset 2)



From the live traffic results in Table 1, we see that V1 shows an improvement in precision and recall over the baseline, driven by the improvement on the affected slice as seen in Table 2.

We also see large rater quality gains of V2 over V1 on both live traffic and affected slice datasets. The segmentation and ranking algorithms in V2 allow us to overcome the scalability limitation of V1 and expand the number of granular policies covered by model hints from 7 to 18. Specifically for violence related violations, we see a 35% relative recall gain over V1 by expanding policies with ML-hints from 2 to 9. The V2 design can be scaled to cover hundreds of policies in future versions by dynamically surfacing the most relevant violating segments, further improving recall.

4.2.2. Rater Efficiency Improvements

We observe reduced review duration on policy violating videos with V1 hints vs. without, with more efficiency benefits on longer videos as expected: -14% on videos longer than 10 minutes and -20% on videos longer than 30 minutes. With V2, relative to V1, we see a 3% increase in review duration, but it is traded off by a 9% increase in the percentage of policy violating videos with exact segments annotated, and a 24% increase in the number of segment annotations submitted per video.

4.2.3. Interactive ML-Assistance Metrics

Isolating the precision of the ML-assisted segments, we see raters accepting 35% of ML generated hints, which is in line with the 40% precision constraint we chose when converting model scores into discrete segments.

4.2.4. Model Quality Improvements

Since the introduction of V1 hints, we see significant model performance improvements with more human labels collected within a 3 month period on specific policy areas, as shown in Table 3.

  Policy Area            AUCPR    # Positive Labels
  Sexually Suggestive    +5.9%        +12.7%
  Nudity                 +4.9%        +12.7%
  Illegal Acts           +8.6%        +12.1%

Table 3: Model Quality Improvements

5. Discussion

One of the potential risks of our proposed human-ML collaboration framework is automation bias [40], where a rater’s over-reliance on ML-assistance can result in (i) blind-spots due to humans missing violations and (ii) raters accepting model hints without verification. Our video-level ratings quality evaluation metrics are robust to this since the ground truth comes from expert (QA) raters who review videos comprehensively, looking beyond ML-hints. In practice, we observe little evidence of both (i) and (ii). 56% of the violation segments submitted by raters in the V2 setup are organically created, i.e., don’t overlap with pre-populated hint segments. The segment acceptance rate is 35%, aligned with our segmentation model precision tuning point of 40%, indicating that raters are verifying and rejecting false positive hints at the expected rate. We could mitigate the risk of (ii) further by enforcing that at least some percentage of hint segments per video is actually watched, or by surfacing the model’s confidence in the predicted hint to raters. To ensure robust evaluation of model quality, the AUC improvements in Section 4.2.4 are evaluated on a set of labels collected without model generated segments.

6. Future Work

For content moderation to scale to the size of online platforms, it is necessary to take model-based enforcement action. We would like to explore the relation between improved ground truth and improvement of automated, model based enforcement. Leveraging active learning strategies in combination with utilizing rater feedback on model generated segments to show further quality improvements in the models is another open area of research. Finally, we will explore multi-armed bandits to balance active learning based exploration for model improvement with model exploitation for providing high quality ML-assistance [41].

This paper used content moderation as the test bed for our human-ML collaboration proposal. However, it is a more generalized framework that applies to the problem of granular, localized video annotation encountered in various other industry applications, such as identifying products/brands in videos to inform the placement of
relevant ads, which we would like to explore further.

References

[1] K. Gogarty, Hate speech and misinformation proliferate on meta products, with 13,500 policy violations documented in the past year alone, https://www.mediamatters.org/facebook/hate-speech-and-misinformation-proliferate-meta-products-13500-policy-violations, 2022.
[2] S. Mellor, Tiktok slammed for videos sharing false information about russia's war on ukraine, https://fortune.com/2022/03/21/tiktok-misinformation-ukraine/, 2022.
[3] S. Jhaver, A. Bruckman, E. Gilbert, Does transparency in moderation really matter? user behavior after content removal explanations on reddit, Proc. ACM Hum.-Comput. Interact. 3 (2019). URL: https://doi.org/10.1145/3359252. doi:10.1145/3359252.
[4] A. Kittur, E. H. Chi, B. Suh, Crowdsourcing user studies with mechanical turk, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 453–456. URL: https://doi.org/10.1145/1357054.1357127. doi:10.1145/1357054.1357127.
[5] M. S. Bernstein, G. Little, R. C. Miller, B. Hartmann, M. S. Ackerman, D. R. Karger, D. Crowell, K. Panovich, Soylent: A word processor with a crowd inside, in: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, UIST '10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 313–322. URL: https://doi.org/10.1145/1866029.1866078. doi:10.1145/1866029.1866078.
[6] C. Hu, B. B. Bederson, P. Resnik, Y. Kronrod, Monotrans2: A new human computation system to support monolingual translation, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 1133–1136. URL: https://doi.org/10.1145/1978942.1979111. doi:10.1145/1978942.1979111.
[7] W. Lasecki, C. Miller, A. Sadilek, A. Abumoussa, D. Borrello, R. Kushalnagar, J. Bigham, Real-Time Captioning by Groups of Non-Experts, Association for Computing Machinery, New York, NY, USA, 2012, pp. 23–34. URL: https://doi.org/10.1145/2380116.2380122.
[8] L. B. Chilton, G. Little, D. Edge, D. S. Weld, J. A. Landay, Cascade: Crowdsourcing taxonomy creation, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2013, pp. 1999–2008.
[9] A. Zubiaga, M. Liakata, R. Procter, K. Bontcheva, P. Tolmie, Crowdsourcing the annotation of rumourous conversations in social media, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 347–353.
[10] A. M. Founta, C. Djouvas, D. Chatzakou, I. Leontiadis, J. Blackburn, G. Stringhini, A. Vakali, M. Sirivianos, N. Kourtellis, Large scale crowdsourcing and characterization of twitter abusive behavior, in: Twelfth International AAAI Conference on Web and Social Media, 2018.
[11] H. Zhao, A. Torralba, L. Torresani, Z. Yan, Hacs: Human action clips and segments dataset for recognition and temporal localization, 2017. URL: https://arxiv.org/abs/1712.09374. doi:10.48550/ARXIV.1712.09374.
[12] M. Z. Trujillo, M. Gruppi, C. Buntain, B. D. Horne, The mela bitchute dataset, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 16, 2022, pp. 1342–1351.
[13] S. Abu-El-Haija, N. Kothari, J. Lee, A. P. Natsev, G. Toderici, B. Varadarajan, S. Vijayanarasimhan, Youtube-8m: A large-scale video classification benchmark, in: arXiv:1609.08675, 2016. URL: https://arxiv.org/pdf/1609.08675v1.pdf.
[14] V. Lai, S. Carton, R. Bhatnagar, Q. V. Liao, Y. Zhang, C. Tan, Human-ai collaboration via conditional delegation: A case study of content moderation, in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, CHI '22, Association for Computing Machinery, New York, NY, USA, 2022. URL: https://doi.org/10.1145/3491102.3501999. doi:10.1145/3491102.3501999.
[15] E. Chandrasekharan, C. Gandhi, M. W. Mustelier, E. Gilbert, Crossmod: A cross-community learning-based system to assist reddit moderators, Proc. ACM Hum.-Comput. Interact. 3 (2019). URL: https://doi.org/10.1145/3359276. doi:10.1145/3359276.
[16] E. Beede, E. Baylor, F. Hersch, A. Iurchenko, L. Wilcox, P. Ruamviboonsuk, L. M. Vardoulakis, A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1–12. URL: https://doi.org/10.1145/3313831.3376718.
[17] V. Lai, C. Tan, On human predictions with explanations and predictions of machine learning models: A case study on deception detection, in: Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 29–38. URL: https://doi.org/10.1145/3287560.3287590. doi:10.1145/3287560.3287590.
[18] J. Park, R. Krishna, P. Khadpe, L. Fei-Fei, M. Bernstein, Ai-based request augmentation to increase crowdsourcing participation, Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 7 (2019) 115–124. URL: https://ojs.aaai.org/index.php/HCOMP/article/view/5282.
[19] M. Bartolo, T. Thrush, S. Riedel, P. Stenetorp, R. Jia, D. Kiela, Models in the loop: Aiding crowdworkers with generative annotation assistants, CoRR abs/2112.09062 (2021). URL: https://arxiv.org/abs/2112.09062. arXiv:2112.09062.
[20] Z. Ashktorab, M. Desmond, J. Andres, M. Muller, N. N. Joshi, M. Brachman, A. Sharma, K. Brimijoin, Q. Pan, C. T. Wolf, E. Duesterwald, C. Dugan, W. Geyer, D. Reimer, Ai-assisted human labeling: Batching for efficiency without overreliance, Proc. ACM Hum.-Comput. Interact. 5 (2021). URL: https://doi.org/10.1145/3449163. doi:10.1145/3449163.
[21] S. Anjum, A. Verma, B. Dang, D. Gurari, Exploring the use of deep learning with crowdsourcing to annotate images, Human Computation 8 (2021) 76–106. doi:10.15346/hc.v8i2.121.
[22] I. Arous, J. Yang, M. Khayati, P. Cudré-Mauroux, OpenCrowd: A Human-AI Collaborative Approach for Finding Social Influencers via Open-Ended Answers Aggregation, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1851–1862. URL: https://doi.org/10.1145/3366423.3380254.
[23] J. W. Vaughan, Making better use of the crowd: How crowdsourcing can advance machine learning research, Journal of Machine Learning Research 18 (2018) 1–46. URL: http://jmlr.org/papers/v18/17-234.html.
[24] B. Settles, Active Learning Literature Survey, Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
[25] Y. Yang, Z. Ma, F. Nie, X. Chang, A. G. Hauptmann, Multi-class active learning by uncertainty sampling with diversity maximization, International Journal of Computer Vision 113 (2015) 113–127.
[26] B. Ghai, Q. V. Liao, Y. Zhang, R. Bellamy, K. Mueller, Explainable active learning (xal): An empirical study of how local explanations impact annotator experience, 2020. URL: https://arxiv.org/abs/2001.09219. doi:10.48550/ARXIV.2001.09219.
[27] M. J. Eppler, J. Mengis, The concept of information overload: a review of literature from organization science, accounting, marketing, mis, and related disciplines (2004), Kommunikationsmanagement im Wandel (2008) 271–305.
[28] Y.-C. Chen, R.-A. Shang, C.-Y. Kao, The effects of information overload on consumers' subjective state towards buying decision in the internet shopping environment, Electronic Commerce Research and Applications 8 (2009) 48–58.
[29] H. Hu, L. Xie, Z. Du, R. Hong, Q. Tian, One-bit supervision for image classification (2020). URL: https://arxiv.org/abs/2009.06168. doi:10.48550/ARXIV.2009.06168.
[30] R. M. Losee Jr, Minimizing information overload: the ranking of electronic messages, Journal of Information Science 15 (1989) 179–189.
[31] K. Koroleva, A. J. Bolufé Röhler, Reducing information overload: Design and evaluation of filtering & ranking algorithms for social networking sites (2012).
[32] J. Chen, R. Nairn, L. Nelson, M. Bernstein, E. Chi, Short and tweet: experiments on recommending content from information streams, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2010, pp. 1185–1194.
[33] A. Karatzoglou, L. Baltrunas, Y. Shi, Learning to rank for recommender systems, in: Proceedings of the 7th ACM Conference on Recommender Systems, 2013, pp. 493–494.
[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, 2015. URL: https://arxiv.org/abs/1512.00567. doi:10.48550/ARXIV.1512.00567.
[35] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, K. W. Wilson, CNN architectures for large-scale audio classification, CoRR abs/1609.09430 (2016). URL: http://arxiv.org/abs/1609.09430. arXiv:1609.09430.
[36] J.-B. Alayrac, A. Recasens, R. Schneider, R. Arandjelović, J. Ramapuram, J. De Fauw, L. Smaira, S. Dieleman, A. Zisserman, Self-supervised multimodal versatile networks, 2020. URL: https://arxiv.org/abs/2006.16228. doi:10.48550/ARXIV.2006.16228.
[37] C. Vondrick, D. Ramanan, D. Patterson, Efficiently scaling up video annotation with crowdsourced marketplaces, in: European Conference on Computer Vision, Springer, 2010, pp. 610–623.
[38] I. Portugal, P. Alencar, D. Cowan, The use of machine learning algorithms in recommender systems: A systematic review, Expert Systems with Applications 97 (2018) 205–227.
[39] K. Swearingen, R. Sinha, Interaction design for recommender systems, in: Designing Interactive Systems, volume 6, Citeseer, 2002, pp. 312–334.
[40] L. J. Skitka, K. L. Mosier, M. Burdick, Does automation bias decision-making?, International Journal of Human-Computer Studies 51 (1999) 991–1006.
[41] J. McInerney, B. Lacker, S. Hansen, K. Higley, H. Bouchard, A. Gruson, R. Mehrotra, Explore, exploit, and explain: personalizing explainable recommendations with bandits, in: Proceedings of the 12th ACM Conference on Recommender Systems, 2018, pp. 31–39.