<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the Trigger Detection Task at PAN 2023</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matti Wiegmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Magdalena Wolska</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Potthast</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benno Stein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bauhaus-Universität Weimar</institution>
          ,
          <addr-line>Weimar</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leipzig University</institution>
          ,
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ScaDS.AI</institution>
          ,
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Trigger warnings are document labels that warn the reader about content that might cause discomfort or distress. These labels are often asked for by online communities, especially by vulnerable groups. Here, we present trigger detection at PAN 2023 as a multi-label document classification task: Given a fan fiction document, assign all appropriate trigger warnings from a given label set. We derive a set of 32 trigger warnings based on two widely referenced institutional guidelines on sensitive content. We compile a 341,000-document evaluation resource of fan fiction from Archive of our Own (AO3), fully annotated with the 32 trigger warnings. Six participants submitted solutions to the task. The submissions cover several different methods; the most effective ones use hierarchical deep learning with RoBERTa-based encodings. The top approach achieves a macro F1 of 0.35 and a micro F1 of 0.75.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In this pilot edition of the Trigger Detection task at PAN 2023, we establish the computational
problem of identifying whether or not a given document contains triggering content. In particular,
we formalize trigger detection as a multi-label document classification (MLC) task as follows:
Given a fan fiction document, assign all appropriate trigger warnings from the given label set.</p>
      <p>We created a new evaluation resource, PAN23-trigger-detection, containing ca. 340,000 fan
fiction works from Archive of our Own (AO3) annotated with a 32-label trigger warning set.
We rely on user-generated labels (authors assigned warning-like labels) and follow the authors’
understanding of triggers and which documents require a warning. The warnings are assigned
via AO3’s freeform content descriptors (“tags”), a custom, high-dimensional label system. Since
tags include also non-warning descriptors, we developed a distant-supervision strategy to detect
if a freeform tag corresponds to one of the 32 predefined warnings compiled from institutional
content warning guidelines. The task is primarily evaluated with the standard measures for
multi-label classification, micro and macro F1. In total, six participants submitted software to
Trigger Detection 2023.</p>
      <p>This overview paper first details the creation of the evaluation resource (Section 2), in particular
the distillation of the warning label set from two institutional content guidelines, the scraping of
AO3, the distant-supervision labeling, and the curation of the works. Furthermore, the evaluation
procedure (metrics and baselines) is described in Section 3, the six participant submissions are
described in Section 4, and the results are discussed in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>
        For the trigger detection task, we created a new evaluation resource, the PAN23-trigger-detection
corpus, consisting of 341,246 fan fiction works downloaded from Archive of our Own and
annotated in a multi-label setting with a set of 32 warning labels. An extended version of our
annotation method and the evaluation resource is presented by Wiegmann et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Curating a Set of Warning Labels</title>
        <p>
          Since there is no authoritative (closed-set) set of trigger warning labels, we derived these labels
for use in our dataset from two guideline documents for labeling sensitive content: the University
of Reading list of “themes that require trigger warnings” [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and the University of Michigan
list of content warnings [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The two largely overlapping lists each comprise 21 categories of
triggering concepts, including health-related (eating disorders, mental illness), sexually-oriented
(sexual assault, pornography) as well as verbal (hate speech, racial slurs), and physical abuse
(animal cruelty, blood, suicide). The lists were pre-processed to unfold compound categories into
individual elements (e.g. “Animal cruelty or animal death” → “animal cruelty”, “animal death”)
and lower-cased. Table 9 (see Appendix) shows the aligned source labels and the merged set.
This merged set of warnings comprises 35 categories; we removed the three rarest labels since
too few documents in the final dataset were annotated with them. From the remainder, we
derived the 32-label trigger warning set for the PAN 2023 Trigger Detection task (see Table 1).
The baseline and evaluation code used for this task is available at github.com/pan-webis-de/pan-code,
and the data are available at zenodo.org/record/7612628.
        </p>
        <sec id="sec-2-1-1">
          <title>The process of dying from the subject’s perspective, Drowning, Euthanasia</title>
          <p>Death of others Character death, Killing, Corpses, Coping with Loss or Grief</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Death of animals</title>
        </sec>
        <sec id="sec-2-1-3">
          <title>Sex between family members, Sibling Incest, Twincest</title>
        </sec>
        <sec id="sec-2-1-4">
          <title>Sex with a minor, consensual and non-consensual, Pedophilia</title>
        </sec>
        <sec id="sec-2-1-5">
          <title>Graphic display of sex, plays, toys, technique descriptions</title>
          <p>Three major observations can be made of the merged university label set (Table 9): First,
the granularity of triggers is not uniform (e.g., both abuse and the more specific child abuse
are included). Second, the set comprises subsets of related concepts which lend themselves to
semantic abstraction (e.g., sexism, classism and other -isms and -phobias can be considered types
of prejudice). Third, the prescribed list is not exhaustive, as is also pointed out on both websites.</p>
          <p>To obtain labels that abstract over the inconsistent granularity (Table 1), that are orthogonal in
terms of semantics, and to better inform later annotation decisions, we grouped the original labels
into semantically related subsets. The grouping was done by identifying semantic fields, trigger
domains, with which the triggering concepts can be associated via some semantic relation, for
instance, is-a or results-in; since the label set is sufficiently small, grouping was done manually.
Technically speaking, for warnings formulated as complex nouns, we first identified the semantic
content-bearing lexeme and used that as the basis for grouping. For most complex nouns, the
head noun was used; the label “pornographic content” is an example of an exception in the case
of which the content-bearing adjective was used to identify its semantic domain.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Acquiring the Source Documents</title>
        <p>Table 2 shows the descriptive statistics of our source data: ca. 8 million works of fan fiction from
Archive of our Own. We initially downloaded all works released between August 13, 2008 (the
platform launch) and August 09, 2021, from archiveofourown.org and extracted the document
text and metadata (i.e., the freeform tags) from the scraped HTML. To download the HTML
page of each work, we scraped the output of the search function to get the work ID and then
constructed a direct URL to that work’s page. Since the search function was limited to 10,000
works per page, we constructed queries to search for all works released on one particular day, for
each day in the release window.</p>
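        <p>The per-day query construction can be sketched as follows; the URL pattern and parameter name below are illustrative placeholders, not AO3’s actual search interface.</p>

```python
from datetime import date, timedelta

def daily_queries(start=date(2008, 8, 13), end=date(2021, 8, 9)):
    """Yield one search query per release day, so that no single
    result set exceeds the platform's 10,000-work limit."""
    day = start
    while day <= end:
        # Illustrative URL pattern; the real search form differs.
        yield f"/works/search?revised_at={day.isoformat()}"
        day += timedelta(days=1)
```

        <p>Each query’s result pages are then scraped for work IDs, from which the direct work URLs are constructed.</p>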
        <p>[Figure 1: Example excerpt of the AO3 tag graph around the “Abuse” warning, showing canonical-synonym, meta-sub, and parent–child tag relations and the resulting trigger subgraph.]</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Assigning Trigger Warnings to Source Documents</title>
        <p>We labeled all works in the source data via distant supervision based on the freeform tags assigned
to the works by their author(s). The results and evaluation are shown in Table 3. We first identify
the freeform tags that also indicate a warning from our 32-label set and, second, we assign this
warning to all works labeled with the indicative freeform tag. The underlying mapping table,
which maps from freeform tag to trigger warning, was created by (i) manually annotating the
2,000 most common tags, (ii) efficiently identifying sub-structures of the tag graph that indicate a
trigger warning, annotating each node in the structure with that warning, and (iii) merging both
results, giving priority to the manual annotations.</p>
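        <p>The distant-supervision step then reduces to a table lookup over each work’s freeform tags; a minimal sketch, with an invented excerpt of the mapping table:</p>

```python
# Hypothetical excerpt of the freeform-tag -> trigger-warning mapping table.
TAG_TO_WARNING = {
    "Character Death": "death",
    "Blood and Gore": "blood",
    "Abusive John Winchester": "abuse",
}

def assign_warnings(freeform_tags):
    """Assign every warning whose indicative freeform tag appears on the work."""
    return sorted({TAG_TO_WARNING[t] for t in freeform_tags if t in TAG_TO_WARNING})
```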
        <p>We manually annotated two sets of freeform tags: first, the 2,000 most frequent tags (0-2k),
which cover just over 50% of tag occurrences, and second, the 10,000–11,000 most frequent
tags (10-11k) as an evaluation dataset. All tags were annotated by two annotators; diverging
annotations were merged by critical discussion. Then, the sub-structures of the tag graph that
indicate the same trigger warning across all nodes were identified (cf. Figure 1) by extracting and
manually annotating rooted sub-graphs from the tag graph in a 5-stage process:
1. Grouping of all tags via the synonym relation and identification of the canonical tag. One
tag per synonym set is marked as canonical by wranglers; all other synonyms are direct
successors of the canonical tag and have no other arcs.
2. Identification of meta-sources: canonical tags that are source nodes in the meta-sub graph.</p>
        <p>Meta-sub relations indicate a directed lexical entailment between canonical tags and have a
typical depth of 2 to 4.
3. Identification of candidate sources of trigger graphs: meta-sources that are also direct
successors of the No Fandom node in the parent–child graph. Sinks in this graph are the
canonical tags and all predecessors are either a Fandom, media type, or No Fandom. The
latter is added as a parent to tags that apply to many Fandoms, including content warnings
but also, for example, holidays and languages. This yields ca. 5,000 tags.
4. Identification of trigger graph sources: manual annotation of all candidate sources,
discarding the nodes without a trigger warning.
5. Identification of all trigger graphs: manual depth-first traversal of the tag graph along the
meta-sub relation, starting from a trigger graph source. If a successor does not agree with
the trigger warning assigned to its predecessor, the arc between them is removed, and the
successor added as new trigger graph source to be annotated with a new trigger warning.</p>
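        <p>The stage-5 traversal can be sketched as follows, with the manual annotation decision replaced by a callback for illustration: the warning propagates depth-first along meta-sub arcs, and a disagreeing successor is cut off and queued as a new trigger-graph source.</p>

```python
def propagate_warnings(meta_sub, source, warning, agrees):
    """Depth-first propagation of `warning` from `source` along the
    meta-sub graph; `agrees(node, warning)` stands in for the manual
    annotation. Disagreeing successors start their own trigger subgraph."""
    labeled, new_sources = {}, []
    stack = [source]
    while stack:
        node = stack.pop()
        if node in labeled:
            continue
        if agrees(node, warning):
            labeled[node] = warning
            stack.extend(meta_sub.get(node, []))
        else:
            new_sources.append(node)  # arc removed; annotate separately later
    return labeled, new_sources
```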
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Sampling the Evaluation Dataset</title>
        <p>From the resulting collection of annotated fan fiction works, we sampled PAN23-trigger-detection
by discarding all works that had no warning assigned, were originally published pre-2009 (as
opposed to posted after that, since AO3 also archives works from older fan fiction sites), had
freeform tags that could not be clearly mapped, were not in English (ca. 8% of the works), had fewer
than 50 or more than 6,000 words (outliers; ease of computation), fewer than 2 or more than 66
freeform tags (confidence threshold), fewer than 1,000 hits (views), or fewer than 10 kudos (likes;
popularity threshold). We also removed all (near) duplicates. The resulting dataset contains
341,246 fan fiction works and was split with stratified sampling into 90:5:5 training, validation,
and test sets; i.e., we kept the label distribution equal across the three splits.</p>
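        <p>The sampling criteria amount to a conjunction of per-work predicates; a minimal sketch (field names are invented for illustration; the pre-2009 and deduplication filters are omitted):</p>

```python
def keep(work):
    """Sampling filter for PAN23-trigger-detection (illustrative field names)."""
    return (
        len(work["warnings"]) >= 1          # at least one trigger warning
        and work["language"] == "en"        # English only
        and 50 <= work["n_words"] <= 6000   # length outliers removed
        and 2 <= work["n_tags"] <= 66       # tag-count confidence threshold
        and work["hits"] >= 1000            # popularity thresholds
        and work["kudos"] >= 10
    )
```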
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Properties of the Evaluation Dataset</title>
        <p>Table 4 shows the descriptive statistics of the dataset splits. The training dataset with ca. 300,000
works is large enough to train deep neural classifiers. The datasets contain ca. 5% very short
documents (&lt;512 words) that can be used by a BERT-based system without truncation and
ca. 85% medium-sized documents (&lt;4,096 words) that can be used by a sparse-attention model.
Figure 2 shows the distribution of the labels over the test dataset. The most frequent label is
pornography and occurs in ca. 77% of the documents. Most labels are less common, between
ca. 10% for sexual-assault and 6e-4% for animal-cruelty. Documents carry between 1 and 13 labels;
ca. 71% have a single label, 20% two, and 6% three.</p>
        <p>,529462 ,42803 ,05323 ,26542 ,53902 ,67771 ,51761 ,92914 ,9498 ,2291 ,3283 ,8379 ,2558 ,5549 ,0498 ,5346 ,8918 ,7115 ,5412 ,8711 095 082 593 565 445 419 384 927 428 232 209 618
porsneoxguraalp-ahsysvaiuolltencaebusedeatphbrleogondanciynucnedsetragseuicidcehdiyldin-agsbeuhlsof-emhaorpmkhidomnbeaianptpailn-igldlnisebssoeesdcayttii-onhnga-tdreisdoarbddeurccthioildnfabti-rpthhombsiisaecxaisrrmiagterrsaacnisspmhoabbiaortaiobnlmeaiisnsmiomgayln-dyaecnalaitmhssails-cmruelty</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation and Baselines</title>
      <p>We evaluate the submissions primarily through the established multi-label classification metrics
F1 and Accuracy (primary metrics). In addition, we also evaluate the effectiveness of individual
labels/label groups (extended metrics) and the effectiveness in relation to document metadata.
Lastly, we construct and evaluate voting-based ensembles from the submissions.</p>
      <p>As primary metrics, we compare precision, recall, and F1 at both micro- and macro average, and
subset accuracy, which measures accuracy on a per-sample basis (i.e., if all labels of one example
are set correctly). In our assessments, we favor the macro over the micro F1 scores due to the label
imbalance. We also favor recall over precision, since we consider trigger warning assignment a
high-recall task where false negatives cause more harm than false positives. However, we opted
not to modify the metrics or their parameters to reflect this preference.</p>
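      <p>For reference, micro F1 pools true/false positives and false negatives over all labels (so frequent labels dominate the score), while macro F1 averages the per-label F1 scores (so rare labels weigh equally). A self-contained sketch of both:</p>

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def micro_macro_f1(y_true, y_pred, n_labels):
    """y_true/y_pred: lists of label-index sets, one set per document."""
    tp = [0] * n_labels; fp = [0] * n_labels; fn = [0] * n_labels
    for t, p in zip(y_true, y_pred):
        for l in range(n_labels):
            tp[l] += (l in t and l in p)
            fp[l] += (l not in t and l in p)
            fn[l] += (l in t and l not in p)
    micro = f1(sum(tp), sum(fp), sum(fn))          # pooled counts
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in range(n_labels)) / n_labels
    return micro, macro
```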
      <p>As extended metrics, we compare precision and recall of pornography (due to its frequency),
the average effectiveness of the 15 next-most common labels (sexual-assault–dissection), and the
average effectiveness of the 16 least common labels. We also compute the number of classes with
either zero or a very low (&lt;0.1) precision and recall to check for high-frequency label bias.</p>
      <p>As metadata-based metrics, we compare micro and macro F1 for the document subsets that fall
within certain metadata thresholds. First, we compare short (&lt;500), medium, and long (&gt;4,000)
documents. We assume that short works are easier to classify since models can capitalize more
directly on BERT (which has a short input size). Second, we compare works with few (&lt;5),
medium, and many (&gt;20) freeform tags. We assume that works with many freeform tags are
easier to classify because many tags suggest that authors took greater care with annotating their
works and the resulting higher label quality leads to better effectiveness. Third, we compare
works with low (&lt;50 comments, &lt;60 bookmarks, &gt;450 kudos, &gt;8,500 hits), medium, and high
(&gt;280 comments, &gt;330 bookmarks, &gt;1,850 kudos, &gt;35,000 hits) popularity. We assume that
works with high popularity are also easier to classify because authors are more diligent when
tagging works that gain much attention. Fourth, we compare works with an archive warning
(Graphic Depictions Of Violence, Major Character Death, Rape/Non-Con, Underage), without
warning (No Archive Warnings Apply), and works that do not specify the warnings (Choose Not
To Use Archive Warnings). We assume that works with a warning are easier to classify and works
without specified warning are the hardest, since authors hide warning tags within spoilers and
might therefore less diligently annotate freeform warnings. Fifth, we compare works with an
Explicit or Mature rating to works with neither. We assume that explicit or mature works contain
more markers and are thus easier to classify.</p>
      <p>Finally, we construct four ensembles from the submitted results, where the assignment of a
true label is decided by voting to surpass a threshold θ. The Top-3 ensemble uses the three best
submissions with θ = 2; the other ensembles use all submissions with θ ∈ {3, 5, 7}.</p>
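      <p>The ensemble construction reduces to per-label vote counting; a minimal sketch:</p>

```python
from collections import Counter

def vote_ensemble(predictions, theta):
    """predictions: label sets assigned to one document, one set per
    submission; a label is kept if at least `theta` submissions chose it."""
    votes = Counter(label for labels in predictions for label in labels)
    return {label for label, count in votes.items() if count >= theta}
```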
      <p>
        As a baseline, we trained an XGBoost [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] classifier based on word-1–3-gram features encoded
as TF·IDF document vectors with a minimum document frequency of 5. We used only the top
10,000 features according to a χ² feature selection. The dataset was undersampled uniformly at
random to 1,000 samples per label. As parameters, we used a max depth of 3, a learning rate
of 0.25, and 300 estimators with 10-round early stopping. Word-1, 2, and 3-grams
and character-3 and 5-grams were evaluated as features, as well as feature selection (with or
without), model parameters, and the thresholds for over- and undersampling via grid search.
      </p>
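      <p>The baseline’s feature space can be illustrated with a self-contained sketch of word-1–3-gram TF·IDF vectors with a minimum document frequency; the χ² selection, sampling, and the XGBoost classifier itself are omitted here.</p>

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """All word n-grams up to length n_max."""
    return [" ".join(tokens[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_vectors(docs, min_df=5):
    """Word-1-3-gram TF-IDF features with a minimum document frequency,
    mirroring the baseline's feature space."""
    tokenized = [ngrams(d.split()) for d in docs]
    df = Counter(g for t in tokenized for g in set(t))     # document frequency
    vocab = sorted(g for g, c in df.items() if c >= min_df)
    n = len(docs)
    idf = {g: math.log(n / df[g]) for g in vocab}
    vectors = []
    for t in tokenized:
        tf = Counter(t)
        vectors.append({g: tf[g] * idf[g] for g in vocab if g in tf})
    return vocab, vectors
```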
    </sec>
    <sec id="sec-4">
      <title>4. Submissions</title>
      <p>The 6 submissions to the PAN 2023 Trigger Detection task employed a broad set of techniques,
from hierarchical transformer structures to strategic feature engineering. Table 5 shows an
overview of the different strategies used by the participants. All participants used a form of a
neural network as a model, where RoBERTa was most common and most successful as a classifier
or pre-trained model to produce a strong input encoding. Most submissions also focused on
improving the long document aspect of the task (most documents are longer than the input size of
the state-of-the-art classification models) by using hierarchical classifiers (chunks are encoded,
and prediction is based on a combination of encodings), or voting-based approaches (chunks are
labeled individually, document labels are aggregated over chunk labels). The submissions cope
with the label imbalance (the most common label (pornography) is an order of magnitude more
common than the other labels) through over- and undersampling or by changing class-weights in
the loss function, so that misclassifying a rare class increases the error more than a common label.</p>
      <p>
        Sahin et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] submitted a hierarchical transformer architecture that achieved the top macro
F1 score (by a slim margin of 0.002) and came in second in micro F1 and accuracy, while having
a relatively high recall within the top approaches. The approach first segments the document
into chunks (200 words with 50 words overlap) and then pre-trains a RoBERTa transformer on
the chunks to learn the genre. The architecture then embeds all chunks of a document using the
pre-trained transformer, followed by an LSTM for each label (in a one-vs-all setting), predicting
the class from a sequence of chunk-embeddings (RoBERTa’s [CLS] token). To cope with label
imbalance, the approach assigns positive weights in the loss function to the rare half of the labels.
      </p>
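      <p>The chunking scheme (200-word windows with 50 words of overlap) can be sketched as:</p>

```python
def chunk(tokens, size=200, overlap=50):
    """Split a token list into overlapping windows, as used to feed
    long documents to a fixed-input-size transformer."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```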
      <p>
        Su et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] submitted a siamese transformer that achieves the second-best macro F1 score
(by a slim margin of 0.002) and the top scores in micro F1 and accuracy, while notably favoring
precision over recall. The approach segments the documents into 505-word chunks, encodes the
first and last chunk using RoBERTa, mean-pools the contextual embeddings (ignoring the [CLS]
token), and classifies based on the pooled embeddings using a 1D convolutional neural network.
      </p>
      <p>
        Cao et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] submitted a voting-based transformer that favors recall over precision. The
approach segments the training documents into chunks, assigns each chunk the labels from
its source document, and trains a single RoBERTa-based classifier on each chunk. To make
predictions, the documents are again chunked, the labels for each chunk are predicted, and a label
is assigned to the document if it is assigned to more than half of the chunks. The training data
was dynamically over- and undersampled: pornography was undersampled to 5,000 examples
and other labels to 2,000 examples. Examples with rare labels were replicated 8-10 times.
      </p>
      <p>
        Cao et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] also submitted a voting-based transformer that achieved very balanced results,
neither favoring macro over micro scores nor precision over recall. The approach chunks and
votes similarly to Cao et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] but builds two different models to overcome the data imbalance,
one for pornography and one for the other 31 classes. The pornography model was trained on
a random selection of 40,000 works with and 40,000 works without the pornography warning.
The model for the other labels removes works with only the pornography warning, undersamples
frequent classes to 3,000 examples, and oversamples rare labels by replicating works 4-6 times.
      </p>
      <p>
        Felser et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] submitted a 1-vs-rest multi-layer perceptron based on two features:
fasttext-based document embeddings and superclass probabilities. This approach achieved the top micro
and macro recall, at the cost of precision on the test dataset. Document embeddings were created
by training a fasttext model from the training data, generating the embeddings for each unique
word in a document, scaling them by term frequency, and adding and normalizing the scaled
word vectors over the document. The superclass probabilities were determined by grouping the
32 labels semantically into 6 superclasses, bootstrapping a seeded LDA with the 50 most relevant
bi-grams of each group (determined through a TF· IDF-like approach for n-gram weighting, which
downgrades pornographic terms), and training a classifier to predict the superclass based on the
topic model outputs, using class probabilities as features. Label imbalance was addressed via
class penalties in the loss function, where the MLP-2 variant has a higher penalty.
      </p>
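      <p>The document-embedding construction described above (word vectors scaled by term frequency, summed, and normalized) can be sketched as follows, with a toy word-vector table standing in for the trained fasttext model:</p>

```python
import math
from collections import Counter

def doc_embedding(tokens, word_vecs):
    """Sum term-frequency-scaled word vectors, then L2-normalize."""
    dim = len(next(iter(word_vecs.values())))
    acc = [0.0] * dim
    for word, freq in Counter(tokens).items():
        vec = word_vecs.get(word)
        if vec is not None:
            for i in range(dim):
                acc[i] += freq * vec[i]     # scale by term frequency
    norm = math.sqrt(sum(x * x for x in acc)) or 1.0
    return [x / norm for x in acc]          # unit-length document vector
```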
      <p>
        Lastly, Lakshmaiah et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] present an LSTM-based approach using GloVe embeddings,
which is third in micro F1 with very high precision but rather weak in macro average scores.
      </p>
      <p>[Table 7: Per-participant precision and recall on pornography, the 15 mid-frequency labels, and the 16 rarest labels, plus the number of labels with precision/recall of zero or below 0.1, for all submissions and the XGBoost baseline. Participants are sorted by total macro F1 (cf. Table 6).]</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Table 6 shows the overall results, with participants sorted by macro F1. The hierarchical classifiers are the most effective by a large margin, followed
by the XGBoost baseline. The most effective approach by macro F1 is the one by Sahin et al.
with 0.352, a small margin ahead of that of Su et al. with 0.350. The best approach by micro F1
and subset accuracy is the one by Su et al. The XGBoost baseline is only beaten by these two
top approaches. The models score very differently in precision and recall, depending on the
architecture. Four models score generally higher in recall, the other four in precision. There is no
obvious relationship between effectiveness and preference for precision or recall. The ensembles
(Top-3 and θ = 3) beat the submissions, but by a very small margin of ca. 0.02.</p>
      <p>[Table 8: Participant scores on document subsets defined by document length, number of freeform tags (tag confidence), popularity (hits, kudos, comments, bookmarks), archive warning status (with, without, unspecified), and explicit or mature rating.]</p>
      <p>Participants are sorted by total macro F1 (cf. Table 6).</p>
      <p>
        The models score very high on pornography and notably lower on all rare labels, which explains the difference
between macro and micro F1. There is a clear decrease in effectiveness with decreasing label
frequency. It also becomes more obvious that models tend to be good in either precision or recall
with large differences between them. Combining the strength of the high-recall and high-precision
approaches is a potential way forward, albeit our basic ensemble exploits that only marginally.
Regarding the document length, the macro F1 scores are mixed: Models that use the complete
work as single examples during training (Sahin et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the baseline, and Felser et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) are
slightly (0.05–0.1) less effective on short texts; models that use only a section of the document
(Su et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Cao, G. et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) are slightly (0.05–0.1) less effective on long texts. On micro F1,
all models tend to perform worse on shorter texts. This contradicts our assumption (and prior
evidence [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) that models will be generally better on short texts which can fully capitalize on
BERT's strength on short inputs. An alternative hypothesis is that shorter documents are simply
less clear and have fewer of the markers that the classifier expects to make a positive prediction.
Regarding the tag count, the top models are slightly (0–0.1) less effective when there are many
freeform tags. There is no difference between the less effective models. This also contradicts
our assumption that works with many tags are easier to classify due to higher label reliability.
Regarding popularity, there is no notable difference in micro F1. On macro F1, models are
slightly (0.04–0.14) more effective on high popularity works than on low popularity works. This
agrees with our assumption that labels of popular works are more reliable. Regarding the archive
warnings, there is no notable difference between works with or without warnings. However,
the most effective models are slightly (ca. 0.05 macro, ca. 0.15 micro) less effective on works
with undeclared warnings than on others. This agrees with our assumption that these works
are less diligently tagged by their authors (potentially as a spoiler tag). Lastly, regarding the
rating, models are more (ca. 0.2–0.3 micro F1) effective on explicit works, which is likely an
artifact from the very effective classification of the pornography label. On macro F1, contrary to
the micro score, the submissions are slightly (0–0.1) less effective on explicit works. This also
contradicts our assumption that explicit or mature works are easier to classify.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Conclusion</title>
      <p>We present the first task on trigger detection at PAN 2023, for which we created a 341,000
document evaluation resource of fan fiction works annotated with up to 32 labels in a multi-label
classification setting. We extensively evaluate the results of six participant submissions. The most
effective submissions score 0.35 on macro F1 and 0.75 on micro F1.</p>
      <p>We find several factors that impact the effectiveness of the submissions. First, we find that
encoding and training on the full documents is important for good scores on long documents and
hierarchical models appear to be best in this regard. We assume that it is central to find triggering
passages that only appear in some parts of the document and that inform the classification decision,
instead of finding the topic or style that is also present in the beginning. Surprisingly, short
documents appear to be much harder to classify, so models with a strong encoding for short
texts (BERT) are important and document vectors are less effective as features. None of the top
models manage to be great at both short- and long-document effectiveness, leaving potential for
improvement. The effect sizes on all metadata comparisons are small (ca. 0.05–0.15).</p>
      <p>Second, we find that all submissions are much less effective on rare labels and very effective on
very common labels. We assume that the triggering concept goes beyond what can be observed
from the passages in the training data, hence the models can not connect the triggers in the test
data to the learned concept.</p>
      <p>Third, we find that the submissions are more effective on popular works and less effective
on works with a Choose Not To Use Archive Warnings declaration. We assume that authors’
diligence in annotating freeform tags varies a lot, so some works are under-tagged (i.e. authors
want to avoid spoilers) and authors are more diligent in assigning warnings for popular works.
However, we also find that the submissions are less effective on works with many freeform tags,
so the reverse assumption (over-tagging decreases label reliability) also has some merit.
</p>
    </sec>
  </body>
  <back>
  </back>
</article>