<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the Trigger Detection Task at PAN 2023</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matti Wiegmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Magdalena Wolska</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Potthast</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benno Stein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bauhaus-Universität Weimar</institution>
          ,
          <addr-line>Weimar</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leipzig University</institution>
          ,
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ScaDS.AI</institution>
          ,
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Trigger warnings are document labels that warn the reader about content that might cause discomfort or distress. These labels are often asked for by online communities, especially by vulnerable groups. Here, we present trigger detection at PAN 2023 as a multi-label document classification task: Given a fan fiction document, assign all appropriate trigger warnings from a given label set. We derive a set of 32 trigger warnings based on two widely referenced institutional guidelines on sensitive content. We compile a 341,000-document evaluation resource of fan fiction from Archive of our Own (AO3), fully annotated with the 32 trigger warnings. Six participants submitted solutions to the task. The submissions cover several different methods; the most effective ones use hierarchical deep learning with RoBERTa-based encodings. The top approach achieves a macro F1 of 0.35 and a micro F1 of 0.75.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In this pilot edition of the Trigger Detection task at PAN 2023, we establish the computational
problem of identifying whether or not a given document contains triggering content. In particular,
we formalize trigger detection as a multi-label document classification (MLC) task as follows:
Given a fan fiction document, assign all appropriate trigger warnings from the given label set.</p>
      <p>We created a new evaluation resource, PAN23-trigger-detection, containing ca. 340,000 fan
fiction works from Archive of our Own (AO3) annotated with a 32-label trigger warning set.
We rely on user-generated labels (authors assigned warning-like labels) and follow the authors’
understanding of triggers and which documents require a warning. The warnings are assigned
via AO3’s freeform content descriptors (“tags”), a custom, high-dimensional label system. Since
tags include also non-warning descriptors, we developed a distant-supervision strategy to detect
if a freeform tag corresponds to one of the 32 predefined warnings compiled from institutional
content warning guidelines. The task is primarily evaluated with the standard measures for
multi-label classification, micro and macro F1. In total, six participants submitted software to
Trigger Detection 2023.</p>
      <p>This overview paper first details the creation of the evaluation resource (Section 2), in particular
the distillation of the warning label set from two institutional content guidelines, the scraping of
AO3, the distant-supervision labeling, and the curation of the works. Furthermore, the evaluation
procedure (metrics and baselines) is described in Section 3, the six participant submissions are
described in Section 4, and the results are discussed in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>
        For the trigger detection task, we created a new evaluation resource, the PAN23-trigger-detection
corpus, consisting of 341,246 fan fiction works downloaded from Archive of our Own and
annotated in a multi-label setting with a set of 32 warning labels. An extended version of our
annotation method and the evaluation resource is presented by Wiegmann et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Curating a Set of Warning Labels</title>
        <p>
          Since there is no authoritative (closed-set) set of trigger warning labels, we derived these labels
for use in our dataset from two guideline documents for labeling sensitive content: the University
of Reading list of “themes that require trigger warnings” [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and the University of Michigan
list of content warnings [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The two largely overlapping lists each comprise 21 categories of
triggering concepts, including health-related (eating disorders, mental illness), sexually-oriented
(sexual assault, pornography) as well as verbal (hate speech, racial slurs), and physical abuse
(animal cruelty, blood, suicide). The lists were pre-processed to unfold compound categories into
individual elements (e.g. “Animal cruelty or animal death” → “animal cruelty”, “animal death”)
and lower-cased. Table 9 (see Appendix) shows the aligned source labels and the merged set.
This merged set of warnings comprises 35 categories; we removed the three rarest labels since
too few documents in the final dataset were annotated with them. From the remainder, we
derived the 32-label trigger warning set for the PAN 2023 Trigger Detection task (see Table 1).
The baseline and evaluation code used for this task is available at github.com/pan-webis-de/pan-code,
and the data are available at zenodo.org/record/7612628.
        </p>
        <sec id="sec-2-1-1">
          <title>The process of dying from the subject’s perspective, Drowning, Euthanasia</title>
          <p>Death of others Character death, Killing, Corpses, Coping with Loss or Grief</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Death of animals</title>
        </sec>
        <sec id="sec-2-1-3">
          <title>Sex between family members, Sibling Incest, Twincest</title>
        </sec>
        <sec id="sec-2-1-4">
          <title>Sex with a minor, consensual and non-consensual, Pedophilia</title>
        </sec>
        <sec id="sec-2-1-5">
          <title>Graphic display of sex, plays, toys, technique descriptions</title>
          <p>Three major observations can be made of the merged university label set (Table 9): First,
the granularity of triggers is not uniform (e.g., both abuse and the more specific child abuse
are included). Second, the set comprises subsets of related concepts which lend themselves to
semantic abstraction (e.g., sexism, classism and other -isms and -phobias can be considered types
of prejudice). Third, the prescribed list is not exhaustive, as is also pointed out on both websites.</p>
          <p>To obtain labels that abstract over the inconsistent granularity (Table 1), that are orthogonal in
terms of semantics, and to better inform later annotation decisions, we grouped the original labels
into semantically related subsets. The grouping was done by identifying semantic fields, trigger
domains, with which the triggering concepts can be associated via some semantic relation, for
instance, is-a or results-in; since the label set is sufficiently small, grouping was done manually.
Technically speaking, for warnings formulated as complex nouns, we first identified the semantic
content-bearing lexeme and used that as the basis for grouping. For most complex nouns, the
head noun was used; the label “pornographic content” is an example of an exception in the case
of which the content-bearing adjective was used to identify its semantic domain.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Acquiring the Source Documents</title>
        <p>Table 2 shows the descriptive statistics of our source data: ca. 8 million works of fan fiction from
Archive of our Own. We initially downloaded all works released between August 13, 2008 (the
platform launch) and August 09, 2021, from archiveofourown.org and extracted the document
text and metadata (i.e., the freeform tags) from the scraped HTML. To download the HTML
page of each work, we scraped the output of the search function to get the work ID and then
constructed a direct URL to that work’s page. Since the search function was limited to 10,000
works per page, we constructed queries to search for all works released on one particular day, for
each day in the release window.</p>
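        <p>The per-day query construction can be sketched as follows; the URL pattern and parameter name below are illustrative placeholders, not AO3’s actual search interface.</p>

```python
from datetime import date, timedelta

def daily_queries(start=date(2008, 8, 13), end=date(2021, 8, 9)):
    """Yield one search query per release day, so that no single
    result set exceeds the platform's 10,000-work limit."""
    day = start
    while day <= end:
        # Illustrative URL pattern; the real search form differs.
        yield f"/works/search?revised_at={day.isoformat()}"
        day += timedelta(days=1)
```

        <p>Each query’s result pages are then scraped for work IDs, from which the direct work URLs are constructed.</p>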
        <p>[Figure 1: Example excerpt of the AO3 tag graph around the “Abuse” warning, showing canonical-synonym, meta-sub, and parent–child tag relations and the resulting trigger subgraph.]</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Assigning Trigger Warnings to Source Documents</title>
        <p>We labeled all works in the source data via distant supervision based on the freeform tags assigned
to the works by their author(s). The results and evaluation are shown in Table 3. We first identify
the freeform tags that also indicate a warning from our 32-label set and, second, we assign this
warning to all works labeled with the indicative freeform tag. The underlying mapping table,
which maps from freeform tag to trigger warning, was created by (i) manually annotating the
2,000 most common tags, (ii) efficiently identifying sub-structures of the tag graph that indicate a
trigger warning, annotating each node in the structure with that warning, and (iii) merging both
results, giving priority to the manual annotations.</p>
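        <p>The distant-supervision step then reduces to a table lookup over each work’s freeform tags; a minimal sketch, with an invented excerpt of the mapping table:</p>

```python
# Hypothetical excerpt of the freeform-tag -> trigger-warning mapping table.
TAG_TO_WARNING = {
    "Character Death": "death",
    "Blood and Gore": "blood",
    "Abusive John Winchester": "abuse",
}

def assign_warnings(freeform_tags):
    """Assign every warning whose indicative freeform tag appears on the work."""
    return sorted({TAG_TO_WARNING[t] for t in freeform_tags if t in TAG_TO_WARNING})
```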
        <p>We manually annotated two sets of freeform tags: first, the 2,000 most frequent tags (0-2k),
which cover just over 50% of tag occurrences, and second, the 10,000–11,000 most frequent
tags (10-11k) as an evaluation dataset. All tags were annotated by two annotators; diverging
annotations were merged by critical discussion. Then, the sub-structures of the tag graph that
indicate the same trigger warning across all nodes were identified (cf. Figure 1) by extracting and
manually annotating rooted sub-graphs from the tag graph in a 5-stage process:
1. Grouping of all tags via the synonym relation and identification of the canonical tag. One
tag per synonym set is marked as canonical by wranglers; all other synonyms are direct
successors of the canonical tag and have no other arcs.
2. Identification of meta-sources: canonical tags that are source nodes in the meta-sub graph.</p>
        <p>Meta-sub relations indicate a directed lexical entailment between canonical tags and have a
typical depth of 2 to 4.
3. Identification of candidate sources of trigger graphs: meta-sources that are also direct
successors of the No Fandom node in the parent–child graph. Sinks in this graph are the
canonical tags and all predecessors are either a Fandom, media type, or No Fandom. The
latter is added as a parent to tags that apply to many Fandoms, including content warnings
but also, for example, holidays and languages. This yields ca. 5,000 tags.
4. Identification of trigger graph sources: manual annotation of all candidate sources,
discarding the nodes without a trigger warning.
5. Identification of all trigger graphs: manual depth-first traversal of the tag graph along the
meta-sub relation, starting from a trigger graph source. If a successor does not agree with
the trigger warning assigned to its predecessor, the arc between them is removed, and the
successor added as new trigger graph source to be annotated with a new trigger warning.</p>
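        <p>The stage-5 traversal can be sketched as follows, with the manual annotation decision replaced by a callback for illustration: the warning propagates depth-first along meta-sub arcs, and a disagreeing successor is cut off and queued as a new trigger-graph source.</p>

```python
def propagate_warnings(meta_sub, source, warning, agrees):
    """Depth-first propagation of `warning` from `source` along the
    meta-sub graph; `agrees(node, warning)` stands in for the manual
    annotation. Disagreeing successors start their own trigger subgraph."""
    labeled, new_sources = {}, []
    stack = [source]
    while stack:
        node = stack.pop()
        if node in labeled:
            continue
        if agrees(node, warning):
            labeled[node] = warning
            stack.extend(meta_sub.get(node, []))
        else:
            new_sources.append(node)  # arc removed; annotate separately later
    return labeled, new_sources
```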
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Sampling the Evaluation Dataset</title>
        <p>From the resulting collection of annotated fan fiction works, we sampled PAN23-trigger-detection
by discarding all works that had no warning assigned, were originally published pre-2009 (as
opposed to posted after that, since AO3 also archives works from older fan fiction sites), had
freeform tags that could not be clearly mapped, were not in English (ca. 8% of the works), had fewer
than 50 or more than 6,000 words (outliers; ease of computation), fewer than 2 or more than 66
freeform tags (confidence threshold), fewer than 1,000 hits (views), or fewer than 10 kudos (likes;
popularity threshold). We also removed all (near) duplicates. The resulting dataset contains
341,246 fan fiction works and was split with stratified sampling into 90:5:5 training, validation,
and test sets; i.e., we kept the label distribution equal across the three splits.</p>
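        <p>The sampling criteria amount to a conjunction of per-work predicates; a minimal sketch (field names are invented for illustration; the pre-2009 and deduplication filters are omitted):</p>

```python
def keep(work):
    """Sampling filter for PAN23-trigger-detection (illustrative field names)."""
    return (
        len(work["warnings"]) >= 1          # at least one trigger warning
        and work["language"] == "en"        # English only
        and 50 <= work["n_words"] <= 6000   # length outliers removed
        and 2 <= work["n_tags"] <= 66       # tag-count confidence threshold
        and work["hits"] >= 1000            # popularity thresholds
        and work["kudos"] >= 10
    )
```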
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Properties of the Evaluation Dataset</title>
        <p>Table 4 shows the descriptive statistics of the dataset splits. The training dataset with ca. 300,000
works is large enough to train deep neural classifiers. The datasets contain ca. 5% very short
documents (&lt;512 words) that can be used by a BERT-based system without truncation and
ca. 85% medium-sized documents (&lt;4,096 words) that can be used by a sparse-attention model.
Figure 2 shows the distribution of the labels over the test dataset. The most frequent label is
pornography and occurs in ca. 77% of the documents. Most labels are less common, between
ca. 10% for sexual-assault and 6e-4% for animal-cruelty. Documents carry between 1 and 13 labels;
ca. 71% have a single label, 20% two, and 6% three.</p>
        <p>,529462 ,42803 ,05323 ,26542 ,53902 ,67771 ,51761 ,92914 ,9498 ,2291 ,3283 ,8379 ,2558 ,5549 ,0498 ,5346 ,8918 ,7115 ,5412 ,8711 095 082 593 565 445 419 384 927 428 232 209 618
porsneoxguraalp-ahsysvaiuolltencaebusedeatphbrleogondanciynucnedsetragseuicidcehdiyldin-agsbeuhlsof-emhaorpmkhidomnbeaianptpailn-igldlnisebssoeesdcayttii-onhnga-tdreisdoarbddeurccthioildnfabti-rpthhombsiisaecxaisrrmiagterrsaacnisspmhoabbiaortaiobnlmeaiisnsmiomgayln-dyaecnalaitmhssails-cmruelty</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation and Baselines</title>
      <p>We evaluate the submissions primarily through the established multi-label classification metrics
F1 and Accuracy (primary metrics). In addition, we also evaluate the effectiveness of individual
labels/label groups (extended metrics) and the effectiveness in relation to document metadata.
Lastly, we construct and evaluate voting-based ensembles from the submissions.</p>
      <p>As primary metrics, we compare precision, recall, and F1 at both micro- and macro average, and
subset accuracy, which measures accuracy on a per-sample basis (i.e., if all labels of one example
are set correctly). In our assessments, we favor the macro over the micro F1 scores due to the label
imbalance. We also favor recall over precision, since we consider trigger warning assignment a
high-recall task where false negatives cause more harm than false positives. However, we opted
not to modify the metrics or their parameters to reflect this preference.</p>
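      <p>For reference, micro F1 pools true/false positives and false negatives over all labels (so frequent labels dominate the score), while macro F1 averages the per-label F1 scores (so rare labels weigh equally). A self-contained sketch of both:</p>

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def micro_macro_f1(y_true, y_pred, n_labels):
    """y_true/y_pred: lists of label-index sets, one set per document."""
    tp = [0] * n_labels; fp = [0] * n_labels; fn = [0] * n_labels
    for t, p in zip(y_true, y_pred):
        for l in range(n_labels):
            tp[l] += (l in t and l in p)
            fp[l] += (l not in t and l in p)
            fn[l] += (l in t and l not in p)
    micro = f1(sum(tp), sum(fp), sum(fn))          # pooled counts
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in range(n_labels)) / n_labels
    return micro, macro
```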
      <p>As extended metrics, we compare precision and recall of pornography (due to its frequency),
the average effectiveness of the 15 next-most common labels (sexual-assault–dissection), and the
average effectiveness of the 16 least common labels. We also compute the number of classes with
either zero or a very low (&lt;0.1) precision and recall to check for high-frequency label bias.</p>
      <p>As metadata-based metrics, we compare micro and macro F1 for the document subsets that fall
within certain metadata thresholds. First, we compare short (&lt;500), medium, and long (&gt;4,000)
documents. We assume that short works are easier to classify since models can capitalize more
directly on BERT (which has a short input size). Second, we compare works with few (&lt;5),
medium, and many (&gt;20) freeform tags. We assume that works with many freeform tags are
easier to classify because many tags suggest that authors took greater care with annotating their
works and the resulting higher label quality leads to better effectiveness. Third, we compare
works with low (&lt;50 comments, &lt;60 bookmarks, &gt;450 kudos, &gt;8,500 hits), medium, and high
(&gt;280 comments, &gt;330 bookmarks, &gt;1,850 kudos, &gt;35,000 hits) popularity. We assume that
works with high popularity are also easier to classify because authors are more diligent when
tagging works that gain much attention. Fourth, we compare works with an archive warning
(Graphic Depictions Of Violence, Major Character Death, Rape/Non-Con, Underage), without
warning (No Archive Warnings Apply), and works that do not specify the warnings (Choose Not
To Use Archive Warnings). We assume that works with a warning are easier to classify and works
without specified warning are the hardest, since authors hide warning tags within spoilers and
might therefore less diligently annotate freeform warnings. Fifth, we compare works with an
Explicit or Mature rating to works with neither. We assume that explicit or mature works contain
more markers and are thus easier to classify.</p>
      <p>Finally, we construct four ensembles from the submitted results, where the assignment of a
true label is decided by voting to surpass a threshold θ. The Top-3 ensemble uses the three best
submissions with θ = 2; the other ensembles use all submissions with θ ∈ {3, 5, 7}.</p>
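      <p>The ensemble construction reduces to per-label vote counting; a minimal sketch:</p>

```python
from collections import Counter

def vote_ensemble(predictions, theta):
    """predictions: label sets assigned to one document, one set per
    submission; a label is kept if at least `theta` submissions chose it."""
    votes = Counter(label for labels in predictions for label in labels)
    return {label for label, count in votes.items() if count >= theta}
```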
      <p>
        As a baseline, we trained an XGBoost [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] classifier based on word-1–3-gram features encoded
as TF·IDF document vectors with a minimum document frequency of 5. We used only the top
10,000 features according to a χ² feature selection. The dataset was undersampled uniformly at
random to 1,000 samples per label. As parameters, we used a max depth of 3, a learning rate
of 0.25, and 300 estimators with 10-round early stopping. Word-1, 2, and 3-grams
and character-3 and 5-grams were evaluated as features, as well as feature selection (with or
without), model parameters, and the thresholds for over- and undersampling via grid search.
      </p>
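      <p>The baseline’s feature space can be illustrated with a self-contained sketch of word-1–3-gram TF·IDF vectors with a minimum document frequency; the χ² selection, sampling, and the XGBoost classifier itself are omitted here.</p>

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """All word n-grams up to length n_max."""
    return [" ".join(tokens[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_vectors(docs, min_df=5):
    """Word-1-3-gram TF-IDF features with a minimum document frequency,
    mirroring the baseline's feature space."""
    tokenized = [ngrams(d.split()) for d in docs]
    df = Counter(g for t in tokenized for g in set(t))     # document frequency
    vocab = sorted(g for g, c in df.items() if c >= min_df)
    n = len(docs)
    idf = {g: math.log(n / df[g]) for g in vocab}
    vectors = []
    for t in tokenized:
        tf = Counter(t)
        vectors.append({g: tf[g] * idf[g] for g in vocab if g in tf})
    return vocab, vectors
```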
    </sec>
    <sec id="sec-4">
      <title>4. Submissions</title>
      <p>The 6 submissions to the PAN 2023 Trigger Detection task employed a broad set of techniques,
from hierarchical transformer structures to strategic feature engineering. Table 5 shows an
overview of the different strategies used by the participants. All participants used a form of a
neural network as a model, where RoBERTa was most common and most successful as a classifier
or pre-trained model to produce a strong input encoding. Most submissions also focused on
improving the long document aspect of the task (most documents are longer than the input size of
the state-of-the-art classification models) by using hierarchical classifiers (chunks are encoded,
and prediction is based on a combination of encodings), or voting-based approaches (chunks are
labeled individually, document labels are aggregated over chunk labels). The submissions cope
with the label imbalance (the most common label (pornography) is an order of magnitude more
common than the other labels) through over- and undersampling or by changing class-weights in
the loss function, so that misclassifying a rare class increases the error more than a common label.</p>
      <p>
        Sahin et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] submitted a hierarchical transformer architecture that achieved the top macro
F1 score (by a slim margin of 0.002) and came in second in micro F1 and accuracy, while having
a relatively high recall within the top approaches. The approach first segments the document
into chunks (200 words with 50 words overlap) and then pre-trains a RoBERTa transformer on
the chunks to learn the genre. The architecture then embeds all chunks of a document using the
pre-trained transformer, followed by an LSTM for each label (in a one-vs-all setting), predicting
the class from a sequence of chunk-embeddings (RoBERTa’s [CLS] token). To cope with label
imbalance, the approach assigns positive weights in the loss function to the rare half of the labels.
      </p>
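      <p>The chunking scheme (200-word windows with 50 words of overlap) can be sketched as:</p>

```python
def chunk(tokens, size=200, overlap=50):
    """Split a token list into overlapping windows, as used to feed
    long documents to a fixed-input-size transformer."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```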
      <p>
        Su et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] submitted a siamese transformer that achieves the second-best macro F1 score
(by a slim margin of 0.002) and the top scores in micro F1 and accuracy, while notably favoring
precision over recall. The approach segments the documents into 505-word chunks, encodes the
first and last chunk using RoBERTa, mean-pools the contextual embeddings (ignoring the [CLS]
token), and classifies based on the pooled embeddings using a 1D convolutional neural network.
      </p>
      <p>
        Cao et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] submitted a voting-based transformer that favors recall over precision. The
approach segments the training documents into chunks, assigns each chunk the labels from
its source document, and trains a single RoBERTa-based classifier on each chunk. To make
predictions, the documents are again chunked, the labels for each chunk are predicted, and a label
is assigned to the document if it is assigned to more than half of the chunks. The training data
was dynamically over- and undersampled: pornography was undersampled to 5,000 examples
and other labels to 2,000 examples. Examples with rare labels were replicated 8-10 times.
      </p>
      <p>
        Cao et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] also submitted a voting-based transformer that achieved very balanced results,
neither favoring macro over micro scores nor precision over recall. The approach chunks and
votes similarly to Cao et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] but builds two different models to overcome the data imbalance,
one for pornography and one for the other 31 classes. The pornography model was trained on
a random selection of 40,000 works with and 40,000 works without the pornography warning.
The model for the other labels removes works with only the pornography warning, undersamples
frequent classes to 3,000 examples, and oversamples rare labels by replicating works 4-6 times.
      </p>
      <p>
        Felser et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] submitted a 1-vs-rest multi-layer perceptron based on two features:
fasttext-based document embeddings and superclass probabilities. This approach achieved the top micro
and macro recall, at the cost of precision on the test dataset. Document embeddings were created
by training a fasttext model from the training data, generating the embeddings for each unique
word in a document, scaling them by term frequency, and adding and normalizing the scaled
word vectors over the document. The superclass probabilities were determined by grouping the
32 labels semantically into 6 superclasses, bootstrapping a seeded LDA with the 50 most relevant
bi-grams of each group (determined through a TF· IDF-like approach for n-gram weighting, which
downgrades pornographic terms), and training a classifier to predict the superclass based on the
topic model outputs, using class probabilities as features. Label imbalance was addressed via
class penalties in the loss function, where the MLP-2 variant has a higher penalty.
      </p>
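      <p>The document-embedding construction described above (word vectors scaled by term frequency, summed, and normalized) can be sketched as follows, with a toy word-vector table standing in for the trained fasttext model:</p>

```python
import math
from collections import Counter

def doc_embedding(tokens, word_vecs):
    """Sum term-frequency-scaled word vectors, then L2-normalize."""
    dim = len(next(iter(word_vecs.values())))
    acc = [0.0] * dim
    for word, freq in Counter(tokens).items():
        vec = word_vecs.get(word)
        if vec is not None:
            for i in range(dim):
                acc[i] += freq * vec[i]     # scale by term frequency
    norm = math.sqrt(sum(x * x for x in acc)) or 1.0
    return [x / norm for x in acc]          # unit-length document vector
```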
      <p>
        Lastly, Lakshmaiah et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] present an LSTM-based approach using GloVe embeddings,
which is third in micro F1 with very high precision but rather weak in macro average scores.
      </p>
      <p>[Table 7: Per-participant precision and recall on pornography, the 15 mid-frequency labels, and the 16 rarest labels, plus the number of labels with precision/recall of zero or below 0.1, for all submissions and the XGBoost baseline. Participants are sorted by total macro F1 (cf. Table 6).]</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Table 6 shows the overall results, with participants sorted by macro F1. The hierarchical classifiers are the most effective by a large margin, followed
by the XGBoost baseline. The most effective approach by macro F1 is the one by Sahin et al.
with 0.352, a small margin ahead of that of Su et al. with 0.350. The best approach by micro F1
and subset accuracy is the one by Su et al. The XGBoost baseline is only beaten by these two
top approaches. The models score very differently in precision and recall, depending on the
architecture. Four models score generally higher in recall, the other four in precision. There is no
obvious relationship between effectiveness and preference for precision or recall. The ensembles
(Top-3 and θ = 3) beat the submissions, but by a very small margin of ca. 0.02.</p>
      <p>[Table 8: Participant scores on document subsets defined by document length, number of freeform tags (tag confidence), popularity (hits, kudos, comments, bookmarks), archive warning status (with, without, unspecified), and explicit or mature rating.]</p>
      <p>Participants are sorted by total macro F1 (cf. Table 6).</p>
      <p>
        The models score very high on pornography and notably lower on all rare labels, which explains the difference
between macro and micro F1. There is a clear decrease in effectiveness with decreasing label
frequency. It also becomes more obvious that models tend to be good in either precision or recall
with large differences between them. Combining the strength of the high-recall and high-precision
approaches is a potential way forward, albeit our basic ensemble exploits that only marginally.
Regarding the document length, the macro F1 scores are mixed: Models that use the complete
work as single examples during training (Sahin et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the baseline, and Felser et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) are
slightly (0.05–0.1) less effective on short texts; models that use only a section of the document
(Su et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Cao, G. et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) are slightly (0.05–0.1) less effective on long texts. On micro F1,
all models tend to perform worse on shorter texts. This contradicts our assumption (and prior
evidence [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) that models will be generally better on short texts which can fully capitalize on
BERT's strength on short inputs. An alternative hypothesis is that shorter documents are simply
less clear and have fewer of the markers that the classifier expects to make a positive prediction.
Regarding the tag count, the top models are slightly (0–0.1) less effective when there are many
freeform tags. There is no difference between the less effective models. This also contradicts
our assumption that works with many tags are easier to classify due to higher label reliability.
Regarding popularity, there is no notable difference in micro F1. On macro F1, models are
slightly (0.04–0.14) more effective on high popularity works than on low popularity works. This
agrees with our assumption that labels of popular works are more reliable. Regarding the archive
warnings, there is no notable difference between works with or without warnings. However,
the most effective models are slightly (ca. 0.05 macro, ca. 0.15 micro) less effective on works
with undeclared warnings than on others. This agrees with our assumption that these works
are less diligently tagged by their authors (potentially as a spoiler tag). Lastly, regarding the
rating, models are more (ca. 0.2–0.3 micro F1) effective on explicit works, which is likely an
artifact from the very effective classification of the pornography label. On macro F1, contrary to
the micro score, the submissions are slightly (0–0.1) less effective on explicit works. This also
contradicts our assumption that explicit or mature works are easier to classify.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Conclusion</title>
      <p>We present the first task on trigger detection at PAN 2023, for which we created a 341,000
document evaluation resource of fan fiction works annotated with up to 32 labels in a multi-label
classification setting. We extensively evaluate the results of six participant submissions. The most
effective submissions score 0.35 on macro F1 and 0.75 on micro F1.</p>
      <p>We find several factors that impact the effectiveness of the submissions. First, we find that
encoding and training on the full documents is important for good scores on long documents and
hierarchical models appear to be best in this regard. We assume that it is central to find triggering
passages that only appear in some parts of the document and that inform the classification decision,
instead of finding the topic or style that is also present in the beginning. Surprisingly, short
documents appear to be much harder to classify, so models with a strong encoding for short
texts (BERT) are important and document vectors are less effective as features. None of the top
models manage to be great at both short- and long-document effectiveness, leaving potential for
improvement. The effect sizes on all metadata comparisons are small (ca. 0.05–0.15).</p>
      <p>Second, we find that all submissions are much less effective on rare labels and very effective on
very common labels. We assume that the triggering concept goes beyond what can be observed
from the passages in the training data, hence the models can not connect the triggers in the test
data to the learned concept.</p>
      <p>Third, we find that the submissions are more effective on popular works and less effective
on works with a Choose Not To Use Archive Warnings declaration. We assume that authors’
diligence in annotating freeform tags varies a lot, so some works are under-tagged (i.e. authors
want to avoid spoilers) and authors are more diligent in assigning warnings for popular works.
However, we also find that the submissions are less effective on works with many freeform tags,
so the reverse assumption (over-tagging decreases label reliability) also has some merit.
</p>
    </sec>
  </body>
  <back>
  </back>
</article>