Towards Reducing Misinformation and Toxic Content Using Cross-Lingual Text Summarization

Hoai Nam Tran, Udo Kruschwitz
University of Regensburg, Germany
Hoai-Nam.Tran@student.ur.de (H. N. Tran); Udo.Kruschwitz@ur.de (U. Kruschwitz); ORCID 0000-0002-5503-0341 (U. Kruschwitz)

ROMCIR 2023: The 3rd Workshop on Reducing Online Misinformation through Credible Information Retrieval, held as part of ECIR 2023: the 45th European Conference on Information Retrieval, April 2-6, 2023, Dublin, Ireland

Abstract
Misinformation has long been considered a major problem in our digital world, but automatically identifying it remains a challenging issue. It becomes even more of a problem when tackling content written in languages other than English. We also note that much progress has been made in classifying short social media posts, but there are many other types of misinformation. We present steps towards addressing the problem by adopting ideas that have shown promise in related prior work, namely applying extractive and abstractive text summarization methods so that we can process documents of any length, and by incorporating machine translation as part of our overall architecture. We consider misinformation as just one of many types of content that should be identified automatically on the way to a healthier digital ecosystem and see toxic content such as hate speech as naturally falling within the scope of our work. We demonstrate on several benchmark collections covering both misinformation and toxic content that our approach is robust and achieves competitive performance on these datasets. This offers plenty of scope for future work. To foster reproducibility, we make all code and models available to the community via GitHub and Hugging Face.

Keywords
Misinformation, Text summarization, Toxic content detection, Cross-lingual

1. Introduction

Fake News and Hate Speech have one thing in common: the aim is to spread toxicity and to bring harm to the world. They have now become serious and significant social and political issues [1]. How did we get there? One aspect is that users have been shown to be more easily persuaded and influenced by social media posts, causing them to change their attitude [2]. In combination with the excessive usage of social media, the desire for validation and the fear of rejection negatively impact our mental health, especially for teenagers and children [3]. Much progress has been made recently in addressing the challenge, often focussing on social media. Searching for relevant information with common information retrieval systems and natural language processing pipelines becomes more complicated as the amount of harmful misinformation grows. The flood of toxic content and polarization leads to distrust in any news channel; e.g., only 26% of American adults trust any news media [4, 5], which is why we need to improve the quality of the information we consume. Most competitive approaches incorporate transformer-based models like BERT [6] to assist social media moderators and fact-checkers in combating harmful content.
However, since news articles and popular blog posts are also affected by misinformation and hateful assertions, and transformer-based models have a limited input size (e.g., 512 tokens for BERT), the challenge here is to find a way to also use these models effectively for longer texts. This is where we propose text summarization. The second motivation is the fact that very few languages can be considered resource-rich, making it desirable to tap into such resources when tackling toxic content in other languages.

This paper presents a framework combining automatic machine translation, text summarization, and classification, tackling misinformation and toxic content. We provide experimental results for several common benchmarks using both binary and multi-class classification. To foster reproducibility, we make all code, hyperparameters, and detailed result tables available via GitHub (https://github.com/HN-Tran/ROMCIR_2023).

2. Related Work

This section provides an overview of related work in fake news detection, hate speech detection, multilingual machine translation, and text summarization. Since misinformation leads to toxic polarization, which in turn leads to abusive language, and hate speech is generally considered harmful, we consider both fields to fall within the scope of detecting toxic content.

Fake news detection and hate speech detection (HSD) are two active research areas, with research often guided by shared tasks and competitions, e.g., as part of CLEF, SemEval, or GermEval. While users usually try to write more engaging comments to achieve more user interactions, the user's "dark side" [7] is the posting of hateful comments, including toxic and offensive language. There are monolingual and multilingual approaches to detect hate speech and toxic comments, since online comments can be written in different languages and possibly a mix of several. A number of different approaches can be adopted to tackle this task, and at a high level one can distinguish content-based and context-based methods [8]. We focus on content-based approaches. Rather than providing a review of this massive body of literature, we just point out that Transformer models with self-attention [9] like BERT [6], BART [10], and T5 [11] dominate the field. HateBERT [12] is a BERT variant pre-trained with abusive online community data from the social news and discussion platform Reddit (https://www.reddit.com/). A list of criteria derived from tweets, used as predictive features, can help to identify racist and sexist insults [13]. These resources are all Twitter-based in the HSD domain, similar to some of the datasets we use for our approach. A survey of datasets on the topic of fake news detection and fact verification includes several fact-checking sites [14]. Full Fact (https://fullfact.org/) is an example of a fact-checking organisation that aims at identifying harmful content with intelligence and monitoring tools, e.g., CrowdTangle, which helps the user with manual fact-claim checking by raising alerts if exact user-defined keywords are triggered [15].

Multilinguality is commonly addressed by using transformer models trained on multiple languages; e.g., XLM-RoBERTa [16] has cross-lingual capabilities and can be applied to tasks and benchmarks containing harmful texts written in multiple languages. Fusion strategies with mBERT and XLM-RoBERTa for multilingual toxic text detection [17] or deep learning ensembles for effective hate speech detection are other approaches that are similar to ours.
Given that multilingual models still focus on a limited number of languages for pre-training, we explore automatic machine translation as an alternative. Machine translation is an essential part of many online services nowadays. In a survey by CSA Research, 76% of online shoppers prefer information in their native language, and 40% would never buy from websites offered only in other languages [18]. Also, the global machine translation market has increased from 450 million USD in 2017 to 1.1 billion USD in 2022 [19, 20]. The two most popular translation services are Google Translate [21] and Microsoft Translator [22]. Both services are available in multiple languages and can be used for free. DeepL Translator is a relatively new translation service that uses proprietary neural networks to translate text [23]. They claim to surpass Google Translate and Microsoft Translator in terms of quality and speed in several European languages [24]. These services are, however, not always accurate and can even be exploited by malicious actors [25, 26].

Since transformer-based models and automatic translation services are limited in their input length, summarization is our approach to overcome this limitation. Summarization is still an active research field that successfully utilizes both extractive machine learning [27, 28] and abstractive [29, 30, 31] approaches. It has only recently been considered in this context, with state-of-the-art performance reported for fake news detection on a common reference benchmark collection [32]. We propose to utilize progress in the field by providing a novel combination of established techniques, leaving plenty of room for future work to explore this idea further.

In summary, we observe that misinformation and toxic content detection are conceptually related areas that remain open problems despite the progress that has been made in recent years. We are interested in exploring content-based ideas that have shown promise in previous work and see our contribution as one possible direction that utilises different types of automatic text summarisation as well as machine translation. Future work can then explore this in more depth and breadth.

3. Methodology

Our general framework is a pipeline-based architecture, as illustrated in Figure 1. It has three main components: automatic machine translation, summarization, and classification. Due to the availability of common benchmarks in German, it is our language of choice for the source documents (but we obviously envisage this approach to be applied to actually under-resourced languages in future work). Each dataset gets machine-translated into English, which is followed by transformer-based text summarization. German-based models take the original texts/comments and summarized texts as the input in the fine-tuning process, while domain-specific and multilingual models use the translations. We train our models 5 times (runs), each with a different seed; within each run, the checkpoint with the highest macro-F1 score (evaluated every 50 steps) is chosen, and inference outputs the predictions of each model. After that, all 5 runs get ensembled with both majority voting types (hard and soft voting). Finally, the ensembled models are ensembled again.

Figure 1: General Architecture
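As a concrete illustration of the voting step, the following minimal sketch (our simplified assumption of the logic, not the released implementation) applies hard and soft majority voting to the probability outputs of the five runs:

```python
import numpy as np

# Assumed shape: probs_per_run is (n_runs, n_samples, n_classes),
# e.g. the softmax outputs of the 5 fine-tuning runs on the test set.
def soft_vote(probs_per_run: np.ndarray) -> np.ndarray:
    """Soft voting: average the class probabilities over runs, then take the argmax."""
    return probs_per_run.mean(axis=0).argmax(axis=-1)

def hard_vote(probs_per_run: np.ndarray) -> np.ndarray:
    """Hard voting: each run casts one vote (its argmax); the most frequent label wins.
    Ties are resolved in favour of the lower class index."""
    votes = probs_per_run.argmax(axis=-1)                  # (n_runs, n_samples)
    n_classes = probs_per_run.shape[-1]
    counts = np.apply_along_axis(
        lambda v: np.bincount(v, minlength=n_classes), 0, votes
    )                                                      # (n_classes, n_samples)
    return counts.argmax(axis=0)

# Toy example: 5 runs, 3 samples, 2 classes (random probabilities for illustration only)
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(2), size=(5, 3))             # shape (5, 3, 2)
print(soft_vote(probs), hard_vote(probs))
```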
Finding the optimal number of models for the ensemble is difficult because of the danger of overfitting and averaging. Thus, we set a fixed number of 5 runs.

German tasks and resources for training are sparse since English is the most represented language in many benchmarks (~1000 tasks). German (~30 tasks) is even less represented than other languages like Spanish (~60 tasks), Hindi (~45 tasks), and even Bengali (~35 tasks) [33]. Thus, we have decided on bilingual German and English datasets so that the approach can later be applied to other languages.

For fine-tuning, we use the recommended macro F1, i.e., the arithmetic mean of the per-class F1 scores (each itself a harmonic mean of precision and recall; see Equation 1), due to its robustness towards the error type distribution [34]:

$$\mathcal{F}_1 = \frac{1}{n}\sum_{x} \mathrm{F1}_x = \frac{1}{n}\sum_{x} \frac{2 P_x R_x}{P_x + R_x} \quad (1)$$

As the GermEval metric, we use the harmonic mean of the arithmetically averaged precision and recall (see Equation 2):

$$\overline{\mathrm{F1}} = H(\bar{P}, \bar{R}) = \frac{2\bar{P}\bar{R}}{\bar{P}+\bar{R}} = 2\,\frac{\left(\frac{1}{n}\sum_{x} P_x\right)\left(\frac{1}{n}\sum_{x} R_x\right)}{\frac{1}{n}\sum_{x} P_x + \frac{1}{n}\sum_{x} R_x} \quad (2)$$
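To make the difference between the two variants concrete, here is a small illustrative computation of both scores from per-class precision and recall (our own sketch, not the official shared-task evaluation code):

```python
import numpy as np

def per_class_precision_recall(y_true, y_pred, n_classes):
    """Per-class precision P_x and recall R_x from integer label arrays."""
    P, R = np.zeros(n_classes), np.zeros(n_classes)
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        P[c] = tp / (tp + fp) if tp + fp else 0.0
        R[c] = tp / (tp + fn) if tp + fn else 0.0
    return P, R

def macro_f1_eq1(P, R):
    """Equation 1: arithmetic mean of the per-class F1 scores."""
    f1 = np.where(P + R > 0, 2 * P * R / (P + R), 0.0)
    return f1.mean()

def macro_f1_eq2(P, R):
    """Equation 2 (GermEval metric): harmonic mean of averaged precision and recall."""
    p_bar, r_bar = P.mean(), R.mean()
    return 2 * p_bar * r_bar / (p_bar + r_bar)

# Toy predictions for a binary task
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
P, R = per_class_precision_recall(y_true, y_pred, n_classes=2)
print(macro_f1_eq1(P, R), macro_f1_eq2(P, R))
```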
4. Experiments

Here we briefly outline the experimental setup, choice of tools, and datasets.

4.1. Data

We use the following shared task datasets (there is clearly scope to explore other, less-resourced languages and classification tasks in future work):

• GermEval 2018 Subtask 1 [35]
• GermEval 2019 Task 2 Subtask 1 [36]
• GermEval 2021 Subtasks 1-3 [37]
• CLEF 2022 CheckThat! Lab Task 3 [38]

Table 1: GermEval Dataset Sizes
Dataset    Label   GermEval '21 Subtask 1   GermEval '21 Subtask 2   GermEval '21 Subtask 3   GermEval '18 Subtask 1   GermEval '19 T2 Subtask 1
Training   True    1122                     865                      1103                     1688                     1287
           False   2122                     2379                     2141                     3321                     2707
Test       True    350                      253                      314                      1202                     970
           False   594                      691                      630                      2330                     2061

Table 2: CLEF CheckThat! 2022 Dataset Sizes
Label             Training Set (Subtask 3)   Development Set (Subtask 3)   Test Set (Subtask 3A)   Test Set (Subtask 3B)
True              142                        69                            210                     243
False             465                        113                           315                     191
Partially False   217                        141                           56                      97
Other             76                         41                            31                      55

Table 3: Character Length Distribution for CheckThat! 2022
Variable   Statistic   Training Set   Development Set   Test Set 3A   Test Set 3B
Title      Median      70             66                73            67
           Mean        286            171               78            71
           Minimum     3              3                 11            3
           Maximum     9960           8092              200           234
Text       Median      3035           3115              3655          4009
           Mean        4167           4498              6052          5617
           Minimum     18             25                289           507
           Maximum     32767          44359             100000        45309

All datasets contain an imbalance in their class label distributions (see Table 1), and the numbers of characters also differ widely. The GermEval datasets fit the short text scenario since they consist of comments from social networks, and the CheckThat! dataset fits the long text scenario with a maximum size of 100,000 characters (see Table 3). Since BERT models usually have a maximum token limit of 512, the input would automatically be truncated to the first 512 tokens, and thus relevant information might get lost in this process.

4.2. Automatic Machine Translation

For machine translation, we can choose between a text generation model like T5 [39] and commercial translation services. Since the translation quality depends on the quantity and quality of the pre-training corpora, we decided on the two most popular machine translation services: Google Translate and DeepL Translator.

Table 4: Hyperparameters
                        GermEval                            CLEF CheckThat! 2022
Hyperparameters         BERT-based           T5-based       BERT-based   T5-based
Learning rate           5e-6 / 2e-5 / 1e-5   1e-3           1e-5         4e-5
Max Steps               705                  —              705          —
Max Epochs              —                    200            —            200
Evaluation Steps        50                   2000           50           2000
Early Stopping          no                   yes            no           yes
Batch Size              32                   2–32           32           4
Max Sequence Length     128 (GermEval 2021), 150 (GermEval 2018, 2019 Task 2)   256
(The GermEval BERT-based column covers GBERT, GELECTRA, XLM-R, and BERTweet.)

4.3. Splitting Methods

There are two standard options for deciding how to split our data for the fine-tuning process: fixed random seeds and k-fold cross-validation. In [40], random seeding was used, and since we make five runs on an imbalanced dataset, stratified 5-fold cross-validation was the other option. Our results show that using fixed random seed values is better than using stratified k-fold cross-validation.

4.4. Hyperparameter Optimization

Since searching for the optimal hyperparameters for our models is difficult, especially when looking for ways to avoid overfitting, we use the Optuna [41] library, which can be integrated into the Hugging Face Trainer as an option for hyperparameter search. Since even the default hyperparameters can lead to overfitting on specific benchmark datasets, there is a chance of having similar data points between the development and test set. We tested 100 combinations, which evaluates the best possible setting for our macro-F1 metric (see Equation 1). The question is whether a fully automatic hyperparameter search conducted by a tool like Optuna can work effectively without manually looking for any working hyperparameters. After choosing the three best runs, the results show that it is an appropriate way to find parameters for the development set but not for the test set. We have not tested this on a fixed set of known default hyperparameters yet. Since hyperparameter optimization is also a very time-consuming process, we have decided to use each model's recommended hyperparameters if reported in the corresponding papers. If not, we use the exact parameters of their model architecture.

4.5. Summarization

We use both extractive and abstractive summarization, separately and exclusively. Since we only have one textual input, we first concatenate the title and the text with a dot so that the title is considered the first sentence (see Equation 3). Sometimes, the title is written like clickbait, a sentence without any information-relevant value.

$$\mathit{title} + \text{"."} + \mathit{text} \quad (3)$$

Extraction-based summarization aims to select the most relevant representations of the given text input. In the library we use [42], we apply k-means clustering and use the Elbow method to find the optimal k [43, 44]. Our chosen model is DistilBART-CNN-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6), which is based on BART [10] with distillation [45] and fine-tuned on the CNN/DailyMail dataset [46].

Abstraction-based summarization aims to generate shorter text with the most relevant representations of the given text input. We use the version of T5 [11] with three billion parameters (T5-3B) to generate shorter text with the identical prompt template ("summarize:") used in the pre-training process for the CNN/DailyMail dataset [46]. With the use of relative positional embeddings, the utilization of much longer text is possible at the cost of higher compute consumption [47, 11].
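As an illustration of how the two summarization flavours are invoked, here is a minimal sketch with the BERT Extractive Summarizer library and the Hugging Face Transformers API; the default BERT backbone of the extractive summarizer and the small t5-small checkpoint are stand-ins for the actual models used (DistilBART-CNN-12-6 and T5-3B), and the example text is made up:

```python
from summarizer import Summarizer                          # bert-extractive-summarizer [42]
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Title prepended as the first sentence (Equation 3); the text itself is a placeholder.
article = "A made-up headline. " + "First body sentence. Second body sentence. Third body sentence."

# Extractive: embed sentences, cluster them (k-means), return representative sentences.
extractive_model = Summarizer()                             # default BERT backbone (assumption)
extractive_summary = extractive_model(article, num_sentences=2)

# Abstractive: T5 with the "summarize:" prompt used in its pre-training.
tokenizer = T5Tokenizer.from_pretrained("t5-small")         # t5-small stands in for T5-3B
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
inputs = tokenizer("summarize: " + article, return_tensors="pt", truncation=True)
output_ids = t5.generate(inputs.input_ids, max_length=60, num_beams=4)
abstractive_summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(extractive_summary)
print(abstractive_summary)
```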
4.6. Classification Tasks

Longer texts can contain more information; therefore, we often need more labels to classify them, and we thus distinguish two different classification types: binary classification for two labels, and multi-class classification for more than two class labels. At the end of the pipeline, we ensemble the results of the summarization and classification tasks to get the final result.

4.6.1. Binary Classification

If the text is short, possibly containing a single sentence (as is often the case in social media), the labels might be "true" and "false" or "toxic" and "non-toxic". However, other labels might be used (such as "other"), turning the task into a multi-class classification. The chosen GermEval datasets have two labels; thus, only the preceding machine translation step is needed for fine-tuning. We decided on BERT [6] as our English-based model, GBERT and GELECTRA [40] as our German-based models, XLM-RoBERTa [16] as our multilingual model with both German and English input, and BERTweet [48] as our Twitter-based model. After five runs, they get ensembled together with hard and soft majority voting (see Table 8). Then again, we choose the best five model ensembles (in GermEval 2018 and 2019, the best three) and ensemble them with three different ensembling strategies: Majority Voting (both hard and soft voting), Gradient Boosting Machines, and Logistic Regression (see Table 8).

4.6.2. Multi-Class Classification

Long texts typically contain more sentences and possibly a broader spread of topics. This leads to classification tasks that go beyond a simple binary decision (e.g., one might consider "partially false" or "partially true"). The CheckThat! 2022 dataset has four different class labels with imbalanced distributions. For the classification process, we use three large models: BERT Uncased [6], XLM-RoBERTa [16], and T5-3B [11].

4.7. General Setup

All of our experiments are conducted on the following datasets with the following GPUs: the GermEval 2018 and 2019 datasets as well as the GermEval 2021 base models with a GTX/RTX 1080/2080 Ti (11 GB VRAM), the large models on the GermEval 2021 datasets with a Tesla V100S (32 GB VRAM), and the CheckThat! 2022 datasets with an RTX A6000 (48 GB VRAM). We use the SimpleTransformers library (https://simpletransformers.ai/) for the T5 model, and the Hugging Face Transformers library (https://huggingface.co/transformers) for all other transformer models. For the summarization task, we use the BERT Extractive Summarizer library (https://github.com/dmmiller612/bert-extractive-summarizer) [42], and for machine translation, we use the deep-translator library (https://github.com/nidhaloff/deep-translator) in combination with the free public Google Translate service (https://translate.google.com/) and the pro version of the DeepL Translator service (https://www.deepl.com/en/pro#developer). Our hyperparameters are in Table 4.

5. Results

We observe that our approach is highly competitive and robust for both types of classification and all datasets.

5.1. Machine Translation

We would first like to report some insights into the choice of machine translation tools. The results show that for this experiment DeepL Translator appears to be a better choice than Google Translate, but the score difference is very small, so both are solid choices (see Table 5).

Table 5: Machine Translation Performance
Translation Service   Run 1   Run 2   Run 3   Run 4   Run 5   Hard    Soft
Google Translate      69.48   67.08   67.67   68.74   68.28   68.39   68.42
DeepL Translator      70.01   70.22   69.24   68.26   67.67   70.13   70.09
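Both services are accessed through the deep-translator library listed in Section 4.7. The snippet below is a minimal sketch of that step; the example text, the 5,000-character cut-off for the free Google backend, and the DeepL key handling are illustrative assumptions rather than our exact production settings:

```python
from deep_translator import DeeplTranslator, GoogleTranslator

german_text = "Ein langer deutscher Nachrichtenartikel ..."   # placeholder input

# Free Google Translate backend; the service only accepts around 5,000 characters
# per request, so longer CheckThat! articles are cut off before translation.
english_google = GoogleTranslator(source="de", target="en").translate(german_text[:5000])

# DeepL backend (pro API key required), as used for the GermEval comments (assumed call):
# english_deepl = DeeplTranslator(api_key="YOUR_API_KEY", source="de", target="en",
#                                 use_free_api=False).translate(german_text)

print(english_google)
```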
For the GermEval datasets, we apply the machine translations of the DeepL Translator service. For the CheckThat! 2022 dataset, since a single text can be up to 100,000 characters long, we use the free Google Translate service due to financial constraints. Since the service has an internal character limit, we only take the first 5,000 characters for translation.

5.2. Hyperparameter Optimization

Table 6 shows that the difference between runs with and without hyperparameter search is marginal and even negative on the test set. Tuning towards the development set leads to worse results on the test set, which is especially visible in the third run. While this indicates overfitting and a general preference for generalization, the generally worse results on the development set with hyperparameter tuning cannot be explained by overfitting alone; there are too many other factors to consider. Also, the long search duration makes this step insignificant and redundant. That is why we continued with the default hyperparameters.

Table 6: Random Hyperparameter Tuning with Optuna
Run   Dataset           With Hyperparameter Tuning   Without Hyperparameter Tuning
1     Development Set   70.49                        70.57
      Test Set          67.61                        67.67
2     Development Set   70.44                        70.43
      Test Set          68.15                        70.22
3     Development Set   69.81                        69.81
      Test Set          67.25                        68.26

5.3. Splitting Methods

As shown in Table 7, the difference after majority voting is minor, and thus both strategies are eligible. If we look at each run, the difference is also very narrow. Thus, the choice of splitting strategy is not essential and not a deciding factor in the system architecture. We decide to continue with random seeding.

Table 7: Splitting Strategy
Splitting Strategy                   Run 1   Run 2   Run 3   Run 4   Run 5   Hard    Soft
Stratified K-Fold Cross-Validation   68.78   67.79   68.21   69.62   67.29   68.99   69.59
Random Seed                          70.01   70.22   69.24   68.26   67.67   70.13   70.09

5.4. Binary Classification and Ensembling

For all GermEval datasets, we observe the potential for improvement over previously reported SOTA results (see Table 8). For GermEval 2021 Subtask 1, the score improvement is a noticeable 4.48% compared to the highest score reported so far. Except for GermEval 2021 Subtask 2, where all results are more or less on par (which might in part be an issue with the gold standard labels), all other results demonstrate the added value our approach offers. Of all the ensembling strategies, the popular majority voting is still the most effective one.
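The three ensembling strategies can be sketched as follows; the arrays below are random stand-ins for the base ensembles' positive-class probabilities, purely for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dev_probs = rng.random((400, 5))        # positive-class probability of 5 base ensembles (dev set)
dev_labels = rng.integers(0, 2, 400)    # gold labels of the dev set
test_probs = rng.random((100, 5))       # same base ensembles on the test set

# Learned combinations of the base predictions (trained on the dev set) ...
lr = LogisticRegression(max_iter=1000).fit(dev_probs, dev_labels)
gbm = GradientBoostingClassifier().fit(dev_probs, dev_labels)

# ... versus plain majority voting over the base ensembles' hard labels.
majority = ((test_probs > 0.5).sum(axis=1) >= 3).astype(int)

print(lr.predict(test_probs)[:5], gbm.predict(test_probs)[:5], majority[:5])
```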
Since Gradient Boosting Machines and Logistic Regression learn how to combine the base models' predictions, we expected this learned combination to be more effective than majority voting. However, the results show that majority voting is still the best strategy.

Table 8: Binary Classification on GermEval datasets (new best performance in bold)
                      GermEval '18 S1   GermEval '19 T2 S1   GermEval '21 S1   GermEval '21 S2   GermEval '21 S3
Model                 Hard     Soft     Hard     Soft        Hard     Soft     Hard     Soft     Hard     Soft
GBERT base            76.28    75.91    76.50    76.64       68.11    67.84    68.22    68.30    74.23    74.78
GELECTRA base         75.45    75.37    75.15    74.92       69.38    69.68    67.96    67.60    76.52    77.11
BERTweet base         78.02    78.05    77.23    77.44       70.13    70.09    68.23    68.84    75.51    75.47
BERT base             77.23    77.17    76.63    76.55       64.68    64.71    68.89    69.39    73.36    72.71
XLM-R base (de)       75.71    76.00    75.51    75.17       67.37    67.21    68.49    67.90    73.84    74.26
XLM-R base (en)       76.67    77.04    77.35    77.11       68.24    68.20    69.11    69.72    74.35    74.61
GBERT large           80.74    80.63    80.06    80.23       72.09    72.69    69.45    68.89    75.77    76.10
GELECTRA large        80.06    79.85    80.80    80.79       71.62    71.72    70.16    70.24    75.06    74.26
BERTweet large        79.97    79.86    79.56    79.86       73.60    72.24    69.82    70.36    75.14    75.48
BERT large            78.34    78.32    77.79    77.79       67.00    65.26    69.86    69.47    74.58    75.07
XLM-R large (de)      -        -        -        -           69.04    69.12    69.51    68.60    76.36    76.82
XLM-R large (en)      -        -        -        -           71.71    71.48    68.77    69.99    76.54    77.44
Ensemble              Hard     Soft     Hard     Soft        Hard     Soft     Hard     Soft     Hard     Soft
Gradient Boosting     79.97    80.95    80.28    81.77       76.23    74.03    68.25    69.47    75.82    76.12
Logistic Regression   79.97    80.91    81.14    81.52       74.11    75.09    68.56    69.65    75.61    74.03
Majority Voting       80.99    81.48    82.06    82.36       75.22    74.72    69.22    70.09    77.82    76.89
SOTA                  80.70 [40]        76.95 [49]           71.75 [50]        69.98 [51]        76.26 [50]

5.5. Summarization and Multi-Class Classification

As shown in Table 9, the combination of summarization and classification leads to noticeable improvements (e.g., 5.63% for Task 3A). Unlike in the previous experiments, here we do not apply ensembling, which could lead to further improvements in robustness and overall results. An interesting observation here is the discrepancy between the development set and the test set results.

Table 9: Multi-Class Classification on CheckThat! 2022 dataset (new SOTA in bold)
Summarization Model   Classification Model   Run Nr.   Dev     Dev-Test   Test 3A   Test 3B
DistilBART-CNN-12-6   BERT large             1         52.40   52.18      28.33     28.99
(extractive)                                 2         46.43   39.96      26.87     19.46
                                             3         48.77   52.78      30.70     28.69
                                             4         49.21   48.44      32.31     25.32
                                             5         53.25   51.85      30.19     20.46
                      XLM-R large            1         50.53   41.04      30.42     27.40
                                             2         50.93   44.54      33.11     28.01
                                             3         49.08   48.56      30.82     26.09
                                             4         50.80   43.99      28.23     21.94
                                             5         50.95   40.29      32.47     23.34
                      T5-3B                  1         48.05   46.52      39.54     29.58
T5-3B                 BERT large             1         56.33   51.15      28.89     21.34
(abstractive)                                2         45.85   37.87      32.88     23.43
                                             3         55.08   46.80      35.24     28.33
                                             4         52.15   47.08      36.48     27.01
                                             5         51.32   46.91      30.56     21.77
                      XLM-R large            1         51.54   44.81      31.66     28.99
                                             2         49.36   42.84      35.63     30.06
                                             3         49.73   44.91      35.67     27.82
                                             4         50.59   44.79      36.01     26.86
                                             5         51.78   40.25      35.29     28.09
                      T5-3B                  1         52.08   43.82      29.72     23.72
SOTA                                                                      33.91 [52] 28.99 [53]

6. Limitations and Future Work

The results show that our approach is robust and achieves state-of-the-art performance on these datasets. That offers plenty of directions for future work. However, before we can start with future work, we need to discuss the limitations of our approach. The first limitation is the fact that the hyperparameter search was random. A fixed scope of hyperparameters might have led to better results for training. Another limitation is that we have summarized every data point in the dataset. That means that even short text snippets were summarized. We do not use the DeepL Translator service for all experiments because
the free version is limited to 500,000 characters per month (https://www.deepl.com/en/pro#developer) and one data point of the CLEF CheckThat! 2022 dataset already hits 100,000 characters. Since the performance difference is very close, we decided to use the free Google Translate service for the CheckThat! 2022 dataset. We also want to warn about possible outputs caused by "model hallucination", which makes such output not yet usable for production.

As future work, the investigation of text generation with bigger models like GPT-3 [54], ChatGPT [55], PaLM [56], Flan-T5 [33], and others is interesting to see if our approach will improve by simply having more parameters and more pre-training data. Text generation tasks like machine translation or summarization would benefit from the increased accuracy of such models, which would lead towards a real-world production environment to tackle fake news and hate speech. Especially for the summarization task, we want to understand whether summarizing text snippets below 512 tokens makes a difference in performance. The performance gains from summarization also open the question of why exactly it works, which remains contentious. Another open question is what the optimal number of models for an ensemble is, where the correlation between dataset size and model diversity needs to be explored. Another important question is how each module of the pipeline, especially summarization and machine translation, works separately on a larger scale. Different benchmark datasets for each task are needed to investigate the performance of each module.

7. Conclusion

We propose a general architecture to deal with text classification in a cross-lingual context, tapping into resources available for high-resourced languages and making use of abstractive and extractive summarization. We demonstrate the potential that this approach offers using existing non-English benchmark collections for fake news and hate speech classification. This lays the groundwork for future work, which should look at a range of low-resource languages.

Acknowledgments

This work was supported by the project COURAGE: A Social Media Companion Safeguarding and Educating Students, funded by the Volkswagen Foundation, grant number 95564. We want to thank all reviewers for their insightful comments that helped us to improve our work.

References

[1] P. Nakov, D. P. A. Corney, M. Hasanain, F. Alam, T. Elsayed, A. Barrón-Cedeño, P. Papotti, S. Shaar, G. D. S. Martino, Automated fact-checking for assisting human fact-checkers, in: Z. Zhou (Ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, ijcai.org, 2021, pp. 4551–4558.
[2] Y. Wang, Y. Dai, H. Li, L. Song, Social media and attitude change: Information booming promote or resist persuasion?, Frontiers in Psychology 12 (2021).
[3] H. Fersko, Is social media bad for teens' mental health?, 2018. URL: https://www.unicef.org/stories/social-media-bad-teens-mental-health.
[4] N. Newman, R. Fletcher, C. T. Robertson, K. Eddy, R. K. Nielsen, Digital news report 2022, 2022. URL: https://reutersinstitute.politics.ox.ac.uk/sites/default/files/2022-06/Digital_News-Report_2022.pdf.
[5] Reuters Institute for the Study of Journalism, Share of adults who trust news media most of the time in selected countries worldwide as of February 2022 [Graph], 2022. URL: https://www.statista.com/statistics/308468/importance-brand-journalist-creating-trust-news/.
[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[7] G. L. de Holanda Coelho, P. H. P. Hanel, R. P. Monteiro, R. Vilar, V. V. Gouveia, The dark side of human values: How values are related to bright and dark personality traits, The Spanish Journal of Psychology 24 (2021).
[8] K. Sharma, F. Qian, H. Jiang, N. Ruchansky, M. Zhang, Y. Liu, Combating fake news: A survey on identification and mitigation techniques, ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2019) 1–42. Publisher: ACM New York, NY, USA.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., USA, 2017, pp. 6000–6010. URL: http://dl.acm.org/citation.cfm?id=3295222.3295349.
[10] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7871–7880. URL: https://aclanthology.org/2020.acl-main.703. doi:10.18653/v1/2020.acl-main.703.
[11] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[12] T. Caselli, V. Basile, J. Mitrović, M. Granitzer, HateBERT: Retraining BERT for abusive language detection in English, in: Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), Association for Computational Linguistics, Online, 2021, pp. 17–25.
[13] Z. Waseem, D. Hovy, Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter, in: Proceedings of the NAACL Student Research Workshop, Association for Computational Linguistics, San Diego, California, 2016, pp. 88–93.
[14] T. Murayama, Dataset of fake news detection and fact verification: A survey, ArXiv abs/2111.03299 (2021).
[15] P. Arnold, The challenges of online fact checking, Technical Report, Full Fact, London, UK, 2020. URL: https://fullfact.org/media/uploads/coof-2020.pdf.
[16] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451. URL: https://aclanthology.org/2020.acl-main.747. doi:10.18653/v1/2020.acl-main.747.
[17] G. Song, D. Huang, Z. Xiao, A study of multilingual toxic text detection approaches under imbalanced sample distribution, Information (Switzerland) 12 (2021) 1–16.
[18] CSA Research, Survey of 8,709 Consumers in 29 Countries Finds that 76% Prefer Purchasing Products with Information in their Own Language, 2020. URL: https://csa-research.com/Blogs-Events/CSA-in-the-Media/Press-Releases/Consumers-Prefer-their-Own-Language.
[19] Global Market Insights Inc., Machine translation market size worldwide, from 2016 to 2024 (in million U.S. dollars) [Graph], 2017. URL: https://www.statista.com/statistics/748358/worldwide-machine-translation-market-size/.
[20] P. Wadhwani, S. Gankar, Machine translation market size: Industry analysis, 2022-2030, 2022. URL: https://www.gminsights.com/industry-analysis/machine-translation-market-size.
[21] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. R. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. S. Corrado, M. Hughes, J. Dean, Google's neural machine translation system: Bridging the gap between human and machine translation, ArXiv abs/1609.08144 (2016).
[22] M. Junczys-Dowmunt, Microsoft translator at WMT 2019: Towards large-scale document-level neural machine translation, in: Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Association for Computational Linguistics, Florence, Italy, 2019, pp. 225–233. URL: https://aclanthology.org/W19-5321. doi:10.18653/v1/W19-5321.
[23] D. Coldewey, F. Lardinois, DeepL schools other online translators with clever machine learning, 2017. URL: https://techcrunch.com/2017/08/29/deepl-schools-other-online-translators-with-clever-machine-learning/.
[24] DeepL, Translation quality, n.d. URL: https://www.deepl.com/en/quality.html.
[25] J. Fuchs, Spoofing Google Translate to Steal Credentials, 2022. URL: https://www.avanan.com/blog/spoofing-google-translate-to-steal-credentials.
[26] E. Montalbano, Cyberattackers spoof google translate in unique phishing tactic, 2022. URL: https://www.darkreading.com/threat-intelligence/cyberattackers-spoof-google-translate-unique-phishing-tactic.
[27] J. L. Neto, A. A. Freitas, C. A. A. Kaestner, Automatic text summarization using a machine learning approach, in: Brazilian Symposium on Artificial Intelligence, 2002.
[28] R. Nallapati, F. Zhai, B. Zhou, Summarunner: A recurrent neural network based sequence model for extractive summarization of documents, ArXiv abs/1611.04230 (2016).
[29] S. Chopra, M. Auli, A. M. Rush, Abstractive sentence summarization with attentive recurrent neural networks, in: North American Chapter of the Association for Computational Linguistics, 2016.
[30] R. Nallapati, B. Zhou, C. N. dos Santos, Çaglar Gülçehre, B. Xiang, Abstractive text summarization using sequence-to-sequence rnns and beyond, in: Conference on Computational Natural Language Learning, 2016.
[31] R. Paulus, C. Xiong, R. Socher, A deep reinforced model for abstractive summarization, ArXiv abs/1705.04304 (2017).
[32] P. Hartl, U. Kruschwitz, Applying automatic text summarization for fake news detection, in: Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022), 2022, pp. 2702–2713.
[33] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, D. Valter, S. Narang, G. Mishra, A. W. Yu, V. Zhao, Y. Huang, A. M. Dai, H. Yu, S. Petrov, E. H. hsin Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. Le, J. Wei, Scaling instruction-finetuned language models, ArXiv abs/2210.11416 (2022).
[34] J. Opitz, S. Burst, Macro F1 and macro F1, arXiv preprint arXiv:1911.03347 (2019).
[35] M. Wiegand, M. Siegel, J. Ruppenhofer, Overview of the GermEval 2018 shared task on the identification of offensive language, in: Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018), Vienna, Austria – September 21, 2018, Austrian Academy of Sciences, Vienna, Austria, 2019, pp. 1–10. URL: https://nbn-resolving.org/urn:nbn:de:bsz:mh39-84935.
[36] J. M. Struß, M. Siegel, J. Ruppenhofer, M. Wiegand, M. Klenner, Overview of GermEval task 2, 2019 shared task on the identification of offensive language, in: Preliminary proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019), October 9–11, 2019 at Friedrich-Alexander-Universität Erlangen-Nürnberg, German Society for Computational Linguistics & Language Technology und Friedrich-Alexander-Universität Erlangen-Nürnberg, München [u.a.], 2019, pp. 352–363. URL: https://nbn-resolving.org/urn:nbn:de:bsz:mh39-93197.
[37] J. Risch, A. Stoll, L. Wilms, M. Wiegand, Overview of the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments, in: Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments, Association for Computational Linguistics, Duesseldorf, Germany, 2021, pp. 1–12.
[38] J. Köhler, G. K. Shahi, J. M. Struß, M. Wiegand, M. Siegel, T. Mandl, M. Schütz, Overview of the CLEF-2022 CheckThat! lab task 3 on fake news detection, Working Notes of CLEF (2022).
[39] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.
[40] B. Chan, S. Schweter, T. Möller, German's next language model, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 6788–6796.
[41] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2623–2631.
[42] D. Miller, Leveraging BERT for extractive text summarization on lectures, arXiv preprint arXiv:1906.04165 (2019).
[43] J. MacQueen, et al., Some methods for classification and analysis of multivariate observations, in: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, Oakland, CA, USA, 1967, pp. 281–297.
[44] T. M. Kodinariya, P. R. Makwana, Review on determining number of cluster in k-means clustering, International Journal 1 (2013) 90–95.
[45] S. Shleifer, A. M. Rush, Pre-trained summarization distillation, ArXiv abs/2010.13002 (2020).
[46] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, P. Blunsom, Teaching machines to read and comprehend, in: C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 28, Curran Associates, Inc., 2015. URL: https://proceedings.neurips.cc/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf.
[47] P. Shaw, J. Uszkoreit, A. Vaswani, Self-attention with relative position representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 464–468. URL: https://aclanthology.org/N18-2074. doi:10.18653/v1/N18-2074.
[48] D. Q. Nguyen, T. Vu, A. T. Nguyen, BERTweet: A pre-trained language model for English Tweets, in: Q. Liu, D. Schlangen (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020, Association for Computational Linguistics, 2020, pp. 9–14.
[49] A. Paraschiv, D.-C. Cercel, UPB at GermEval-2019 task 2: BERT-based offensive language classification of German tweets, in: KONVENS, 2019.
[50] T. Bornheim, N. Grieger, S. Bialonski, FHAC at GermEval 2021: Identifying German toxic, engaging, and fact-claiming comments with ensemble learning, in: Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments, Association for Computational Linguistics, Duesseldorf, Germany, 2021, pp. 105–111.
[51] N. Hildebrandt, B. Boenninghoff, D. Orth, C. Schymura, Data science kitchen at GermEval 2021: A fine selection of hand-picked features, delivered fresh from the oven, in: Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments, Association for Computational Linguistics, Duesseldorf, Germany, 2021, pp. 88–94. URL: https://aclanthology.org/2021.germeval-1.13.
[52] B. Taboubi, M. A. B. Nessir, H. Haddad, iCompass at CheckThat! 2022: Combining deep language models for fake news detection, Working Notes of CLEF (2022).
[53] H. N. Tran, U. Kruschwitz, ur-iw-hnt at CheckThat! 2022: Cross-lingual text summarization for fake news detection, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.
[54] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[55] OpenAI, ChatGPT: Optimizing Language Models for Dialogue, 2022. URL: https://openai.com/blog/chatgpt/.
[56] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, N. Fiedel, PaLM: Scaling Language Modeling with Pathways, arXiv:2204.02311 (2022). URL: https://arxiv.org/abs/2204.02311.