<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team The Toxinators 2000 at TextDetox CLEF 2025/Multilingual Text Detoxification 2025: The Evolution of Methods for Text Detoxification: The Role of Language in Method Selection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrei Totok</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Artemiy Ermolaev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Izyumova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evgeniy Finogeev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ecom.tech</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Precision Mechanics and Optics (ITMO)</institution>
          ,
          <addr-line>Saint Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National Research Nuclear University (MEPhI)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper explores various approaches to the task of text detoxification, including the use of lexical resources (such as toxic word dictionaries), methods based on deep learning algorithms (T5, VAE, etc.), and modern large language models (LLMs). The study is conducted on data from PAN: Multilingual Text Detoxification (TextDetox) 2025, with the aim of identifying the most effective strategies for handling toxic language depending on linguistic specifics. It is shown that for low-resource languages, like Tatar, dictionary-based methods show the highest effectiveness. For widely spoken languages, such as English, deep learning methods show better quality. The results highlight the importance of considering linguistic features when selecting a detoxification method and open up possibilities for the further development of adaptive multilingual content filtering systems.</p>
      </abstract>
      <kwd-group>
<kwd>detoxification</kwd>
        <kwd>pretrained models</kwd>
        <kwd>dictionaries</kwd>
        <kwd>PAN 2025</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rapid growth of online communication platforms has led to an increasing amount of user-generated
content, which often includes toxic, aggressive, or otherwise harmful language. This phenomenon poses
a serious challenge for maintaining safe and inclusive digital environments. One of the key solutions
to this issue lies in the development of automated systems capable of detecting and neutralizing toxic
language — a task commonly referred to as text detoxification.</p>
      <p>Text detoxication involves transforming a given text in such a way that its toxic or o‌ensive elements
are removed or sofiened, while preserving the original meaning and stylistic coherence. The eld of
NLP has seen rapid development in recent years, enabling diverse approaches to address the problem of
text detoxication.</p>
      <p>
        Each of these approaches (lexical filtering, deep learning models, and LLMs) comes with its own
strengths and limitations. Lexical methods, such as filtering based on lists of toxic words, offer
simplicity and speed but often fail to account for context, sarcasm, or subtle forms of toxicity. Deep
learning models provide better contextual understanding when trained on high-quality annotated datasets[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], yet they may struggle with generalization across
domains and languages[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. LLMs, particularly those fine-tuned for generation tasks, show good results
in rewriting toxic sentences into non-toxic equivalents while maintaining fluency and intent. However,
performance can vary significantly depending on the linguistic structure and data availability for each
specific language.
      </p>
      <p>
        This paper presents a comparative analysis of different detoxification strategies across multiple
languages, emphasizing how linguistic characteristics influence the effectiveness of each method. We
argue that no single approach is universally optimal and that the choice of method should be guided
by the specific properties of the target language. This study utilizes data from PAN: Multilingual
Text Detoxification (TextDetox) 2025[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] to explore the most effective strategies for managing toxic
language, taking into account linguistic and cultural differences. We use the Joint metric from PAN,
which consists of three components: Style Transfer Accuracy, Content Preservation, and Fluency. Each
component ranges from 0 to 1, and higher values of both the individual components and the final score
indicate better performance.
      </p>
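      <p>To make the scoring concrete, the sketch below aggregates the three components into a per-sample product and averages the result over the evaluation set, which matches how the Joint score was computed in earlier TextDetox editions; the component values shown are illustrative, not taken from our runs.</p>
      <preformat>
def joint_score(sta, content, fluency):
    """Per-sample Joint: product of Style Transfer Accuracy,
    Content Preservation and Fluency, each in [0, 1]."""
    return sta * content * fluency

def corpus_joint(samples):
    """Average the per-sample Joint scores over the evaluation set."""
    return sum(joint_score(*s) for s in samples) / len(samples)

# Two illustrative samples as (STA, Content, Fluency) triples.
print(corpus_joint([(0.9, 0.8, 0.95), (1.0, 0.7, 0.9)]))  # 0.657
      </preformat>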
    </sec>
    <sec id="sec-2">
      <title>2. The formation of toxic words in different languages</title>
      <p>In modern linguistics, toxic words are commonly referred to as invective vocabulary. The invective
lexicon of a language is enriched through several sources: reinterpretation of existing literary units,
borrowings from foreign languages, and slang. Importantly, invective vocabulary differs from slang in
terms of stability: while slang undergoes frequent changes and updates, the core set of invectives tends
to remain relatively stable over long periods.</p>
      <p>
        As an example, consider American English, where a significant source of lexical borrowings is
African-American vernacular, including such expressions as "motherf*cker" or "of*y". However, invectives should
be distinguished from colloquial speech, which may appear similar but lacks the same emotionally
charged connotation. Colloquial words such as "sh*t" or "a*s", although considered vulgar in formal
contexts, can function as neutral terms in informal speech — especially within certain social groups[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Thus, the meaning and communicative function of the same word depend on the speaker’s social
status and context, leading to an overlap between invective and colloquial vocabulary. This process
facilitates lexical exchange between these two linguistic layers.</p>
      <p>The formation of invective vocabulary can be classified into several key directions. First, many
invectives originate from colloquial, slang, or marginal vocabulary often used to express social differences
and conflicts. Second, a major contribution to the development of invectives comes from pejoration —
the semantic re-evaluation of literary and conversational words, during which originally neutral or
positive meanings transform into negative ones.</p>
      <p>Linguist I.I. Kremikh identifies several types of pejoration, among which metaphorical pejoration is
considered the most productive. It is based on the transfer of negative meaning to an object through
similarity or association. Examples include the Spanish words "globo" ("balloon"), used metaphorically
to refer to a fat person, and "buzón de correos" ("mailbox"), used to describe someone with a large
mouth.</p>
      <p>Of particular importance in invective vocabulary are zoomorphisms — metaphors in which people
are compared to animals based on stereotypical associations. For instance, the Spanish words "cerdo"
and "cochino" (both meaning "pig") or the English "pig" and "dirty dog" are used to describe messy or
rude individuals. Similarly, repurposed names of fruits and vegetables, such as "melón" ("melon") and
"carrot" (used derogatorily for red-haired people), illustrate how stable phonetic forms can undergo
substantial semantic transformation and shift in communicative function.</p>
      <p>
        An interesting phenomenon is the positive use of invectives. In some cases, these expressions are
used to convey admiration or praise, reflecting their deep integration into everyday speech and the
wide range of emotions they can convey. For instance, in the Tatar language, the word "эт" ("dog"),
which functions as an invective, appears in the phrase "Ах, эт, эшне ничек оста хәл иткэн!" ("Wow,
what a masterful way to solve the problem!"), where it serves as a compliment to one’s skills. Such
expressive usage of invectives is also widely employed in literature to reflect natural, vivid speech[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Ethnolinguistic factors play a significant role in the formation and functioning of invective vocabulary,
especially evident when comparing related or contact languages. The Tatar language, with its rich
historical and cultural layer, contains numerous invectives whose semantics are closely tied to collective
memory and religious identity. A comparative analysis of Tatar invectives with corresponding units in
Bashkir, Chuvash, Udmurt, Mari, and Turkish — a fellow Turkic language — reveals consistent patterns
in the development of offensive vocabulary and its national-cultural characteristics.</p>
      <p>For example, in Tatar, the lexemes "таре" ("Christian cross", "damned") and "чукынган" ("christened",
and in an invective sense — "damned") emerged under the influence of the Christianization of Volga
Tatars in the 16th–18th centuries. Originally neutral or positively connotated, these words transformed
into strong invectives with clearly negative semantics within the Muslim community. The invective
"чукынган" demonstrates high derivational productivity, generating forms such as "чукынчык"
("scoundrel"), "чукынып кит" ("disappear!", "go to hell!"), "чукынды" ("all is lost"), and "чукынгыры"
("may he perish"), which can serve as insults, curses, or exclamatory interjections.</p>
      <p>It is important to note that the strength of the invective tone of these words depends on the religious
and cultural identity of the addressee: for Muslims, they carry a clearly negative nuance, while for
Christians, they retain a neutral or even positive meaning. Similar semantic shifts are observed in
Russian as well (e.g., "нехристь" ("heathen") and "креста на тебе нет" ("no cross, no crown"), originally
used to refer to people of different faiths).</p>
      <p>Thus, invective vocabulary serves not only as a means of expressing insult but also as a marker of
deep cultural-historical layers and boundaries of linguistic consciousness, reflecting complex processes
of identity interaction within society.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Lexical methods</title>
      <p>One of the simplest and fastest ways to reduce the level of toxicity in text is the dictionary-based
approach, which involves replacing potentially toxic words with neutral equivalents or completely
removing offensive terms. The replacement can be either full (replacing the entire word) or partial
(applied when a toxic word appears as part of a longer lexical unit).</p>
      <p>
        We use a Multilingual Toxic Lexicon [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] which consists of toxic words and phrases for each language
from the current task. Three types of processing were considered in this section:
• copy base — the text remains unchanged;
• replace — toxic words or their parts are removed or replaced;
• delete — toxic words are removed without replacement.
      </p>
      <p>The copy base method left the toxic text unchanged; in other words, we
looked at the metrics of unmodified sentences. The replace method involved removing the toxic component
from a word. For example, if the dictionary includes the word "f*ck" but "f*cker" is not in the toxic
dictionary, only the suffix "er" would remain after processing.</p>
      <p>The delete method consisted of removing an entire word if it contained any toxic substring. For instance,
if the toxic dictionary included the word "f*ck" but "f*cker" was not included, the full word "f*cker"
would be deleted.</p>
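      <p>A minimal sketch of the three processing modes is given below; it assumes the lexicon has been loaded as a plain set of lowercase entries (the single entry shown is a placeholder).</p>
      <preformat>
TOXIC = {"f*ck"}  # placeholder; in practice, loaded from the lexicon [7]

def detox(text, mode="delete"):
    out = []
    for word in text.split():
        hit = next((t for t in TOXIC if t in word.lower()), None)
        if hit is None or mode == "copy":
            out.append(word)           # copy base: keep the text unchanged
        elif mode == "replace":
            remainder = word.lower().replace(hit, "")
            if remainder:              # "f*cker" keeps only the suffix "er"
                out.append(remainder)
        # mode == "delete": drop any word containing a toxic substring
    return " ".join(out)
      </preformat>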
      <p>Results of the lexical methods are presented in Table 1. The values in the table are given for the Joint metric.</p>
      <p>However, in some languages, such as Ukrainian and Japanese, the replace method demonstrated
significantly better metrics. This effect may be attributed to specific contextual features that inadvertently
amplify the negative tone of the text during word replacement. The delete method showed significantly
better results for Arabic. Therefore, when applying automated detoxification methods, it
is essential to take into account the linguistic structure and contextual usage of words.</p>
      <p>The highest level of toxicity was observed when using the copy base method, where the input text
remained unchanged. Although this approach preserves the original meaning of the message, it does
not address the task of reducing toxicity and should only be used as a baseline for comparison.</p>
      <p>The purely dictionary-based approach has several limitations, including an inability to account for
context, the need for regular dictionary updates, and the risk of false positives. Hence, a promising
direction for future work appears to be the combination of lexical filtering with text rewriting models,
which can not only eliminate toxic elements but also restore textual structure and preserve semantic
meaning.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Deep learning approaches</title>
      <p>
        A more recent solution in the field of text detoxification involves approaches based on deep learning
models, such as T5 (Text-To-Text Transfer Transformer) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and its variations. These models are capable
not only of removing or replacing toxic words but also of rewriting the original text while preserving
its semantic structure and stylistic features. Thanks to their training on parallel corpora — where each
toxic sentence has a corresponding non-toxic version — these models can generate more natural and
contextually appropriate outputs.
      </p>
      <p>Such models demonstrate strong performance in neutralizing offensive language without significantly
altering the meaning of the original message. Their effectiveness is especially evident when they are
fine-tuned on high-quality detoxification datasets. A more detailed discussion of these experiments will
follow in the section on large language models (LLMs).</p>
      <p>
        For the current set of experiments, we used the HRQ-VAE [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] model, which has shown promising
results in text paraphrasing tasks. The model is based on the Variational Autoencoder (VAE) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
architecture and is designed for soft style transfer, including reducing the level of toxicity in text.
      </p>
      <p>
        The experiments were conducted on the English language using both the pretrained version of the
model and several versions fine-tuned on the Paradetox [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] parallel dataset. We applied the models
trained on three different datasets: MSCOCO [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Paralex [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and QQP [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], evaluating their ability to
reduce toxicity in input sentences. We also explored how different training strategies affect performance,
depending on whether toxic phrases are included in or excluded from paraphrase clusters. A paraphrase cluster is a
group containing sentences that are similar in meaning. For example, the sentences "How do I get to
the nearest metro station?", "Can you tell me how to get to the subway?", and "Where is the closest
metro station?" have similar meanings, so they are placed in the same cluster.
      </p>
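      <p>For illustration, such paraphrase clusters can be approximated by grouping sentences whose embedding similarity exceeds a threshold. The sketch below uses a generic sentence encoder; the model name and threshold are assumptions for the example, not part of the HRQ-VAE training setup itself.</p>
      <preformat>
from sentence_transformers import SentenceTransformer, util

sentences = [
    "How do I get to the nearest metro station?",
    "Can you tell me how to get to the subway?",
    "Where is the closest metro station?",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences, convert_to_tensor=True)
# Pairwise cosine similarities; pairs above a chosen threshold
# (e.g. 0.7) would be assigned to the same paraphrase cluster.
print(util.cos_sim(embeddings, embeddings))
      </preformat>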
      <p>Results of the different HRQ-VAE models for English are presented in Table 2. The values in the table are
given for the Joint metric.</p>
      <p>Our ndings show that the model’s performance depends heavily on the quality of the training
data and the strategy used to form paraphrase clusters. In some congurations, the model was able to
signicantly reduce toxicity levels, while in others, it failed to fully eliminate harmful content or even
preserved negative connotations.</p>
      <p>In summary, the HRQ-VAE model demonstrates potential for use in text detoxification, particularly
when fine-tuned on carefully curated parallel datasets. However, like other methods, it requires careful
tuning and contextual awareness to avoid unintended preservation or amplification of toxic tone.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Large language models</title>
      <p>A modern stage in the development of text models is represented by large language models (LLMs) such
as ChatGPT [16], Llama3 [17] and others. These models possess significant capacity for understanding
and generating natural language, making them promising candidates for solving complex tasks, including
text detoxification. Unlike traditional approaches based on rigid rules or simple neural architectures,
LLMs are capable not only of removing or replacing potentially toxic words, but also of rewriting
phrases while preserving the original meaning, style, and logical coherence.</p>
      <p>One of the key advantages of LLMs is their ability to interpret user instructions through prompts,
which enables flexible control over the generation process. For example, prompts such as "Make this
more neutral" or "Paraphrase this sentence" allow the model to understand that the tone of the text needs
to be adjusted without sacrificing its informational content. This makes LLMs particularly suitable for
use in real-world content filtering systems, where both the reduction of toxicity and the maintenance
of communication quality are essential.</p>
      <p>In the course of this study, several versions of the Flan-T5 [18] model (small, large) were tested.
These models were initially trained on the Paradetox dataset and further fine-tuned on a mixed dataset
that included Paradetox, Parallel Detoxification Dataset small [19], Paranmt-for-detox [20], and
Filtered_paranmt [21]. The training was conducted using different prompts, such as "Make more neutral"
and "Detoxify", which allowed us to assess the impact of prompt formulation on the quality of text
rewriting. We also evaluated the mT0 [22] model from the TextDetox 2024 Multilingual Text Detoxification task.</p>
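      <p>Inference with these prompt-based checkpoints reduces to prefixing the input with the instruction. A minimal sketch with the public Flan-T5 checkpoint follows; the prompt is one of the two variants we compared, and the input sentence is a placeholder.</p>
      <preformat>
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# Prompt-prefixed input; "Make more neutral" is one of the tested prompts.
inputs = tok("Make more neutral: this is a toxic sentence",
             return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
      </preformat>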
      <p>Results of the different prompts are presented in Table 3. The values in the table are given for the Joint
metric.</p>
      <p>The results demonstrated that the Flan-T5-large variant outperformed smaller versions in terms
of detoxification performance. Specifically, after 4696 training steps, the joint metric reached 0.70497
when using the "Make more neutral" prompt, compared to 0.68628 for Flan-T5-small. This indicates that
increasing the number of model parameters has a positive effect on its ability to reduce textual toxicity.</p>
      <p>It is also worth noting that T5-based models can be adapted to specific languages and domains. For
instance, the spivavtor [23] model, designed for the Ukrainian language, showed improved results when
prompted with "Перефразуйте" ("Paraphrase"), reaching a joint metric of 0.33832, which is nevertheless
significantly lower than that of dictionary-based approaches. This suggests that T5-based models have
potential for application in multilingual environments.</p>
      <p>Results for the spivavtor models on the Ukrainian language are presented in Table 4. The values in the table are
given for the Joint metric.</p>
      <p>Additionally, we evaluated the Llama3 [24] model with varying temperature settings, different prompts,
and multiple LoRA adapters [25], which allowed us to explore the impact of both generation parameters
and architectural modifications on detoxification quality. In our experiments, pre-trained versions of
Llama3 were enhanced with adapters trained on parallel datasets of toxic and non-toxic text.</p>
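      <p>A sketch of attaching such an adapter before fine-tuning is shown below; the rank and target modules are assumptions for the example rather than our exact configuration, and loading the 70B checkpoint in practice requires multiple GPUs.</p>
      <preformat>
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B")
lora = LoraConfig(r=16, lora_alpha=32,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the low-rank matrices are trained
      </preformat>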
      <p>Prompts were carefully selected to explicitly instruct the model to reduce toxicity while preserving
the original meaning. Generation was performed using various temperature values (ranging from 0.01
to 0.95), allowing us to evaluate the balance between deterministic and diverse outputs.</p>
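      <p>The temperature sweep itself can be expressed as a simple loop over the generation call; the prompt wording and the placeholder input below are illustrative assumptions, while the temperature values match the range we explored.</p>
      <preformat>
from transformers import pipeline

gen = pipeline("text-generation", model="meta-llama/Meta-Llama-3-70B")
prompt = ("Rewrite the following sentence so it is no longer toxic, "
          "keeping the original meaning: this is a toxic sentence")
for temperature in (0.01, 0.3, 0.7, 0.95):
    result = gen(prompt, do_sample=True, temperature=temperature,
                 max_new_tokens=64)
    print(temperature, result[0]["generated_text"])
      </preformat>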
      <p>The inuence of temperature on detoxication quality proved to be inversely proportional. Lower
temperature values yielded better results, as the model generated outputs that closely adhered to
the original sentence structure and meaning. In contrast, higher temperature values increased output
variability — although the overall meaning was preserved, the rewritten sentences ofien used completely
di‌erent wording, sometimes reintroducing potentially harmful expressions.</p>
      <p>We also tested several LoRA adapters that differed in size and training data. Interestingly, an adapter
trained on only 50 examples performed slightly better than another trained on 200 examples. However,
the difference between these configurations was minimal, and both still produced outputs that were
relatively close to the base Llama3. Results of the llama3-70B model are presented in Table 5. The values in the
table are given for the Joint metric.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Leaderboard overview</title>
      <p>Our team, The Toxinators 2000, participated in the Multilingual Text Detoxification task at PAN 2025.
The developed system combined dictionary-based filtering with generative rewriting using a large
language model (LLM), mT0; a sketch of this two-stage pipeline is given at the end of this section.
According to the evaluation on the platform, we ranked 5th out of 32 teams, achieving an average score
of 0.675 across all languages. Additionally, on the LLM-as-Judge leaderboard, we ranked 13th out of 32
teams, achieving an average score of 0.648 across all languages. The best results were obtained for the
following languages on the leaderboard:</p>
      <p>[Table: per-language leaderboard scores for English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, Amharic, Italian, Japanese, Hebrew, French, and Tatar]</p>
      <sec id="sec-6-1">
        <title>The full leaderboard is available on the competition page.</title>
      </sec>
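      <p>Conceptually, the submitted system reduces to the two-stage pipeline sketched below: lexicon-based pre-filtering followed by generative rewriting with the public mT0 detoxification checkpoint [22]; the prompt wording in the sketch is an assumption.</p>
      <preformat>
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("s-nlp/mt0-xl-detox-orpo")
model = AutoModelForSeq2SeqLM.from_pretrained("s-nlp/mt0-xl-detox-orpo")

def detoxify(text, lexicon):
    # Stage 1: lexicon-based pre-filtering (drop words with toxic parts).
    filtered = " ".join(w for w in text.split()
                        if not any(t in w.lower() for t in lexicon))
    # Stage 2: generative rewriting with the mT0 detox checkpoint.
    inputs = tok("Detoxify: " + filtered, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)
      </preformat>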
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In the course of this study, various approaches to the task of text detoxification were examined, including
the use of lexical resources (dictionaries), deep learning models, and modern large language models
(LLMs). The main objective was to compare the effectiveness of these methods across several languages
— English, Russian, Ukrainian, and others — and to identify optimal strategies for processing toxic
content depending on linguistic characteristics and data availability.</p>
      <p>The mT0 model, when combined with preliminary removal of toxic words and word parts,
demonstrated better performance than the baseline system. For example, in Tatar, the score increased from
0.580 to 0.617, and in Japanese — from 0.582 to 0.644. At the same time, the dictionary-based method
achieved the best results for Hindi, increasing the score from 0.351 to 0.449.</p>
      <p>These ndings conrm that the e‌ectiveness of text detoxication methods is directly inuenced
by linguistic specicity. No single approach proves universally optimal; instead, the choice of method
should be guided by the morphological complexity, cultural context, and data availability of the target
language.</p>
      <p>Thus, future research should focus on developing language-adaptive detoxification systems, which
combine the strengths of dictionary filtering, deep learning, and prompt-based LLM rewriting to ensure
both high-quality output and meaningful reduction of toxic content.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <sec id="sec-8-1">
        <title>The authors have not employed any Generative AI tools.</title>
        <p>N. Cheng, O. Chernoguz, O. Hart, O. Salpekar, O. Kalinli, P. Kent, P. Parekh, P. Saab, P. Balaji,
P. Rittner, P. Bontrager, P. Roux, P. Dollar, P. Zvyagina, P. Ratanchandani, P. Yuvraj, Q. Liang,
R. Alao, R. Rodriguez, R. Ayub, R. Murthy, R. Nayani, R. Mitra, R. Parthasarathy, R. Li, R. Hogan,
R. Battey, R. Wang, R. Howes, R. Rinott, S. Mehta, S. Siby, S. J. Bondu, S. Datta, S. Chugh, S. Hunt,
S. Dhillon, S. Sidorov, S. Pan, S. Mahajan, S. Verma, S. Yamamoto, S. Ramaswamy, S. Lindsay,
S. Lindsay, S. Feng, S. Lin, S. C. Zha, S. Patil, S. Shankar, S. Zhang, S. Zhang, S. Wang, S. Agarwal,
S. Sajuyigbe, S. Chintala, S. Max, S. Chen, S. Kehoe, S. Sattereld, S. Govindaprasad, S. Gupta,
S. Deng, S. Cho, S. Virk, S. Subramanian, S. Choudhury, S. Goldman, T. Remez, T. Glaser, T. Best,
T. Koehler, T. Robinson, T. Li, T. Zhang, T. Matthews, T. Chou, T. Shaked, V. Vontimitta, V. Ajayi,
V. Montanez, V. Mohan, V. S. Kumar, V. Mangla, V. Ionescu, V. Poenaru, V. T. Mihailescu, V. Ivanov,
W. Li, W. Wang, W. Jiang, W. Bouaziz, W. Constable, X. Tang, X. Wu, X. Wang, X. Wu, X. Gao,
Y. Kleinman, Y. Chen, Y. Hu, Y. Jia, Y. Qi, Y. Li, Y. Zhang, Y. Zhang, Y. Adi, Y. Nam, Yu, Wang,
Y. Zhao, Y. Hao, Y. Qian, Y. Li, Y. He, Z. Rait, Z. DeVito, Z. Rosnbrick, Z. Wen, Z. Yang, Z. Zhao,
Z. Ma, The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
[18] Hugging Face, Flan-t5 models, https://huggingface.co/google/an-t5-small, https://huggingface.</p>
        <p>co/google/an-t5-base, https://huggingface.co/google/an-t5-large, 2025. Accessed: 2025-07-06.
[19] s-nlp, Parallel detoxication dataset (small), https://github.com/s-nlp/parallel_detoxication_data
set/blob/main/parallel_detoxification_dataset_small.tsv, 2025. Accessed: 2025-07-06.
[20] Hugging Face, Paranmt for detox, https://huggingface.co/datasets/s-nlp/paranmt_for_detox, 2025.</p>
        <p>Accessed: 2025-07-06.
[21] s-nlp, Detox releases, https://github.com/s-nlp/detox/releases/, 2025. Accessed: 2025-07-06.
[22] Hugging Face, mt0-xl-detox-orpo, https://huggingf ace.co/s- nlp/mt0- xl- detox- orpo, 2025.</p>
        <p>Accessed: 2025-07-06.
[23] Hugging Face, Spivavtor models, https://huggingf ace.co/grammarly/spivavtor-xxl, https:
//huggingface.co/grammarly/spivavtor-large, 2025. Accessed: 2025-07-06.
[24] Hugging Face, Meta-llama-3-70b, https://huggingf ace.co/meta-llama/Meta-Llama-3-70B, 2025.</p>
        <p>Accessed: 2025-07-06.
[25] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank
adaptation of large language models, in: Proceedings of the 10th International Conference on
Learning Representations (ICLR), 2022.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          , Attention is all you need,
          <year>2023</year>
          . URL: https://arxiv.org/abs/1706.03762. arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Arivazhagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bapna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Firat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lepikhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , M. Krikun,
          <string-name>
            <given-names>M. X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          , G. Foster,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cherry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Macherey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Massively multilingual neural machine translation in the wild: Findings and challenges</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1907.05019. arXiv:1907.05019.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          , M. Fröbe,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2025:
          <article-title>Voight-Kampff Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Protasov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rizwan</surname>
          </string-name>
          , I. Alimova,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Konovalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liebeskind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litvak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Shah</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Takeshita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vanetik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ayele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Yimam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <article-title>Overview of the multilingual text detoxification task at PAN 2025</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Z. N.</surname>
          </string-name>
          ,
          <article-title>Ways of formation of invective vocabulary (based on the material of the Pyrenean national variant of the Spanish language and the American national variant of the English language)</article-title>
          , https://cyberleninka.ru/article/n/puti-formirovaniya-invektivnoy-leksiki-na-materiale-pireyskogo-natsionalno-varianta-ispanskogo-yazyka-i-amerikanskogo-natsionalnogo,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Vakhitova</surname>
          </string-name>
          ,
          <article-title>Invective Vocabulary of the Tatar Language: Functional and Ethnomental Aspects</article-title>
          ,
          <source>Ph.D. dissertation</source>
          , Kazan Federal University, Kazan, Russia,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Hugging</given-names>
            <surname>Face</surname>
          </string-name>
          , Multilingual toxic lexicon, https://huggingface.co/datasets/textdetox/multilingual_toxic_lexicon,
          <year>2025</year>
          . Accessed: 2025-07-06.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hosking</surname>
          </string-name>
          , hrq-vae, https://github.com/tomhosking/hrq-vae,
          <year>2025</year>
          . Accessed: 2025-07-06.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Auto-encoding variational bayes</article-title>
          , https://arxiv.org/abs/1312.6114,
          <year>2013</year>
          . arXiv preprint arXiv:1312.6114.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Logacheva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ustyantsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moskovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dale</surname>
          </string-name>
          , I. Krotova,
          <string-name>
            <given-names>N.</given-names>
            <surname>Semenov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <article-title>ParaDetox: Detoxification with parallel data</article-title>
          , in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
          <source>Association for Computational Linguistics</source>
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>6804</fpage>
          -
          <lpage>6818</lpage>
          . URL: https://aclanthology.org/2022.acl-long.469.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ustyantsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kozlova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Semenov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Logacheva</surname>
          </string-name>
          ,
          <article-title>Crowdsourcing of parallel corpora: the case of style transfer for detoxification</article-title>
          , in:
          <source>Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, co-located with the 47th International Conference on Very Large Data Bases (VLDB 2021)</source>
          ,
          <source>CEUR Workshop Proceedings</source>
          , Copenhagen, Denmark,
          <year>2021</year>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>49</lpage>
          . URL: http://ceur-ws.org/Vol-2932/paper2.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>T.-Y. Lin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Bourdev</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>C. L.</given-names>
          </string-name>
          <string-name>
            <surname>Zitnick</surname>
          </string-name>
          , P. Dollár, Microsoft COCO: Common objects in context,
          <year>2015</year>
          . URL: https://arxiv.org/abs/1405.0312. arXiv:1405.0312.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <article-title>Paraphrase-driven learning for open question answering</article-title>
          ,
          <source>in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1608</fpage>
          -
          <lpage>1618</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hamza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Florian</surname>
          </string-name>
          ,
          <article-title>Bilateral multi-perspective matching for natural language sentences</article-title>
          ,
          <source>in: Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>4144</fpage>
          -
          <lpage>4150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Llama Team, AI @ Meta, The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Hugging Face, Flan-T5 models, https://huggingface.co/google/flan-t5-small, https://huggingface.co/google/flan-t5-base, https://huggingface.co/google/flan-t5-large, 2025. Accessed: 2025-07-06.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] s-nlp, Parallel detoxification dataset (small), https://github.com/s-nlp/parallel_detoxification_dataset/blob/main/parallel_detoxification_dataset_small.tsv, 2025. Accessed: 2025-07-06.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Hugging Face, Paranmt for detox, https://huggingface.co/datasets/s-nlp/paranmt_for_detox, 2025. Accessed: 2025-07-06.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] s-nlp, Detox releases, https://github.com/s-nlp/detox/releases/, 2025. Accessed: 2025-07-06.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] Hugging Face, mt0-xl-detox-orpo, https://huggingface.co/s-nlp/mt0-xl-detox-orpo, 2025. Accessed: 2025-07-06.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] Hugging Face, Spivavtor models, https://huggingface.co/grammarly/spivavtor-xxl, https://huggingface.co/grammarly/spivavtor-large, 2025. Accessed: 2025-07-06.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] Hugging Face, Meta-Llama-3-70B, https://huggingface.co/meta-llama/Meta-Llama-3-70B, 2025. Accessed: 2025-07-06.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>