=Paper=
{{Paper
|id=Vol-3395/T7-8
|storemode=property
|title=Multi-Lingual Contextual Hate Speech Detection Using Transformer-Based Ensembles
|pdfUrl=https://ceur-ws.org/Vol-3395/T7-8.pdf
|volume=Vol-3395
|authors=Maria Luisa Ripoll,Fadi Hassan,Joseph Attieh,Guillem Collell,Abdessalam Bouchekif
|dblpUrl=https://dblp.org/rec/conf/fire/RipollHACB22
}}
==Multi-Lingual Contextual Hate Speech Detection Using Transformer-Based Ensembles==
Maria Luisa Ripoll, Fadi Hassan, Joseph Attieh, Guillem Collell and Abdessalam Bouchekif
Huawei Technologies Oy., Finland

Forum for Information Retrieval Evaluation, December 9-13, 2022, India
maria.luisa.ripoll@huawei.com (M. L. Ripoll); fadi.hassan@huawei.com (F. Hassan); joseph.attieh@huawei.com (J. Attieh); guillem.collell1@huawei.com (G. Collell); abdessalam.bouchekif@huawei.com (A. Bouchekif)

Abstract

We present three different transformer-based approaches to contextual hate speech detection, submitted respectively to the three tasks of the HASOC 2022 competition [1]. To deal with the scarce data we use an ensemble of cross-validation-trained models (CV ensemble) that makes use of all available data while still retaining validation sets for early stopping. We try different methods such as ensembles of ensembles, cross-validation-trained ensembles and embedding isotropy optimization. Furthermore, we compare different base models for the ensembles as well as different solution proposals for the tasks. Our team, hate-busters, ranked 3rd in task 1, 5th in task 2, 3rd in task 3A, 1st in task 3B and 4th in task 3C.

Keywords

Ensemble, Multi-Lingual NLP, Hate Speech, HASOC 2022

1. Introduction

The automatic detection of hate speech in internet forums and platforms has the capacity to significantly improve the online safety of millions of users. Yet, abusive text detection is a challenging task. Firstly, language is dynamic, especially conversational language in social media, where words are often coined or given new meanings. The constantly changing vocabulary requires an offensive language detection system to have an architecture capable of dealing with unknown words. Secondly, the rise of social media in recent decades has also led to an increase in global inter-connectivity and a rich exchange of cultures and languages. It is not uncommon to find "macaronic" texts, namely texts written in hybrid mixtures of more than one language. While some natural language processing systems are already able to recognise and classify comments in multi-lingual settings, such code-mixed language within single text samples is an added difficulty for current algorithms. Furthermore, dialects and so-called low-resource languages often lack the large amounts of annotated data required to train deep neural networks, which leads to uneven levels of moderation and protection against hate speech across languages. Thirdly, language and semantics are both very complex and highly contextual. Sometimes, external knowledge is needed to understand the meaning of a sentence. In the case of a conversation, a machine should take into account the previous text inputs when classifying a sample, since they could affect the sample's label. For instance, a text with only neutral content should be moderated if it is a reply agreeing with a hateful comment. Because of this, a good hate speech detection system should include contextual knowledge. Finally, an additional task for abusive text detection is identifying the target of the hate speech, which provides more fine-grained information about the text.
Facing all of these issues is important if we are to fully solve the problem of automatic hate speech detection and moderation. This is one of the main purposes behind the HASOC 2022 competition [2] assignments presented this year. The competition is motivated by the challenges discussed above and asks participants to solve tasks using Twitter data. The first task is the (1) Identification of Conversational Hate-Speech in Code-Mixed Languages (ICHCL) [3]. This task tests a model's ability to identify hate speech in a multi-lingual setting, the languages being German and Hinglish (macaronic text of Hindi and English). For this task we test different ensemble architectures and compare results. The second task is (2) multiclass ICHCL for Hinglish text, and it consists of identifying whether a sample is standalone hate, contextual hate or not hateful [3]. Thus, the task adds the challenges related to context. Our second system uses an ensemble of cross-validation-trained models to solve the problem. The final task is (3) Offensive Language Identification in Marathi, and it addresses the problems of low-resource languages as well as the added complexity of identifying targeted hate speech [1, 4]. This task is divided into three sub-tasks, each dedicated to a different level of multi-class classification of targeted hate speech. We present a single model capable of solving all three sub-tasks. By pre-processing the data and merging the multiple labels per sample into one, we convert the multi-class, multi-label problem into a multi-class, single-label problem.

This work is structured into six sections. Section 1 is the introduction. Section 2 outlines the related work. Section 3 describes the data for every task and section 4 gives an overview of the processing steps and system architectures per task. The results are presented in section 5, followed by the conclusion in section 6 and the Appendix.

2. Related Work

The automatic detection of hate speech is a research area that has been growing over the last few years. There have been numerous surveys detailing the available data and methodologies used in hate speech detection [5, 6, 7, 8], and most of them conclude that the definition of hate speech itself is still unclear [5, 8]. This uncertainty leads to disparate annotation guidelines across datasets, languages and domains, and makes the task of hate speech detection very complex. Due to its complexity and relevance, a series of data challenges have been posted online to encourage researchers to solve particular hate-speech detection tasks. Among these challenges were: Profiling Hate Speech Spreaders on Twitter - PAN 2021 [9], SemEval-2019 Task 5: Detection of Hate Speech Against Immigrants and Women in Twitter (HatEval) [10] and HASOC 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages and Conversational Hate Speech [11]. This year we participate in HASOC 2022 [1], which builds upon the tasks presented in HASOC 2021. As detailed by Modha et al. [12], the top submissions to last year's challenges included various technologies such as graph convolutional networks [13], transformers [14] and ensembles of transformers [15]. HASOC 2021 concluded that the transformer architecture is state-of-the-art for automatic hate speech detection.

3. Data

The data used was provided by HASOC 2022 [2]. The text samples for the first and second tasks were organized in a 3-level tree structure with tweets, tweet comments and comment replies.
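To make this tree structure concrete, a single conversation sample for tasks 1 and 2 can be pictured as a tweet node with nested comments and replies, each node carrying its own label. The field names below are purely illustrative and not the official HASOC 2022 schema.

```python
# Illustrative only: a 3-level conversation sample (tweet -> comment -> reply).
# Field names are hypothetical, not the official HASOC 2022 file format.
sample_tree = {
    "text": "original tweet text",
    "label": "NOT",
    "comments": [
        {
            "text": "comment agreeing with the tweet",
            "label": "HOF",
            "replies": [
                {"text": "reply to the comment", "label": "HOF"},
            ],
        },
    ],
}
```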
Task 1 included 6284 samples in two languages, German and Hinglish, and had binary labels for hateful (HOF) and non-hateful text (NOT). Task 2 used the same Hinglish text samples as task 1, excluding the German training data, and included three labels: non-hateful data (NONE), standalone hate (SHOF) and contextual hate (CHOF), the latter being those comments that are only hateful in the context of their parent text. The data for task 3 [16, 17] is no longer in a tree structure; all sample tweets and respective labels are independent from each other. There are a total of 3013 training samples, all of which are used in subtask A to distinguish offensive posts (HOF) from non-offensive (NOT) ones. Subtask B, however, only uses the 1068 samples that are classified as offensive in subtask A and determines whether they are targeted (TIN) or untargeted (UNT) insults. The 740 data samples that are classified as targeted insults in subtask B form the training data for subtask C, which consists of classifying the target of the insult as an individual (IND), a group (GRP) or other (OTH). Table 1 presents a summary of the languages, number of train samples, number of test samples and class labels per task and subtask.

Table 1: Data summary of language, number of samples and class labels per task. Train samples in parentheses are the subset of samples which do not have NaN label values for the corresponding subtask.

Task | Language | Train Samples | Test Samples | Full Dataset | Classes
1    | German   | 307           | 81           | 388          | Binary (NOT/HOF)
1    | Hinglish | 4901          | 995          | 5896         | Binary (NOT/HOF)
1    | Both     | 5208          | 1076         | 6284         | Binary (NOT/HOF)
2    | Hinglish | 4901          | 995          | 5896         | Multi (NONE/SHOF/CHOF)
3A   | Marathi  | 3013          | 510          | 3523         | Binary (NOT/OFF)
3B   | Marathi  | 3013 (1068)   | 510          | 1578         | Binary (TIN/UNT)
3C   | Marathi  | 3013 (740)    | 510          | 1250         | Multi (IND/GRP/OTH)

4. Methodology

4.1. Task 1: ICHCL Binary Classification

For the first task, the team worked on 3 different solutions in parallel and then merged them into one ensemble model, as can be seen in Figure 1. The first solution used an XLM-RoBERTa Hugging Face model [18, 19] as a base and added an extra layer as a head. The extra layer was trained to jointly optimize the loss function on the classification task as well as the isotropy of the last hidden layer's embeddings. In a non-isotropic space, the information in the embeddings is not uniformly distributed among all directions of the space. This is not desirable, as the representations vary the most along a few top directions, which limits the expressiveness of the space. To improve the isotropy of the space, a penalty term is added to the loss, making the embeddings less similar to each other and, as shown in previous work [20], better prepared to learn the downstream task. The input data was pre-processed by concatenating the text with any parents (the tweet in the case of a comment, and the tweet and comment in the case of a reply), then removing all emojis, redundant spaces, URLs, underscores and hashtags, and finally tokenizing the sample using the model's corresponding tokenizer. The data was divided into training and validation sets using a 75-25 split in order to be able to use early stopping. For the rest of this work we refer to this model as the Isotropy Model.
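The snippet below is a minimal sketch of the pre-processing described above for the Isotropy Model: a sample is concatenated with its parent texts and stripped of emojis, URLs, underscores, hashtags and redundant whitespace before tokenization. The checkpoint name and the exact cleaning rules are assumptions for illustration; the paper does not list the precise regular expressions used.

```python
import re
from transformers import AutoTokenizer

# Illustrative checkpoint; the paper's first solution uses an XLM-RoBERTa model [18, 19].
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

def clean(text: str) -> str:
    """Remove URLs, hashtags, underscores, symbols/emojis and redundant spaces (illustrative rules)."""
    text = re.sub(r"https?://\S+", " ", text)    # URLs
    text = re.sub(r"#\w+", " ", text)            # hashtags
    text = text.replace("_", " ")                # underscores
    text = re.sub(r"[^\w\s.,!?@']", " ", text)   # crude emoji/symbol removal
    return re.sub(r"\s+", " ", text).strip()     # redundant spaces

def preprocess(sample_text: str, parents: list[str], max_length: int = 256):
    """Concatenate a sample with its parent texts (tweet, comment) and tokenize it."""
    full_text = " ".join(clean(t) for t in parents + [sample_text])
    return tokenizer(full_text, truncation=True, max_length=max_length)

encoded = preprocess("reply text", parents=["tweet text", "comment text"])
```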
[Figure 1: System Architecture for Task 1, combining the 5-fold CV Ensemble, the XLM Ensemble and the Isotropy Model. In this figure, cylinders represent data; smaller cylinders are color-coded subsets, with blue representing train data and red representing evaluation data. Rectangles correspond to models, squares to individual layers and circles to scores.]

The second solution used the same XLM-RoBERTa model and tokenizer as the Isotropy Model. In this case, the samples were also concatenated with all parent texts, but the data was not normalized: nothing was removed. Furthermore, since the amount of data for training such a large model was quite scarce, it was decided that this model would use the entirety of the available data for training and spare no samples for validation. The model was trained for 5 epochs. This architecture was then run using 5 different seeds, and the scores of the five trained models were averaged, resulting in an ensemble which we refer to as the XLM Ensemble.

The final solution used a different XLM-RoBERTa base model and tokenizer, namely Microsoft's InfoXLM [21]. We use InfoXLM since it is a framework that focuses on improving cross-lingual representations, which is relevant for task 1's multilingual classification assignment. Using the InfoXLM tokenizer, every text was aggregated with its parents using separator tokens. In this scenario, the aim was to be able to apply early stopping to the models to avoid overfitting, while also using all the available data for training. To do this, we applied a form of cross-validation training where the dataset was divided into 5 folds. We trained 5 models, using every fold once for validation and the other 4 for training, and then averaged the scores of all 5 models into an ensemble. This ensemble is hereafter referred to as a K-fold CV (Cross Validation) Ensemble.

The scores of all three solutions are averaged into one ensemble. Since two of the three models are ensembles themselves, we call this last ensemble the ensemble of ensembles.

4.2. Task 2: ICHCL Multiclass Classification

The system architecture for the model submitted to task 2 is a 5-fold CV Ensemble as described in section 4.1 and uses the same data pre-processing steps. Figure 4 in the appendix shows Task 2's model architecture.

4.3. Task 3: Offensive Language Identification in Marathi

The third task is divided into three subtasks as described in section 3. Our aim is to train a joint model capable of solving all three subtasks at once. To achieve this, we can look at every subtask as a categorical label, and the labels within that subtask as classes. From this point of view, task 3 is reduced to a multi-label, multi-class problem. We can further simplify the task by merging all labels into a single categorical label, thus converting the multi-label, multi-class problem into a single-label, multi-class problem. Since only the HOF class is used in subtask B and only TIN is used in subtask C, the number of label combinations across subtasks is small and the new label only has five possible classes: NOT, OFF-UNT, OFF-TIN-IND, OFF-TIN-GRP and OFF-TIN-OTH (a sketch of this merging is given below). Figure 2 shows the data distribution across classes and subtasks with the original labels; it can be compared to Figure 3, which shows the data distribution after merging the labels into one. Once the labels are merged, we tokenize the data and feed it into our CV Ensemble architecture. For task 3 we tested two different Hugging Face models, InfoXLM [21] and L3Cube [22], the latter being a language-specific model trained on Hindi as well as the target language Marathi. We also compared results when using 5- and 10-fold cross-validation training.
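The label merging described above amounts to a small mapping from the three subtask labels to one 5-class label. The snippet below is a minimal sketch assuming a pandas data frame; the column names and toy values are assumptions for illustration (the offensive class appears as HOF in the text and OFF in Table 1), not the actual HASOC 2022 file layout.

```python
import pandas as pd

# Hypothetical column names; the official HASOC 2022 files use their own headers.
df = pd.DataFrame({
    "subtask_a": ["NOT", "HOF", "HOF", "HOF"],
    "subtask_b": [None, "UNT", "TIN", "TIN"],
    "subtask_c": [None, None, "IND", "GRP"],
})

def merge_labels(row) -> str:
    """Collapse the three subtask labels into a single 5-class label."""
    if row["subtask_a"] == "NOT":
        return "NOT"
    if row["subtask_b"] == "UNT":
        return "OFF-UNT"
    return f"OFF-TIN-{row['subtask_c']}"  # OFF-TIN-IND / OFF-TIN-GRP / OFF-TIN-OTH

df["merged_label"] = df.apply(merge_labels, axis=1)
print(df["merged_label"].tolist())  # ['NOT', 'OFF-UNT', 'OFF-TIN-IND', 'OFF-TIN-GRP']
```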
5. Results

The results for task 1 are shown in Table 2. The 5-fold CV ensemble received the best scores among the different architectures. The two most notable differences in this system are, on the one hand, the model used (InfoXLM) and, on the other hand, the cross-validation training format, which allows the use of early stopping while still using the entire dataset for training. Due to the limited number of submission attempts we were not able to test the cross-validation training across different base models, which would give us more insight into the architecture's effect on the task. For the same reason, we did not test the Isotropy Model by itself. The other systems tested in task 1 were also ensembles and achieved scores very similar to the 5-fold CV ensemble results. The final ensemble model (the ensemble of ensembles) included the 5-fold CV ensemble, the XLM ensemble and the Isotropy Model, yet it did not achieve better results than the 5-fold CV ensemble by itself (which scored 0.3% Macro F1 higher), suggesting that this form of ensemble stacking is not effective. Our team achieved 3rd position in this task, with an F1 difference of 4.73% to the top submission and of 0.11% to the second position.

[Figure 2: Label Distribution in Task 3 (number of samples per original label, NOT/OFF/TIN/UNT/IND/GRP/OTH, across subtasks A, B and C). Figure 3: Label Distribution of Merged Labels (number of samples per merged class: NOT, OFF-UNT, OFF-TIN-IND, OFF-TIN-GRP, OFF-TIN-OTH).]

Table 2: Test results for tasks 1 and 2. Run names can be found in Table 4 in the Appendix.

Task | Model                 | Macro F1 | Macro Precision | Macro Recall
1    | XLM ensemble          | 65.69%   | 65.74%          | 65.71%
1    | Ensemble of ensembles | 65.79%   | 65.80%          | 65.79%
1    | 5 fold CV ensemble    | 66.10%   | 66.34%          | 66.19%
2    | Single XLM (InfoXLM)  | 43.87%   | 48.03%          | 43.87%
2    | 5 fold CV ensemble    | 46.51%   | 53.48%          | 47.31%

Table 2 also shows the results for task 2. The overall results of the submissions presented by all teams are lower than those for task 1 due to the increased complexity of the task. Our team ranked 5th and achieved a score of 46.51%, which was 2.88% lower than the top score (49.39%). Within our submissions, we compare our 5-fold CV ensemble method with a single InfoXLM model and observe an increase of almost 3% Macro F1 when using the cross-validation ensemble.

For task 3 we compare three different architectures tested on all three subtasks. When comparing the performance of the 5-fold CV ensemble using InfoXLM as the base model against L3Cube, we see a relatively significant increase in performance for the Marathi-trained L3Cube across all subtasks. The increase is +4.21%, +12.78% and +8.19% F1 for subtasks A, B and C respectively. The second comparison we can make is between the 5-fold CV ensemble and the 10-fold CV ensemble. Increasing the value of k was beneficial across all subtasks, yielding improvements of +1.17%, +8.71% and +12.13% F1 for subtasks A, B and C. In this case, the difference in the magnitude of the improvements could also be due to the difference in the number of samples available per subtask (see Figure 3). Subtask A has almost 3 times as many samples as subtask B and almost 4 times as many as subtask C. Taking away a fifth of the data for validation instead of a tenth makes a significant difference with such scarce data.
Another observation is the degrading performance as the tasks become more fine-grained, with the overall performance of submissions for subtask A being higher than for subtask B, which is in turn higher than for subtask C. Overall, the results from task 3 clearly indicate that both language-specific training and a higher value of k in the k-fold CV ensemble consistently improve model performance on these tasks. For subtask A our team ranked 3rd, scoring 1.77% F1 lower than the top submission and 0.21% F1 lower than the second submission. For subtask B, our team ranked 1st with an improvement of 0.58% F1 over the second position. Finally, for subtask C, our team ranked 4th. The variation of results for this subtask was larger than for the other subtasks, with the top result achieving a score 16.78% F1 higher than the second result; our team scored 23.52% F1 lower than the top rank.

Table 3: Test results for task 3. The run names for every model can be found in Table 5 in the Appendix.

Subtask | Model                        | Macro F1 | Macro Precision | Macro Recall
A       | 5 fold CV ensemble - InfoXLM | 90.39%   | 90.39%          | 90.40%
A       | 5 fold CV ensemble - L3cube  | 94.51%   | 94.52%          | 94.50%
A       | 10 fold CV ensemble - L3cube | 95.68%   | 95.92%          | 95.62%
B       | 5 fold CV ensemble - InfoXLM | 70.58%   | 71.11%          | 70.27%
B       | 5 fold CV ensemble - L3cube  | 83.36%   | 83.10%          | 84.86%
B       | 10 fold CV ensemble - L3cube | 92.07%   | 91.14%          | 93.42%
C       | 5 fold CV ensemble - InfoXLM | 52.23%   | 47.16%          | 59.90%
C       | 5 fold CV ensemble - L3cube  | 60.42%   | 54.89%          | 67.77%
C       | 10 fold CV ensemble - L3cube | 72.55%   | 79.99%          | 77.05%

6. Conclusion

We have presented transformer-based ensemble solutions for each of the tasks in the HASOC 2022 competition. Our method of cross-validation (CV) training has been shown to be beneficial compared to our XLM Ensemble approach in this low-resource setting for hate speech detection. The advantage of this CV method is that, by alternating the validation and training sets, it allows us to use all of the available data for training while still implementing early stopping. From our experiments with InfoXLM and L3Cube in task 3 we can also conclude that including the evaluation language in the model's training data increases performance significantly; therefore, using language-specific models instead of state-of-the-art cross-lingual and multilingual models is still, in several cases, the better option.

References

[1] T. Ranasinghe, K. North, D. Premasiri, M. Zampieri, Overview of the HASOC subtrack at FIRE 2022: Offensive Language Identification in Marathi, in: Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, CEUR, 2022.
[2] S. Satapara, P. Majumder, T. Mandl, S. Modha, H. Madhu, T. Ranasinghe, M. Zampieri, K. North, D. Premasiri, Overview of the HASOC Subtrack at FIRE 2022: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages, in: FIRE 2022: Forum for Information Retrieval Evaluation, Virtual Event, 9th-13th December 2022, ACM, 2022.
[3] S. Modha, T. Mandl, P. Majumder, S. Satapara, T. Patel, H. Madhu, Overview of the HASOC Subtrack at FIRE 2022: Identification of Conversational Hate-Speech in Hindi-English Code-Mixed and German Language, in: Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, CEUR, 2022.
[4] M. Zampieri, T. Ranasinghe, M. Chaudhari, S. Gaikwad, P. Krishna, M. Nene, S. Paygude, Predicting the type and target of offensive social media posts in marathi, Social Network Analysis and Mining 12 (2022) 77. URL: https://doi.org/10.1007/s13278-022-00906-8. doi:10.1007/s13278-022-00906-8.
[5] F. E. Ayo, O. Folorunso, F. T. Ibharalu, I. A. Osinuga, Machine learning techniques for hate speech classification of twitter data: State-of-the-art, future challenges and research directions, Computer Science Review 38 (2020) 100311.
[6] N. Chetty, S. Alathur, Hate speech review in the context of online social networks, Aggression and Violent Behavior 40 (2018) 108–118.
[7] M. A. Paz, J. Montero-Díaz, A. Moreno-Delgado, Hate speech: A systematized review, Sage Open 10 (2020) 2158244020973022.
[8] F. Alkomah, X. Ma, A literature review of textual hate speech detection methods and datasets, Information 13 (2022). URL: https://www.mdpi.com/2078-2489/13/6/273. doi:10.3390/info13060273.
[9] F. Rangel, G. L. De la Peña Sarracén, B. Chulvi, E. Fersini, P. Rosso, Profiling hate speech spreaders on twitter task at PAN 2021, in: CLEF (Working Notes), 2021, pp. 1772–1789.
[10] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. Rangel Pardo, P. Rosso, M. Sanguinetti, SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 54–63. URL: https://aclanthology.org/S19-2007. doi:10.18653/v1/S19-2007.
[11] S. Modha, T. Mandl, G. K. Shahi, H. Madhu, S. Satapara, T. Ranasinghe, M. Zampieri, Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech, in: FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, 13th-17th December 2021, ACM, 2021.
[12] S. Modha, T. Mandl, G. K. Shahi, H. Madhu, S. Satapara, T. Ranasinghe, M. Zampieri, Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech, in: Forum for Information Retrieval Evaluation, 2021, pp. 1–3.
[13] N. Bölücü, P. Canbay, Hate speech and offensive content identification with graph convolutional networks, in: Forum for Information Retrieval Evaluation (Working Notes) (FIRE), CEUR-WS.org, 2021.
[14] M. Nene, K. North, T. Ranasinghe, M. Zampieri, Transformer models for offensive language identification in marathi, in: Forum for Information Retrieval Evaluation (Working Notes) (FIRE), CEUR-WS.org, 2021.
[15] A. Glazkova, M. Kadantsev, M. Glazkov, Fine-tuning of pre-trained transformers for hate, offensive, and profane content detection in english and marathi, arXiv preprint arXiv:2110.12687 (2021).
[16] T. Mandl, S. Modha, G. K. Shahi, H. Madhu, S. Satapara, P. Majumder, J. Schäfer, T. Ranasinghe, M. Zampieri, D. Nandini, A. K. Jaiswal, Overview of the HASOC subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021. URL: http://ceur-ws.org/.
[17] S. Gaikwad, T. Ranasinghe, M. Zampieri, C. M. Homan, Cross-lingual offensive language identification for low resource languages: The case of marathi, in: Proceedings of RANLP, 2021.
[18] J. Plu, tf-xlm-roberta-large, 2020. URL: https://huggingface.co/jplu/tf-xlm-roberta-large.
[19] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[20] D. Biś, M. Podkorytov, X. Liu, Too much in common: Shifting of embeddings in transformer language models and its implications, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 5117–5130. URL: https://aclanthology.org/2021.naacl-main.403. doi:10.18653/v1/2021.naacl-main.403.
[21] Z. Chi, L. Dong, F. Wei, N. Yang, S. Singhal, W. Wang, X. Song, X.-L. Mao, H. Huang, M. Zhou, InfoXLM: An information-theoretic framework for cross-lingual language model pre-training, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 3576–3588. URL: https://aclanthology.org/2021.naacl-main.280. doi:10.18653/v1/2021.naacl-main.280.
[22] R. Joshi, L3Cube-HindBERT and DevBERT: Pre-trained BERT transformer models for Devanagari based Hindi and Marathi languages (2022). doi:10.13140/RG.2.2.14606.84809.

A. Appendix

[Figure 4: System architecture of a 5-fold CV Ensemble: the data is split into train and validation sets by 5-fold cross-validation sampling, five XLM-RoBERTa models are trained and their scores are averaged. This was the main architecture submitted for task 2 as well as one of the models used in the ensemble solution for task 1. The final and best system for task 3 was a 10-fold version of this architecture.]

Table 4: Run names for tasks 1 and 2.

Task | Model                 | Run Name
1    | XLM ensemble          | traineval ensemb
1    | Ensemble of ensembles | ensemble_3models_task1.csv
1    | 5 fold CV ensemble    | infoxlm-5folds-fixed-joint_avg_task1
2    | Single XLM (InfoXLM)  | infoxlm_large-v2-single
2    | 5 fold CV ensemble    | infoxlm-5folds-joint_fixed_task2

Table 5: Run names for task 3.

Subtask | Model                        | Run Name
A       | 5 fold CV ensemble - InfoXLM | infoxlm_large-5folds-joint
A       | 5 fold CV ensemble - L3cube  | task3-5fold-concat_a
A       | 10 fold CV ensemble - L3cube | task3-10fold-concat_avg_a
B       | 5 fold CV ensemble - InfoXLM | infoxlm_large-5folds-concat_b
B       | 5 fold CV ensemble - L3cube  | task3-5fold-concat_avg_b
B       | 10 fold CV ensemble - L3cube | task3-10fold-concat_avg_b
C       | 5 fold CV ensemble - InfoXLM | infoxlm_large-5folds-concat_c
C       | 5 fold CV ensemble - L3cube  | task3-5fold-concat_avg_c
C       | 10 fold CV ensemble - L3cube | task3-10fold-concat_avg_c
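As a rough, self-contained illustration of the k-fold CV Ensemble depicted in Figure 4, the sketch below splits the training data into k folds, trains one model per fold split, and averages the models' class scores on the test set. A TF-IDF plus logistic regression classifier stands in for the fine-tuned transformer models (InfoXLM / L3Cube) so the example stays runnable; the toy data, features and hyper-parameters are placeholders, not the settings used in the paper.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the HASOC samples.
texts = ["sample " + str(i) for i in range(20)]
labels = np.array([0, 1] * 10)
test_texts = ["unseen sample a", "unseen sample b"]

k = 5
folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
test_scores = []

for train_idx, val_idx in folds.split(texts, labels):
    # In the paper, each fold fine-tunes a transformer with early stopping on the
    # held-out fold; a TF-IDF + logistic regression pipeline is used here only to
    # keep the sketch self-contained. The validation fold (val_idx) would drive
    # early stopping and is unused in this simplified version.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit([texts[i] for i in train_idx], labels[train_idx])
    test_scores.append(model.predict_proba(test_texts))

# Average the k models' class scores; the argmax gives the ensemble prediction.
avg_scores = np.mean(test_scores, axis=0)
predictions = avg_scores.argmax(axis=1)
```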