
ZHAW-CAI at CheckThat! 2023: Ensembling using Kernel Averaging

Pius von Däniken

Jan Deriu

deri@zhaw.ch

Mark Cieliebak

ciel@zhaw.ch

Zurich University of Applied Sciences, Centre for Artificial Intelligence, Winterthur, Switzerland

2023


We describe our approach to sub-task 1A of the CheckThat! Lab 2023 on multi-modal check-worthiness classification in English. The goal was to determine whether a tweet is worth fact-checking based on its text and image content. Our submission was based on a kernel ensemble of different uni-modal and multi-modal classifiers. It achieved second place out of 7 teams with an F1 score of 0.708.

Keywords: multi-modal, claim check-worthiness, multiple kernel learning, CheckThat!

1. Introduction

The CheckThat! Lab 2023 [1] included five tasks targeting various aspects of misinformation. We describe our approach to Task 1, Check-Worthiness in Multimodal and Unimodal Contents, which contained two sub-tasks. We participated in sub-task 1A, which targets multi-modal content. The goal was to classify a tweet consisting of both text and an image as check-worthy or not. The sub-task was offered in both Arabic and English. We only developed methods for the English data.

2. Related Work

The general problem of misinformation in social media has received a lot of interest from the community in recent years. Apart from the CheckThat! Lab tasks, there have been tasks focusing on identifying the veracity of a claim or rumour, such as RumourEval [7, 8] and FEVEROUS [9].

The first modern systems for check-worthiness detection include ClaimBuster [10] and ClaimRank [11]. Their main focus is on identifying check-worthy claims in political debates. The various CheckThat! Lab check-worthiness tasks have targeted different text genres, including social media and tweets in particular. While TF-IDF features are a staple of text classification and have been included in systems such as ClaimBuster, many successful previous participants [12, 13] used fine-tuned masked language models such as BERT [14] and RoBERTa [15] in their solutions. We include both approaches in our solution.

In terms of analysis of multi-modal social media content, the Hateful Memes challenge [16] has sparked a lot of interest in the community. For the challenge of multi-modality for disinformation in particular, we refer the reader to a recent survey [17]. The MM-Claims dataset [18] is a recent multi-modal claim detection dataset, on which this shared task is based. Our multi-modal sub-component is most similar to systems such as [19] that use cross-attention between modalities. However, we use a full transformer [20] encoder to fuse the modalities.

Another important recent development is the use of large language models such as the GPT family [21] and LLaMA [22], which exhibit remarkable zero-shot classification capabilities. We include this approach in our solution as well. Finally, we use a multiple kernel learning [23] approach to combine these disparate classifiers into a unified ensemble model.

3. Method

3.1. Data

The multi-modal check-worthiness sub-task is a binary classification task in which a tweet consisting of a short text and an image has to be classified as check-worthy or not. During the development phase of the shared task, the organizers released a training set, a validation set, and a dev-test set to be used for evaluation during development. The test data was released shortly before the submission deadline, and its labels were only released after the deadline. For all our experiments, we combine the released validation and dev-test sets into a single validation set. The individual systems are trained on the training set and evaluated on this combined validation set. The sizes of these sets and their label distributions are shown in Table 1. We note that each sample contains both text and image data. The training and development data come from the MM-Claims dataset [18]. For the full description of the task data, we refer the reader to the task overview [24].

3.2. Systems

We now describe the different uni-modal and multi-modal systems we trained and our method for combining them in a kernel-based ensemble.

3.2.1. Text N-gram Classifier

Our first uni-modal system is based on the tweet text only. We first pre-process the texts by replacing URLs¹, user handles, and sequences of emoji² with placeholder tokens. The text is then lower-cased and tokenized by splitting on white-space, and tokens shorter than 2 characters are discarded. Based on this we compute TF-IDF [25] vectors for each text. This means counting the uni-grams and bi-grams of tokens for each sample. We count only one occurrence of each n-gram per sample, meaning we ignore repetitions. We also ignore n-grams that appear in fewer than 3 samples of the training set. Based on these counts one can compute the inverse document frequency (IDF) of each n-gram. The resulting feature vectors are normalized to have unit Euclidean length. We use the TfidfVectorizer implementation provided by scikit-learn [26]. We call the resulting feature vectors $\phi_{\text{text-ngram}}$.
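The feature extraction described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used for the submission: the regular expressions stand in for the urlextract and emoji packages, the placeholder strings `<url>`, `<user>`, and `<emoji>` and all function names are hypothetical, and the smoothed IDF formula is the one scikit-learn documents as its default.

```python
import math
import re
from collections import Counter

# Assumption: simple patterns standing in for the urlextract and emoji packages.
URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\w+")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")


def tokenize(text):
    """Insert placeholders, lower-case, split on white-space, drop 1-char tokens."""
    text = URL_RE.sub(" <url> ", text)
    text = HANDLE_RE.sub(" <user> ", text)
    text = EMOJI_RE.sub(" <emoji> ", text)
    return [tok for tok in text.lower().split() if len(tok) >= 2]


def ngrams(tokens, max_n=2):
    """Set of all 1..max_n-grams; using a set ignores repeated occurrences."""
    return {" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def fit_idf(texts, min_df=3):
    """Smoothed IDF for every n-gram that occurs in at least min_df texts."""
    doc_freq = Counter(g for text in texts for g in ngrams(tokenize(text)))
    n_docs = len(texts)
    return {g: math.log((1 + n_docs) / (1 + df)) + 1.0
            for g, df in doc_freq.items() if df >= min_df}


def transform(text, idf):
    """Sparse TF-IDF vector (as a dict) scaled to unit Euclidean length."""
    vec = {g: idf[g] for g in ngrams(tokenize(text)) if g in idf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm > 0 else vec
```

Because each n-gram is counted at most once per sample, the term-frequency part is binary and the vector entries are just the IDF weights before normalization.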

We then use these feature vectors to train a linear Support Vector Machine (SVM) [27] with a regularization strength of 1. We again rely on the implementation provided by scikit-learn. In particular, we also employ its implementation of reweighting the classes based on their frequency in the training data, which was inspired by [28]. We will call this model text-ngram.
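As a small illustration of frequency-based class reweighting, the sketch below computes weights that are inversely proportional to class frequency. It assumes the heuristic that scikit-learn documents for its "balanced" class weights, n_samples / (n_classes * class_count); the text above does not state which exact option was used, and the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy example: 6 negative and 2 positive samples.
weights = balanced_class_weights([0] * 6 + [1] * 2)
```

In this toy example the rarer positive class receives three times the weight of the negative class, so errors on check-worthy tweets are penalized more heavily during training.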

3.2.2. MLM Classifier

Next, we trained another text-only system. For this we fine-tuned an electra-base-discriminator [29] model on the training data. ELECTRA models have the same architecture as BERT [14] but follow a different pre-training setup. During masked language modelling (MLM) pre-training there is both a generator network $G$ and a discriminator network $D$. During pre-training a certain number of input tokens are masked and $G$ has to predict the original tokens. The masked tokens are then replaced by those predicted by $G$, and $D$ has to determine for each token whether it is the original or has been replaced.
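To make the discriminator's training signal concrete, the following toy sketch builds a corrupted token sequence and the per-token labels the discriminator would have to predict. The generator here is an arbitrary callable standing in for the generator network, the masking probability of 0.15 is an assumption, and all names are hypothetical. As in the ELECTRA paper [29], a position where the generator happens to reproduce the original token is labelled as original.

```python
import random


def corrupt(tokens, generator, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) with label 1 for replaced tokens."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            # Masked position: the generator proposes a token for it.
            new_tok = generator(tokens, i)
            corrupted.append(new_tok)
            # A correct guess is indistinguishable from the original token.
            labels.append(int(new_tok != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels


# Toy generator that always proposes "the"; every position is masked here.
corrupted, labels = corrupt(["the", "cat", "sat"], lambda toks, i: "the", mask_prob=1.0)
```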

For our experiments we use the provided discriminator model checkpoint from Huggingface³ [30]. We show the training hyper-parameters in Table 2. We will call the resulting model electra-clf.

In Section 3.2.5 we will need access to a feature vector extracted from electra-clf. For this we remove the final dense layer of electra-clf, use the model activations as feature vectors, and scale them to unit length. We will refer to these feature vectors as $\phi_{\text{electra}}$.

¹ For this we use the urlextract package: https://github.com/lipoja/URLExtract.

² For this we use the emoji package: https://github.com/carpedm20/emoji/.

³ https://huggingface.co/google/electra-base-discriminator

3.2.3. Multi-Modal Classifier

Our multi-modal model relies on pre-trained encoder models for each modality. For text, we use the twitter-roberta-base⁴ checkpoint from Huggingface. This is a RoBERTa [15] model that has been pre-trained on 58M tweets [32]. The output of this text encoder has dimensions $n_t \times d_t$, where $n_t$ is the number of tokens and $d_t$ the dimension of the token embedding.

For images, we use a Vision Transformer (ViT) [33] that has been pre-trained on ImageNet21k [34]. We again use a checkpoint provided by Huggingface⁵. The model takes images at a 224 × 224 pixel resolution as input and processes them as a sequence of 16 × 16 pixel patches. This results in an output representation of size $n_i \times d_i$, where $n_i$ is the number of patches and $d_i$ the patch embedding dimension.

We first project both representations into a shared space of dimension $h$ using a dense layer and ReLU activation for each representation. This results in representations of sizes $n_t \times h$ and $n_i \times h$. We then concatenate them along the sequence dimension to get a new representation of size $n \times h$, where $n = n_t + n_i$. We then feed this representation through a transformer encoder [20] and a ReLU activation. The transformer encoder preserves the size of the representation, and we use mean pooling across the sequence length to get an embedding $\phi_{\text{multi-modal}}$ of size $h$. Finally, we normalize $\phi_{\text{multi-modal}}$ to unit length and feed it through a final dense layer for classification.
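The fusion steps above can be summarized in equations. This is a sketch under assumed notation: $X_t \in \mathbb{R}^{n_t \times d_t}$ and $X_i \in \mathbb{R}^{n_i \times d_i}$ denote the frozen text and image encoder outputs, $W_t, b_t, W_i, b_i, w, b$ are the trainable dense-layer parameters, and the single output logit $\hat{y}$ is an assumption about the form of the final classification layer.

```latex
\begin{align*}
H_t &= \mathrm{ReLU}(X_t W_t + b_t) \in \mathbb{R}^{n_t \times h}, &
H_i &= \mathrm{ReLU}(X_i W_i + b_i) \in \mathbb{R}^{n_i \times h}, \\
H &= [H_t ; H_i] \in \mathbb{R}^{n \times h}, &
n &= n_t + n_i, \\
Z &= \mathrm{ReLU}\big(\mathrm{TransformerEncoder}(H)\big) \in \mathbb{R}^{n \times h}, &
\bar{z} &= \frac{1}{n} \sum_{j=1}^{n} Z_j \in \mathbb{R}^{h}, \\
\phi_{\text{multi-modal}} &= \frac{\bar{z}}{\lVert \bar{z} \rVert_2}, &
\hat{y} &= w^\top \phi_{\text{multi-modal}} + b .
\end{align*}
```

Here $Z_j$ denotes the $j$-th row of $Z$, so $\bar{z}$ is the mean over all text-token and image-patch positions.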

We fine-tune this model on the training set but keep the weights of both the RoBERTa and the ViT encoders frozen. We call the resulting model multi-modal-clf and show its hyper-parameters in Table 3.

3.2.4. LLM Classifier

Recent Large Language Models (LLMs) such as the GPT family [21] have shown impressive few-shot and even zero-shot classification capabilities. In particular, chain-of-thought prompting [35], where the model is asked to generate a step-by-step explanation of how it arrives at a certain prediction, has shown much promise.

⁴ https://huggingface.co/cardiffnlp/twitter-roberta-base

⁵ https://huggingface.co/google/vit-base-patch16-224-in21k

[Table 3: hyper-parameters of multi-modal-clf. Rows: $h$, Epochs, Batch Size, Optimizer, Learning Rate, Weight Decay, Transformer Encoder Layers, Attention Heads, Transformer Feedforward Dimension. Values missing.]

We use the Language Model Query Language (LMQL) [36] to formulate the prompt and constrain the answers. We show the prompt written in LMQL in Listing 1.

Listing 1: LMQL Prompt

    argmax
        Consider the following Tweet: {claim}
        Do you think this Tweet contains a claim that is worth fact-checking?
        Answer: [ANSWER]
        Reasoning: [REASON]
    from
        openai/text-davinci-003
    where
        STOPS_AT(REASON, ".") and ANSWER in ['Yes', 'No']

The placeholder claim is where we insert the tweet text. The placeholders ANSWER and REASON are filled in by the model. In our case we use OpenAI's text-davinci-003⁶ model. The answer is constrained to the words Yes and No, which we use directly as predictions and call gpt-answer. The reasoning is constrained to one sentence, since generation stops at the first full stop. We apply a feature extraction procedure similar to that of text-ngram in Section 3.2.1 to these reasoning sentences. We forgo any special token replacements and use n-grams up to length 3 but keep the other parameters the same. The resulting feature vectors will be called $\phi_{\text{gpt-ngram}}$. We then train a linear SVM on $\phi_{\text{gpt-ngram}}$ and call it gpt-ngram.

⁶ https://platform.openai.com/docs/models/gpt-3-5

3.2.5. Kernel Ensemble

We have seen that all our base models have an associated feature vector: $\phi_{\text{text-ngram}}$, $\phi_{\text{electra}}$, $\phi_{\text{multi-modal}}$, and $\phi_{\text{gpt-ngram}}$. For each of these we can define a linear kernel. The kernel value for two samples $x$ and $x'$ for a given system $s$ is then defined as

$$k_s(x, x') = \phi_s(x)^\top \phi_s(x'),$$

where $\phi_s(x)$ is the feature vector of system $s$ for sample $x$. Given such a kernel $k_s$, we can then train an SVM. For $\phi_{\text{text-ngram}}$ and $\phi_{\text{gpt-ngram}}$ this is equivalent to their associated classifiers text-ngram and gpt-ngram. On the other hand, for $\phi_{\text{electra}}$ and $\phi_{\text{multi-modal}}$ we will call the resulting SVM classifiers electra-kernel and multi-modal-kernel respectively.

We also include an additional feature vector $\phi_{\text{img-untrained}}$ based on the ViT encoder. It uses the same ViT encoder as multi-modal-clf, which also provides a pooled representation for classification. We use this pooled representation as $\phi_{\text{img-untrained}}$. We will call the resulting kernel-based SVM classifier img-untrained-kernel.

Next, we show how we combine these kernels into an ensemble. Given a set of systems $S$, we can define their average kernel as

$$k_S(x, x') = \frac{1}{|S|} \sum_{s \in S} k_s(x, x'),$$

where $k_s$ is the linear kernel of system $s$. This is known as a fixed rule multiple kernel learning method [23]. We can then use $k_S$ to train an SVM. Our main submission was based on this method and used an average kernel with text-ngram, gpt-ngram, electra-kernel, and multi-modal-kernel as components. We will also show results for all-kernels, which additionally includes img-untrained-kernel in the average.⁷ All kernel-based SVMs were trained using a regularization strength of 1 and frequency-based class weights.
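The kernel averaging above can be illustrated with a minimal pure-Python sketch. The function names and the toy feature vectors are hypothetical; the sketch only shows how per-system linear Gram matrices are built and averaged, not how the submission was implemented.

```python
def linear_kernel(features_a, features_b):
    """Gram matrix K[i][j] = <a_i, b_j> for two lists of dense feature vectors."""
    return [[sum(u * v for u, v in zip(a, b)) for b in features_b]
            for a in features_a]


def average_kernel(kernels):
    """Element-wise mean of equally shaped Gram matrices (fixed-rule combination)."""
    n_rows, n_cols = len(kernels[0]), len(kernels[0][0])
    return [[sum(k[i][j] for k in kernels) / len(kernels) for j in range(n_cols)]
            for i in range(n_rows)]


# Two toy systems that describe the same two samples with different features.
system_a = [[1.0, 0.0], [0.0, 1.0]]
system_b = [[1.0], [1.0]]
k_avg = average_kernel([linear_kernel(system_a, system_a),
                        linear_kernel(system_b, system_b)])
```

Each system may use a different feature dimension, since only the Gram matrices (one row and column per sample) need to agree in shape. The averaged matrix could then be handed to any SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`, with a train-by-train matrix for fitting and a test-by-train matrix for prediction.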

4. Results

In Table 4 we show our main results. Our submission achieved an F1 score of 0.708 on the test set. We note that if we use the default classification threshold⁸, electra-kernel and all-kernels achieve that exact same score. This could indicate that our ensemble method is redundant. In practice, however, F1 scores can be sensitive to the decision threshold. In Figure 1 we show the Precision-Recall curves for each system. They show the precision and recall of a system for all potential thresholds. In the plot we include lines of constant F1 in light gray. We can see that the default thresholds (black cross marks) tend to select sub-optimal operating points.

We could therefore try to find a better classification threshold. For this we can use the validation set and select the threshold which maximizes the F1 score on it. The results are shown as red cross marks in Figure 1 and in the column called Tuned Threshold in Table 4. Since gpt-answer provides only binary outputs, we cannot change its threshold. The values for electra-clf and multi-modal-clf are missing since we did not compute their outputs on the validation set⁹. We can see that for most systems this method selects an even worse threshold. We had already noticed this during development, where system performance varied greatly between the released validation and dev-test sets, and therefore we chose the default classification threshold.

⁷ The difference between submission and all-kernels was due to time constraints.

⁸ For SVM-based systems the default threshold is 0; for classifiers trained using cross-entropy to produce class probabilities, the default threshold is 0.5.

⁹ This was due to time constraints.

Finally, in Table 4 we also include the scores that could be achieved if we had access to the ideal threshold. We computed it by selecting the threshold which maximizes the F1 score on the test set. Of course, in reality one never has access to this knowledge, but we include it here to show how much influence threshold selection can have on the system comparison.
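Threshold selection by F1 maximization, as used for both the tuned and the ideal thresholds, can be sketched as follows. The function names, the candidate grid (midpoints between consecutive distinct scores), and the toy scores are assumptions for illustration only.

```python
def f1_at(scores, labels, threshold):
    """F1 of the positive class when predicting positive for score > threshold."""
    preds = [s > threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(preds, labels))
    fp = sum(p and y == 0 for p, y in zip(preds, labels))
    fn = sum((not p) and y == 1 for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(scores, labels):
    """Threshold maximizing F1 among cuts between consecutive distinct scores."""
    distinct = sorted(set(scores))
    # One cut below all scores, plus the midpoints between neighbouring scores.
    cuts = [distinct[0] - 1.0] + [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(cuts, key=lambda t: f1_at(scores, labels, t))


# Toy SVM decision values: the default threshold 0 misses one positive sample.
scores = [-1.0, -0.2, 0.3, 0.8]
labels = [0, 1, 1, 1]
default_f1 = f1_at(scores, labels, 0.0)
tuned = best_threshold(scores, labels)
```

A threshold tuned this way is only as good as the agreement between the data it was tuned on and the data it is applied to, which is the effect discussed in the surrounding text.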

In Figure 1 we can also see that the curve for submission lies above those of the individual kernel-based systems over most recall values, meaning that for most fixed recall levels it achieves higher precision. This indicates that our ensembling method indeed yields an improved classifier. On the other hand, we can also see that electra-clf and multi-modal-clf perform even better.

In Figure 2 we show the Receiver Operating Characteristic (ROC) curves for all our systems. We can again see that electra-clf and multi-modal-clf have the highest area under the curve (AUC), meaning that for most fixed false positive rates they have a higher true positive rate than other systems. We can also see that our ensembling method outperforms individual kernel methods.
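For readers who want to reproduce a threshold-free comparison, the AUC can be computed directly from scores and labels. The sketch below uses the standard rank-based identity: the AUC equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half. The function name and the toy data are illustrative assumptions.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four positive-negative pairs are ranked correctly.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The pairwise formulation is quadratic in the number of samples, which is fine for data sets of this size. A sort-based implementation would be preferable for larger ones.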

5. Conclusion

We have laid out our solution to the CheckThat! Lab 2023 sub-task 1A on multi-modal check-worthiness classification. Our solution includes diverse components that we combine using a multiple kernel learning approach. Our submission achieved second place out of 7 teams with an F1 score of 0.708. While analysing our results, we noted that the performance measure can vary drastically based on the selected decision threshold. When considering threshold-free methods such as ROC and PR curves, we find that our ensemble indeed seems to perform better than its individual components. Nevertheless, we note that the directly fine-tuned models outperform our submission under this lens. The performance gap between electra-clf and electra-kernel, as well as between multi-modal-clf and multi-modal-kernel, is an open question requiring further study.

Acknowledgments

This work has been funded by the Hamison project supported by the EU ERA-Net CHIST-ERA; the Swiss National Science Foundation [20CH21_209672].

References

[1] A. Barrón-Cedeño, F. Alam, T. Caselli, G. Da San Martino, T. Elsayed, A. Galassi, F. Haouari, F. Ruggeri, J. M. Struß, R. N. Nandi, G. S. Cheema, D. Azizov, P. Nakov, The CLEF-2023 CheckThat! lab: Checkworthiness, subjectivity, political bias, factuality, and authority, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2023, pp. 506–517.

[2] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, H. Mubarak, A. Nikolov, Y. S. Kartal, J. Beltrán, Overview of the CLEF-2022 CheckThat! lab task 1 on identifying relevant claims in tweets, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.

[3] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, M. Kutlu, Y. S. Kartal, F. Alam, G. Da San Martino, A. Barrón-Cedeño, R. Míguez, J. Beltrán, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and political debates, 2021.

[4] S. Shaar, A. Nikolov, N. Babulkov, F. Alam, A. Barrón-Cedeño, T. Elsayed, M. Hasanain, R. Suwaileh, F. Haouari, G. Da San Martino, P. Nakov, Overview of CheckThat! 2020 English: Automatic identification and verification of claims in social media, CEUR Workshop Proceedings, 2020.

[5] P. Atanasova, P. Nakov, G. Karadzhov, M. Mohtarami, G. Da San Martino, Overview of the CLEF-2019 CheckThat! lab on automatic identification and verification of claims. Task 1: Check-worthiness, CEUR Workshop Proceedings, 2019.

[6] P. Atanasova, L. Marquez, A. Barrón-Cedeño, T. Elsayed, R. Suwaileh, W. Zaghouani, S. Kyuchukov, G. Da San Martino, P. Nakov, Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. Task 1: Check-worthiness, CEUR Workshop Proceedings, 2018.

[7] L. Derczynski, K. Bontcheva, M. Liakata, R. Procter, G. Wong Sak Hoi, A. Zubiaga, SemEval-2017 task 8: RumourEval: Determining rumour veracity and support for rumours, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 69–76. URL: https://aclanthology.org/S17-2006. doi:10.18653/v1/S17-2006.

[8] G. Gorrell, E. Kochkina, M. Liakata, A. Aker, A. Zubiaga, K. Bontcheva, L. Derczynski, SemEval-2019 task 7: RumourEval, determining rumour veracity and support for rumours, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 845–854. URL: https://aclanthology.org/S19-2147. doi:10.18653/v1/S19-2147.

[9] R. Aly, Z. Guo, M. S. Schlichtkrull, J. Thorne, A. Vlachos, C. Christodoulopoulos, O. Cocarascu, A. Mittal, The fact extraction and VERification over unstructured and structured information (FEVEROUS) shared task, in: Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Dominican Republic, 2021, pp. 1–13. URL: https://aclanthology.org/2021.fever-1.1. doi:10.18653/v1/2021.fever-1.1.

[10] N. Hassan, C. Li, M. Tremayne, Detecting check-worthy factual claims in presidential debates, in: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 1835–1838. URL: https://doi.org/10.1145/2806416.2806652. doi:10.1145/2806416.2806652.

[11] I. Jaradat, P. Gencheva, A. Barrón-Cedeño, L. Màrquez, P. Nakov, ClaimRank: Detecting check-worthy claims in Arabic and English, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 26–30. URL: https://aclanthology.org/N18-5006. doi:10.18653/v1/N18-5006.

[12] A. Savchev, AI Rational at CheckThat! 2022: using transformer models for tweet classification, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.

[13] R. M. Buliga Nicu, Zorros at CheckThat! 2022: ensemble model for identifying relevant claims in tweets, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.

[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.

[15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.

[16] D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, D. Testuggine, The hateful memes challenge: Detecting hate speech in multimodal memes, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 2611–2624. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1b84c4cee2b8b3d823b30e2d604b1878-Paper.pdf.

[17] F. Alam, S. Cresci, T. Chakraborty, F. Silvestri, D. Dimitrov, G. D. S. Martino, S. Shaar, H. Firooz, P. Nakov, A survey on multimodal disinformation detection, in: Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 6625–6643. URL: https://aclanthology.org/2022.coling-1.576.

[18] G. S. Cheema, S. Hakimov, A. Sittar, E. Müller-Budack, C. Otto, R. Ewerth, MM-Claims: A dataset for multimodal claim detection in social media, in: Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 962–979. URL: https://aclanthology.org/2022.findings-naacl.72. doi:10.18653/v1/2022.findings-naacl.72.

[19] K. D. N., A. Patil, Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks, in: Proc. Interspeech 2020, 2020, pp. 4243–4247. doi:10.21437/Interspeech.2020-1190.

[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

[21] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, CoRR abs/2005.14165 (2020). URL: https://arxiv.org/abs/2005.14165. arXiv:2005.14165.

[22] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.

[23] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268.

[24] F. Alam, A. Barrón-Cedeño, G. S. Cheema, S. Hakimov, M. Hasanain, C. Li, R. Míguez, H. Mubarak, G. K. Shahi, W. Zaghouani, P. Nakov, Overview of the CLEF-2023 CheckThat! lab task 1 on check-worthiness in multimodal and multigenre content, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, CLEF 2023, Thessaloniki, Greece, 2023.

[25] C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. URL: https://nlp.stanford.edu/IR-book/.

[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.

[27] C. Cortes, V. N. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297.

[28] G. King, L. Zeng, Logistic regression in rare events data, Political Analysis 9 (2001) 137–163.

[29] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=r1xMH1BtvB.

[30] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, Huggingface's transformers: State-of-the-art natural language processing, CoRR abs/1910.03771 (2019). URL: http://arxiv.org/abs/1910.03771. arXiv:1910.03771.

[31] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019. URL: https://openreview.net/forum?id=Bkg6RiCqY7.

[32] F. Barbieri, J. Camacho-Collados, L. Espinosa Anke, L. Neves, TweetEval: Unified benchmark and comparative evaluation for tweet classification, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 1644–1650. URL: https://aclanthology.org/2020.findings-emnlp.148. doi:10.18653/v1/2020.findings-emnlp.148.

[33] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, CoRR abs/2010.11929 (2020). URL: https://arxiv.org/abs/2010.11929. arXiv:2010.11929.

[34] T. Ridnik, E. Ben-Baruch, A. Noy, L. Zelnik-Manor, ImageNet-21K pretraining for the masses, 2021. arXiv:2104.10972.

[35] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, Q. Le, D. Zhou, Chain of thought prompting elicits reasoning in large language models, CoRR abs/2201.11903 (2022). URL: https://arxiv.org/abs/2201.11903. arXiv:2201.11903.

[36] L. Beurer-Kellner, M. Fischer, M. Vechev, Prompting is programming: A query language for large language models, PLDI '23 (2022).