SemanticCuetSync at CheckThat! 2024: Finding Subjectivity in News Articles using Llama
Notebook for the CheckThat! Lab at CLEF 2024

Ashraful Islam Paran†, Md. Sajjad Hossain†, Symom Hossain Shohan†, Jawad Hossain, Shawly Ahsan and Mohammed Moshiul Hoque*
Chittagong University of Engineering and Technology, Chattogram-4349, Bangladesh

Abstract
This study introduces an LLM-based technique for detecting subjectivity and objectivity in English and Arabic news articles. Although several transformer-, deep learning (DL)-, and machine learning (ML)-based techniques were explored for the task, the LLM (Llama-3-8b) outperformed the other models, obtaining the highest macro F1-scores of 72.46% (English) and 50.36% (Arabic). The proposed LLM-based solution ranked 4th (Arabic) and 12th (English) in the task competition. The research emphasizes the potential of advanced LLMs like Llama-3-8b in achieving high subjectivity and objectivity detection accuracy, which is essential for applications in media analysis, sentiment analysis, and automated content moderation. This study contributes to the development of robust multilingual text classification systems, paving the way for more sophisticated and accurate linguistic analysis tools.

Keywords
Natural Language Processing, Subjectivity, Large Language Model (LLM), Llama, Objectivity

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09-12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
Email: u1904029@student.cuet.ac.bd (A. I. Paran); u1904031@student.cuet.ac.bd (Md. S. Hossain); u1904048@student.cuet.ac.bd (S. H. Shohan); u1704039@student.cuet.ac.bd (J. Hossain); u1704057@student.cuet.ac.bd (S. Ahsan); moshiul_240@cuet.ac.bd (M. M. Hoque)
ORCID: 0009-0001-4795-3816 (A. I. Paran); 0009-0008-8670-8857 (Md. S. Hossain); 0009-0004-0834-2037 (S. H. Shohan); 0009-0006-6051-8989 (J. Hossain); 0009-0003-9940-9681 (S. Ahsan); 0000-0001-8806-708X (M. M. Hoque)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In the era of technology, the internet is a constant source of textual information, including news articles, social media posts, blogs, and reviews. These texts offer a diverse range of information, opinions, and narratives. The ability to distinguish between subjective and objective content, especially in news articles, is crucial. Subjective text often includes personal opinions, emotions, and biases that can significantly influence the reader's perception, whereas objective text presents factual and impartial observations. The automatic classification of sentences as subjective or objective has wide-ranging applications, including media analysis, sentiment analysis, and information retrieval, and can significantly enhance the quality of information processing and extraction across various domains. How news editorials address political issues may influence individuals with differing ideological beliefs [1]. Researchers have proposed a variety of methods to categorize subjective and objective news articles [2, 3, 4, 5, 6], but the majority of these approaches focus on high-resource languages. The primary challenge in classifying subjectivity and objectivity is the intricate and context-dependent nature of language: subjective texts often rely on subtle linguistic markers to convey personal viewpoints. The key contributions of this work are:

• Introduced an LLM-based technique for classifying text into subjective and objective categories in Arabic and English.
• Investigated the task performance of various ML, DL, transformer, and LLM models to discover a reasonable solution for classifying Arabic and English news into subjective and objective categories.
2. Related Work

The rise of yellow journalism has made it more crucial than ever to determine whether an article is objective or subjective, and NLP can play a central role in this task. Antici et al. [7] provided annotation rules for identifying whether an article is subjective or objective, and these guidelines can be applied to other languages. Abdul-Mageed et al. [8] presented an Arabic dataset covering several types of news items for subjectivity and sentiment analysis. Dey et al. [9] proposed a transformer-based approach (XLM-RoBERTa-large) to identify subjectivity in news articles; their model recorded an F1 score of 0.81 on multilingual datasets. Pachov et al. [2] provided an ensemble technique for detecting subjectivity, which recorded an F1 score of 0.77. Shushkevich et al. [10] used AI-generated news from ChatGPT to balance the dataset, which improved the F1 score of mBERT by 3% in Italian and 9% in English. Tran et al. [11] suggested a back-translation method in conjunction with a transformer-based solution (RoBERTa, BERT), which achieved an F1 score of 0.69 in English. Frick et al. [12] proposed using ChatGPT (GPT-3.5) to detect subjectivity, obtaining F1 scores of 0.68 on the German and 0.73 on the English test datasets. Furthermore, ChatGPT can be applied in few-shot and zero-shot settings to detect subjectivity in news articles [13]. This work leverages LLMs to classify texts as objective or subjective.

3. Dataset and Task Description

The dataset used in this work includes two classes (SUBJ and OBJ) and features sentences in English and Arabic. Table 1 shows the distribution of the train, dev, dev-test, and test sets. All models were trained on the training set and evaluated on the test set.

Table 1
Dataset statistics for Task-2, where TW stands for total words and UW stands for unique words.

Language   Train   Dev   Dev-Test   Test   Total   TW      UW
English      830   219        243    484    1776   30821    5785
Arabic      1185   297        445    748    2675   53041   17477
Total       2015   516        688   1232    4451   83862   23262

The CLEF 2024 CheckThat! Lab [14, 15, 16] consists of six tasks [17, 18, 19, 20, 21]. We participated in Task-2 [18], which focuses on distinguishing whether a sentence from a news article expresses the subjective view of its author or instead presents an objective view on the covered topic. Table 2 shows an example of the training data for each language.

Table 2
Task-2 sample with the text and corresponding label.

4. System Overview

The ML techniques employed include logistic regression (LR), support vector machine (SVM), multinomial naive Bayes (MNB), k-nearest neighbors (KNN), and random forest (RF). The DL techniques involve CNN, CNN+LSTM, and CNN+BiLSTM. Finally, two LLMs are fine-tuned for each language to address the given task. Figure 1 illustrates the schematic process of subjectivity detection.

Figure 1: Schematic process of subjectivity detection in the best-performing model.

Textual Feature Extraction: Textual feature extraction is a crucial step in natural language processing that converts raw text into numerical representations a model can interpret and process. The ML models explored in this study employ a CountVectorizer, a popular feature-extraction method that converts text data into a matrix of token counts. In the DL models, tokenization and padding transform raw texts into structured numerical sequences, which are fed through an embedding layer that captures more sophisticated features such as semantic relationships. A minimal sketch of both pipelines is shown below.
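As a concrete illustration, the following sketch shows both feature-extraction paths under stated assumptions: scikit-learn's CountVectorizer for the ML models, and a Keras TextVectorization layer standing in for the tokenization-and-padding step of the DL models. The sample sentences and the sequence length of 50 are illustrative, not values reported in the paper.

```python
# Minimal sketch of both feature-extraction paths (illustrative sentences,
# not taken from the task data).
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.layers import TextVectorization

texts = ["The new policy is an utter disaster.",  # SUBJ-like
         "The bill passed with 54 votes."]        # OBJ-like

# ML path: bag-of-words token counts for the classical classifiers.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(texts)   # sparse (docs x vocab) count matrix

# DL path: integer token ids padded/truncated to a fixed length, which an
# embedding layer then maps to dense vectors. The length 50 is assumed.
seq_vec = TextVectorization(output_mode="int", output_sequence_length=50)
seq_vec.adapt(texts)
X_padded = seq_vec(texts)                    # (docs x 50) integer tensor
```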
ML Models: This study explores several ML models, including LR, SVM, MNB, KNN, and RF. The hyperparameter configurations for these models are detailed in Table 3, and a short training sketch follows the table.

Table 3
Parameters of the employed ML models.

Classifier   Parameter   Value
LR           solver      lbfgs
             max_iter    20000
MNB          alpha       1.0
             fit_prior   False
SVM          kernel      linear
             gamma       auto
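The sketch below shows how the Table 3 settings map onto scikit-learn classifiers. The toy texts and the 0/1 label encoding (1 = SUBJ, 0 = OBJ) are assumptions for illustration; KNN and RF are omitted because the paper does not report their parameters.

```python
# Instantiating the Table 3 classifiers with scikit-learn (toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

train_texts = ["An utter disgrace of a policy.", "The senate voted 54-46."]
train_labels = [1, 0]  # assumed encoding: 1 = SUBJ, 0 = OBJ

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Hyperparameters exactly as listed in Table 3.
models = {
    "LR":  LogisticRegression(solver="lbfgs", max_iter=20000),
    "MNB": MultinomialNB(alpha=1.0, fit_prior=False),
    "SVM": SVC(kernel="linear", gamma="auto"),
}

for name, clf in models.items():
    clf.fit(X_train, train_labels)
    pred = clf.predict(vectorizer.transform(["A truly shameful decision."]))
    print(name, pred)
```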
CNN: This study utilizes a CNN model that starts with an embedding layer with an output dimension of 200. The architecture includes two Conv1D layers with 64 and 128 filters, respectively, each using a kernel size of 2 and ReLU activation. A GlobalMaxPooling1D layer performs downsampling, followed by a dense layer with 128 units and ReLU activation and a dropout layer with a rate of 0.5 to mitigate overfitting. The final output layer consists of a single unit with sigmoid activation. The model is trained with the 'binary_crossentropy' loss function and the 'Nadam' optimizer, using a batch size of 32 over three epochs.

CNN+LSTM: The CNN+LSTM model shares the CNN model's architecture but adds an LSTM layer with 64 units and a 0.2 dropout rate for sequence modeling, and its dense layer has 64 units with ReLU activation. The remaining hyperparameters are the same as in the CNN model.

CNN+BiLSTM: This model mirrors the CNN+LSTM architecture but replaces the LSTM with a BiLSTM. A sketch of the base CNN architecture is shown below.
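A minimal Keras sketch of the CNN described above. The vocabulary size is an assumption, as the paper does not report it.

```python
# Keras sketch of the described CNN (assumed vocab_size; other values
# match the text: embedding dim 200, Conv1D 64/128 with kernel 2,
# global max pooling, Dense(128) + Dropout(0.5), sigmoid output).
from tensorflow.keras import layers, models

vocab_size = 20000  # assumed; not reported in the paper

model = models.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=200),
    layers.Conv1D(64, kernel_size=2, activation="relu"),
    layers.Conv1D(128, kernel_size=2, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
# Training as described: model.fit(X_padded, y_train, batch_size=32, epochs=3)
```

For the CNN+LSTM variant, one plausible arrangement inserts layers.LSTM(64, dropout=0.2) after the Conv1D blocks in place of the global pooling and shrinks the dense layer to 64 units; CNN+BiLSTM would wrap that layer in layers.Bidirectional. The paper does not specify the exact layer order, so this placement is an assumption.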
Transformer-based models: Three transformer-based models were fine-tuned on the English dataset and another three on the Arabic dataset. The English models were MdeBERTav3 [22], BERT-base-uncased [23], and RoBERTa [24]. Several text preprocessing steps were applied to reduce noise in the dataset and concentrate on meaningful words: lowercasing, emoji removal, stop-word removal (using the NLTK stopwords list), stemming, contraction expansion, simple Unicode-based spelling correction, and HTML tag elimination. MdeBERTav3 is a multilingual BERT-based architecture specifically designed for multilingual tasks; it demonstrates strong linguistic comprehension across languages and performed well in this study. BERT-base-uncased, another pre-trained transformer architecture, has previously shown exceptional performance on various natural language processing (NLP) tasks and delivered satisfactory results. Finally, RoBERTa, an optimized variant of BERT, was applied to the specified task and obtained comparable results. For Arabic, AraBERTv2 [25] was used in addition to MdeBERTav3 and RoBERTa. AraBERTv2 leverages its pretraining on Arabic data and has demonstrated its usefulness in various NLP tasks, such as sentiment analysis, named entity recognition, and question answering; it achieved notable results on the task at hand. Table 4 shows the learning rate (LR), weight decay (WD), warmup steps (WS), and epochs (EP) used for fine-tuning the transformer models; a fine-tuning sketch follows the table.

Table 4
Hyperparameters for the transformers.

Language   Model               LR     WD     WS    EP
English    MdeBERTav3          5e-5   0      200   10
           BERT-base-uncased   5e-5   0.01   200   10
           RoBERTa             5e-5   0.01   200   10
Arabic     MdeBERTav3          3e-5   0.30   500   4
           RoBERTa             3e-5   0.30   500   4
           AraBERTv2           3e-5   0.30   500   4
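The following is a hedged sketch of such fine-tuning with the Hugging Face Trainer, using the English RoBERTa settings from Table 4. The checkpoint name, the toy dataset, and the tokenization length are assumptions; the paper does not name the exact checkpoints or sequence lengths.

```python
# Fine-tuning sketch with the Table 4 English settings
# (lr 5e-5, weight decay 0.01, 200 warmup steps, 10 epochs).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-base"  # assumed checkpoint; the paper only says "RoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

# Toy stand-in for the Task-2 training split (illustrative only).
data = {"text": ["An utter disgrace of a policy.", "The senate voted 54-46."],
        "label": [1, 0]}  # assumed encoding: 1 = SUBJ, 0 = OBJ
train_ds = Dataset.from_dict(data).map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="subjectivity-roberta",
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=200,
    num_train_epochs=10,
)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```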
Mixtral-7b: In this study, Mixtral-7b [26] was fine-tuned for subjectivity and objectivity detection in both English and Arabic news articles. Mixtral-7b was chosen for its advanced capabilities in handling multilingual data; its ability to comprehend and process both English and Arabic texts was expected to help on the given task. The model was trained on the labeled datasets in both languages, enabling it to discern between subjective and objective content.

Llama-3-8b: Llama-3-8b [27], another multilingual large language model, is also used in this task. Llama-3-8b is a versatile and practical model for multilingual NLP tasks and demonstrated its potential on the complex subjectivity and objectivity detection problem in English and Arabic. Table 5 shows the learning rate (LR), weight decay (WD), warmup steps (WS), maximum sequence length (Max_len), LoRA alpha (LA), gradient accumulation steps (GAS), and epochs (EP) used for training the LLMs; a sketch of a plausible setup follows the table.

Table 5
Hyperparameters for the LLMs.

Language   Model        LR     WD     WS   Max_len   LA   GAS   EP
English    Mixtral-7b   5e-5   1e-3    5   50        16   4     12
           Llama-3-8b   5e-5   1e-3    5   50        16   4     12
Arabic     Mixtral-7b   6e-5   1e-3   10   50        16   4     10
           Llama-3-8b   5e-5   1e-3   10   50        16   4     10
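Table 5's LoRA alpha and gradient-accumulation values indicate parameter-efficient fine-tuning. The sketch below shows one plausible LoRA setup with the PEFT library on a sequence-classification head; the LoRA rank, the target modules, the checkpoint name, and the classification-head formulation are all assumptions, since the paper reports only the values in Table 5 (and access to the gated Llama-3 checkpoint is required).

```python
# One plausible LoRA fine-tuning setup for Llama-3-8b with the English
# Table 5 values (lr 5e-5, wd 1e-3, 5 warmup steps, max length 50,
# lora_alpha 16, grad. accumulation 4, 12 epochs).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=2, torch_dtype=torch.bfloat16)
model.config.pad_token_id = tokenizer.pad_token_id

# lora_alpha = 16 comes from Table 5; rank and target modules are assumed.
lora_cfg = LoraConfig(task_type="SEQ_CLS", lora_alpha=16, r=16,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# Training would mirror the transformer sketch above, with
# learning_rate=5e-5, weight_decay=1e-3, warmup_steps=5,
# gradient_accumulation_steps=4, num_train_epochs=12, and
# inputs tokenized to max_length=50.
```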
5. Results and Analysis

Table 6 presents an in-depth comparison of the machine learning (ML), deep learning (DL), transformer-based, and large language models on the English and Arabic test sets.

Table 6
Performance of the employed models on the test set. Llama-3-8b (the last row of each language block) is the best-performing model for both languages.

Language   Method         Classifier          Pr(%)   Re(%)   Ac(%)   Macro-F1(%)   SUBJ-F1(%)
English    ML Models      LR                  60.82   59.34   71.69   59.83         38.01
                          SVM                 56.11   56.43   66.12   56.23         35.43
                          MNB                 56.52   57.63   64.26   56.63         38.43
           DL Models      CNN+LSTM            37.40   50.00   74.79   42.79          0.00
                          CNN+BiLSTM          52.22   51.56   68.18   51.11         22.22
           Transformers   MdeBERTav3          70.45   65.11   77.89   66.65         86.01
                          BERT-base-uncased   69.68   66.47   77.48   67.63         85.49
                          RoBERTa             70.01   66.61   77.69   67.82         85.64
           LLMs           Mixtral-7b          37.40   50.00   74.79   42.79          0.00
                          Llama-3-8b          76.96   70.32   81.61   72.46         56.59
Arabic     ML Models      LR                  51.83   50.53   56.28   42.42         14.17
                          SVM                 49.73   49.87   54.81   44.53         20.66
                          MNB                 53.75   51.26   56.82   44.08         17.39
           DL Models      CNN+LSTM            28.41   50.00   56.82   36.23          0.00
                          CNN+BiLSTM          52.41   50.54   56.55   41.33         11.44
           Transformers   AraBERTv2           51.28   51.06   54.01   49.91         64.24
                          RoBERTa             28.41   50.00   56.82   36.23         72.46
                          MdeBERTav3          49.99   49.99   53.48   48.04         31.23
           LLMs           Mixtral-7b          49.17   49.20   50.80   49.07         39.67
                          Llama-3-8b          51.40   51.20   53.88   50.36         37.16

On the English test data, LR was the strongest ML model, with a precision of 60.82%, recall of 59.34%, accuracy of 71.69%, macro-F1 score of 59.83%, and SUBJ-F1 score of 38.01%. Among the DL models, CNN+BiLSTM performed best, with a precision of 52.22%, recall of 51.56%, accuracy of 68.18%, macro-F1 score of 51.11%, and SUBJ-F1 score of 22.22%. In the transformer category, RoBERTa outperformed the others, with a precision of 70.01%, recall of 66.61%, accuracy of 77.69%, macro-F1 score of 67.82%, and SUBJ-F1 score of 85.64%. Among the LLMs, Llama-3-8b achieved the best English results overall, with a precision of 76.96%, recall of 70.32%, accuracy of 81.61%, macro-F1 score of 72.46%, and SUBJ-F1 score of 56.59%. On the Arabic test data, Llama-3-8b again obtained the highest macro-F1 score (50.36%), ahead of the strongest transformer (AraBERTv2, 49.91%) and all ML and DL models. Overall, Llama-3-8b was the best-performing model in both languages, although Mixtral-7b trailed the transformer baselines.

5.1. Error Analysis

A quantitative and qualitative error analysis is conducted to provide detailed insights into the proposed model's performance.

Quantitative Analysis: Figure 2 shows the confusion matrices of Llama-3-8b for English and Arabic; in the discussion below, the majority OBJ class is treated as the positive class, which is consistent with the precision and recall values reported in Table 6.

Figure 2: Confusion matrix of Llama-3-8b for English and Arabic.

Out of 484 English test cases, the model detects the positive class well, with 337 True Positives against only 25 False Negatives, reflecting high precision and recall on objective sentences. It also correctly identifies 58 True Negatives, although the 64 False Positives show that roughly half of the subjective instances are misclassified as objective. Overall, the model keeps incorrect predictions on the majority class low, yielding a high F1 score for positive samples. In contrast, across the 748 Arabic test cases the model exhibits a different pattern: it identifies instances of both classes, with 301 True Positives and 102 True Negatives, but produces many False Positives (221) and False Negatives (124). While the model can detect positive instances, it misclassifies many negative instances as positive, lowering precision, and the large number of False Negatives also reduces recall, underlining the need for a sharper distinction between the two classes.

Qualitative Analysis: Table 7 presents some actual labels (AL) and predicted labels (PL) of the developed models.

Table 7
Few predictions with actual and predicted labels.

The models accurately predicted the labels for samples 1, 2, and 4, but made errors on samples 3 and 5. In the third sample, the sentence's intent is ambiguous, which led to an incorrect prediction. In the fifth sample, although the sentence is subjective, the model mislabels it, likely because large language models see comparatively little Arabic data during pre-training and the provided training dataset is small.

6. Conclusion

This study evaluated several techniques for detecting subjectivity in news articles, spanning ML, DL, transformer, and LLM-based models. The transformer-based solutions scored better than the ML and DL models, but Llama-3-8b outperformed them all, obtaining the highest macro-F1 scores of 72.46% and 50.36% in English and Arabic, respectively. The study demonstrates how effective LLMs are at identifying subjectivity in articles: even for a language with limited task resources, such as Arabic, the fine-tuned LLM outperformed the other models. Future improvements could come from GPT-4, Llama-3-70b, or other LLMs with larger parameter counts.

References

[1] R. El Baff, H. Wachsmuth, K. Al Khatib, B. Stein, Analyzing the persuasive effect of style in news editorial argumentation, Association for Computational Linguistics, 2020.
[2] G. Pachov, D. Dimitrov, I. Koychev, P. Nakov, Gpachov at CheckThat! 2023: A diverse multi-approach ensemble for subjectivity detection in news articles, arXiv preprint arXiv:2309.06844 (2023).
[3] E. V. Tunyan, T. Cao, C. Y. Ock, Improving subjective bias detection using bidirectional encoder representations from transformers and bidirectional long short-term memory, International Journal of Cognitive and Language Sciences 15 (2021) 329–333.
[4] H. Huo, M. Iwaihara, Utilizing BERT pretrained models with various fine-tune methods for subjectivity detection, in: Web and Big Data: 4th International Joint Conference, APWeb-WAIM 2020, Tianjin, China, September 18-20, 2020, Proceedings, Part II 4, Springer, 2020, pp. 270–284.
[5] I. B. Schlicht, L. Khellaf, D. Altiok, DWReCO at CheckThat! 2023: Enhancing subjectivity detection through style-based data sampling, arXiv preprint arXiv:2307.03550 (2023).
[6] H. T. Sadouk, F. Sebbak, H. E. Zekiri, ES-VRAI at CheckThat! 2023: Enhancing model performance for subjectivity detection through multilingual data aggregation (2023).
[7] F. Antici, A. Galassi, F. Ruggeri, K. Korre, A. Muti, A. Bardi, A. Fedotova, A. Barrón-Cedeño, A corpus for sentence-level subjectivity detection on English news articles, arXiv preprint arXiv:2305.18034 (2023).
[8] M. Abdul-Mageed, M. Diab, Subjectivity and sentiment annotation of modern standard Arabic newswire, in: Proceedings of the 5th Linguistic Annotation Workshop, 2011, pp. 110–118.
[9] K. Dey, P. Tarannum, M. A. Hasan, S. R. H. Noori, NN at CheckThat! 2023: Subjectivity in news articles classification with transformer-based models, in: CLEF (Working Notes), 2023, pp. 318–328.
[10] E. Shushkevich, J. Cardiff, TUDublin at CheckThat! 2023: ChatGPT for data augmentation, Working Notes of CLEF (2023).
[11] S. Tran, P. Rodrigues, B. Strauss, E. Williams, Accenture at CheckThat! 2023: Impacts of back-translation on subjectivity detection, Working Notes of CLEF (2023).
[12] R. A. Frick, Fraunhofer SIT at CheckThat! 2023: Can LLMs be used for data augmentation & few-shot classification? Detecting subjectivity in text using ChatGPT, Working Notes of CLEF (2023).
[13] M. D. Türkmen, G. Coşgun, M. Kutlu, TOBB ETU at CheckThat! 2023: Utilizing ChatGPT to detect subjective statements and political bias (2023).
[14] A. Barrón-Cedeño, F. Alam, J. M. Struß, P. Nakov, T. Chakraborty, T. Elsayed, P. Przybyła, T. Caselli, G. Da San Martino, F. Haouari, C. Li, J. Piskorski, F. Ruggeri, X. Song, R. Suwaileh, Overview of the CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities and adversarial robustness, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[15] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.
[16] G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF 2024, Grenoble, France, 2024.
[17] M. Hasanain, R. Suwaileh, S. Weering, C. Li, T. Caselli, W. Zaghouani, A. Barrón-Cedeño, P. Nakov, F. Alam, Overview of the CLEF-2024 CheckThat! lab task 1 on check-worthiness estimation of multigenre content, in: [16], 2024.
[18] J. M. Struß, F. Ruggeri, A. Barrón-Cedeño, F. Alam, D. Dimitrov, A. Galassi, G. Pachov, I. Koychev, P. Nakov, M. Siegel, M. Wiegand, M. Hasanain, R. Suwaileh, W. Zaghouani, Overview of the CLEF-2024 CheckThat! lab task 2 on subjectivity in news articles, in: [16], 2024.
[19] J. Piskorski, N. Stefanovitch, F. Alam, R. Campos, D. Dimitrov, A. Jorge, S. Pollak, N. Ribin, Z. Fijavž, M. Hasanain, N. Guimarães, A. F. Pacheco, E. Sartori, P. Silvano, A. V. Zwitter, I. Koychev, N. Yu, P. Nakov, G. Da San Martino, Overview of the CLEF-2024 CheckThat! lab task 3 on persuasion techniques, in: [16], 2024.
[20] F. Haouari, T. Elsayed, R. Suwaileh, Overview of the CLEF-2024 CheckThat! lab task 5 on rumor verification using evidence from authorities, in: [16], 2024.
[21] P. Przybyła, B. Wu, A. Shvets, Y. Mu, K. C. Sheang, X. Song, H. Saggion, Overview of the CLEF-2024 CheckThat! lab task 6 on robustness of credibility assessment with adversarial examples (InCrediblAE), in: [16], 2024.
[22] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, 2021. arXiv:2111.09543.
[23] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[24] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[25] W. Antoun, F. Baly, H. Hajj, AraBERT: Transformer-based model for Arabic language understanding, arXiv preprint arXiv:2003.00104 (2020).
[26] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. arXiv:2310.06825.
[27] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).