<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Advanced Prompting Strategies in Llama-8b for Enhanced Hyperpartisan News Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michele Joshua Maggini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Bran Marino</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Gamallo Otero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro Singular de Investigación en Tecnoloxías Intelixentes da USC</institution>
          ,
          <addr-line>Spain, Galicia, Santiago de Compostela, 15782</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidade de Évora</institution>
          ,
          <addr-line>Évora</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper explores advanced prompting strategies for hyperpartisan news detection using the Llama3-8b-Instruct model, an open-source LLM developed by Meta AI. We evaluate zero-shot, few-shot, and Chain-of-Thought (CoT) techniques on two datasets: SemEval-2019 Task 4 and a headline-specific corpus. Collaborating with a political science expert, we incorporate domain-specific knowledge and structured reasoning steps into our prompts, particularly for the CoT approach. Our findings reveal that some prompting strategies work better than others, specifically on Llama, depending on the dataset and the task. This unexpected result challenges assumptions about the efficacy of In-Context Learning (ICL) on classification tasks. We discuss the implications of these findings for ICL in political text analysis and suggest directions for future research in leveraging large language models for nuanced content classification tasks.</p>
      </abstract>
      <kwd-group>
<kwd>natural language processing</kwd>
        <kwd>large language models</kwd>
        <kwd>hyperpartisan detection</kwd>
        <kwd>disinformation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>For this reason, hyperpartisan news detection is closer to propaganda.</title>
        <p>The proliferation of hyperpartisan news content in digital media has become a significant challenge for modern societies, potentially undermining democratic processes and social cohesion. Hyperpartisan news consists of politically polarized content presented through the use of rhetorical bias. In the media landscape, news outlets disseminate information through proprietary websites and social networks. Each news outlet frames the narratives of the facts based on its political leaning, influencing the content with rhetorical biases, emotional purposes, and ideology, and reporting the facts while omitting parts [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. To improve the virality of the news, even mainstream journalists have adopted click-bait practices such as eye-catching titles [<xref ref-type="bibr" rid="ref3">3</xref>]. Furthermore, the news not only stands for one opinion but can have an underlying political background that manifests through a specific vocabulary or assumptions against the opposite political leaning [<xref ref-type="bibr" rid="ref4">4</xref>]. This type of news can radicalize voters because of its emotional language [<xref ref-type="bibr" rid="ref5">5</xref>]. When there is a massive usage of these techniques, we can consider the news extremely partisan toward a particular political leaning. Although hyperpartisan news can share traits with misinformation and disinformation, it cannot be classified within these domains because the intent is not deceptive. For this reason, hyperpartisan news detection is closer to propaganda detection.</p>
        <p>Recent advancements in large language models (LLMs) have opened new avenues for tackling complex NLP tasks, including detecting nuanced linguistic phenomena such as bias and partisanship. Among these models is Llama3 [<xref ref-type="bibr" rid="ref6">6</xref>], developed by Meta AI. This research makes use of the LLM recently released by Meta AI, Llama3-8b-Instruct, fine-tuned and optimized for dialogue/chat use cases, to explore its application to the detection of both hyperpartisan news headlines and articles. LLMs can be prompted with instructions to perform classification tasks; thus, we intend to use this open-source model. In our case, by prompting the model with instructions and context, we are in the In-Context Learning (ICL) domain, a learning approach different from fine-tuning that does not require updating the model's weights [<xref ref-type="bibr" rid="ref7">7</xref>]. The study aims to investigate the efficiency and compare the performance of the following ICL techniques: 0-shot with a general prompt and a specific prompt, few-shot with a different number of examples, and Multi-task Guided CoT. We investigate how carefully crafted prompts, designed with the help of a political science expert, can guide the model to identify subtle indicators of extreme political bias in news articles, leveraging the model's deep understanding of language and context. Our approach aims to overcome the limitations of traditional machine learning methods, which often struggle with the complex and evolving nature of partisan language. Furthermore, we can include definitions of the political phenomena of our interest in the prompt to further define the task and narrow the application domain. By focusing on ICL to provide context and background information, we seek to:</p>
        <p>• Develop a flexible and adaptable system that can identify hyperpartisan content across various topics and writing styles without the need for extensive retraining;</p>
        <p>• Reduce ambiguity and guide the model towards the desired outcome;</p>
        <p>• Minimize the influence of biases in the training data by incorporating diverse perspectives and examples.</p>
        <p>This research not only contributes to the field of automated content analysis but also aims to compare the efficiency of prompting techniques and to analyze whether LLMs are valuable tools for classification tasks via ICL.</p>
        <p>The structure of the paper is as follows. In section 2 we discuss the related literature; section 3 describes the experimental set-up we adopted and the methodology; section 4 covers the findings of our experiment, comparing them based on the method used, and highlights the limitations of our approach; section 5 reports the main findings and future research.</p>
        <p>The main contributions of the paper are the following:</p>
        <p>• We evaluated the state-of-the-art model Llama3-8b-Instruct on two benchmark datasets in the political domain;</p>
        <p>• We assessed how well the model performs under different inference approaches: zero-shot learning, few-shot learning, and Multi-task Guided Chain-of-Thought reasoning;</p>
        <p>• We introduced external in-domain knowledge in the prompt and segmented the reasoning steps in the CoT according to the difficulty of the microtasks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Hyperpartisan News and Political</title>
      </sec>
      <sec id="sec-2-2">
        <title>Leaning Detection</title>
        <sec id="sec-2-2-1">
          <title>Hyperpartisan news detection has overlapped with simi</title>
          <p>Hyperpartisan news detection has overlapped with similar tasks like fake news and political orientation detection.</p>
          <p>In this section, we report the main contributions in the field. Two main approaches were identified related to content analysis: topic-based and stylistic-based [<xref ref-type="bibr" rid="ref2 ref8 ref9">8, 2, 9</xref>]. In particular, by comparing which of these features contributed the most to making news hyperpartisan or fake, Potthast et al. [<xref ref-type="bibr" rid="ref2">2</xref>] found that stylistic traits differ between hyperpartisan and mainstream news, and that both extreme left-wing and right-wing articles show similar writing styles. Along the same research line, Sánchez-Junquera et al. [<xref ref-type="bibr" rid="ref9">9</xref>] applied masking techniques to distinguish the best methodology among these. They trained the model to focus separately on the writing style or the topics within the articles. This confirmed the relevance of the topic-based approach in distinguishing between hyperpartisan left- and right-wing articles, in line with the results of Potthast et al. [<xref ref-type="bibr" rid="ref2">2</xref>]. Building on these works, we choose to focus on controversial topics because, by definition, they are polarizing and often characterized by extreme language [<xref ref-type="bibr" rid="ref1">1</xref>]. We believe that by leveraging generative models, we can effectively address both the content and the style at the same time.</p>
          <p>In the literature, researchers have used different parts of the articles for the classification task: Lyu et al. [<xref ref-type="bibr" rid="ref1">1</xref>] focused on the titles; quotes in the body were investigated by Pérez-Almendros et al. [<xref ref-type="bibr" rid="ref10">10</xref>]; while others encompassed both titles and body content [<xref ref-type="bibr" rid="ref5">5, 11</xref>]. Other works focused on meta-information, such as the political leaning of the journalist [12], or the hyperlinks between different media ecosystems [13]. In our study, we focus on entire articles and headlines, to evaluate model performance across inputs of varying lengths.</p>
        </sec>
        <sec id="sec-2-2">
          <title>2.2. In-Context Learning</title>
          <p>Recently, generative models with billions of parameters have been released; they perform not only generative tasks but also more discriminative ones, such as named entity recognition, sentiment classification, or even unseen tasks [14]. Users directly interact with them using prompts, which are specific textual templates containing instructions written in natural language. Their structure varies depending on the model being used. Thus, by leveraging the instructions, even with different degrees of complexity, the model can perform a task without prior training on it [15]. While interacting with the model, we can distinguish between the following prompting techniques: zero-shot, few-shot, and guided CoT [16].</p>
          <p>ICL has emerged as a crucial technique in natural language processing, particularly with the advent of recent decoder-only LLMs. This field builds upon earlier work in transfer learning and few-shot learning [17], but focuses specifically on optimizing input prompts to elicit desired behaviors from language models. Early work in ICL primarily focused on manual prompt design. Wei et al. [16] demonstrated the effectiveness of CoT prompting, which encourages step-by-step reasoning in language models. Building on this, Kojima et al. [18] introduced the concept of zero-shot CoT prompting, further improving model performance on complex reasoning tasks without task-specific examples. More recent research has explored automated methods for prompt optimization.</p>
        </sec>
        <sec id="sec-2-2-1">
          <title>2.3. Prompt Optimization</title>
          <p>AutoPrompt [19] introduced a gradient-based approach to automatically generate prompts, while Prefix-Tuning [20] proposed a method to learn task-specific continuous prompts. Lester et al. [21] further developed this idea with their work on prompt tuning, demonstrating that in some cases tuning only soft prompts can be as effective as fine-tuning the entire model. Both Prefix-Tuning and prompt tuning are in fact fine-tuning techniques, as they imply retraining the model, even if only in a partial way. The development of zero-shot and few-shot prompting techniques has significantly expanded the capabilities of LLMs. Zero-shot prompting, as demonstrated by Brown et al. [17], allows models to perform tasks without any task-specific training examples, relying solely on the task description in the prompt. Few-shot prompting, on the other hand, provides a small number of examples in the prompt to guide the model's behavior. Raffel et al. [22] explored these approaches in their work on the T5 model, showing how different prompting strategies can affect model performance across various tasks. Furthermore, Lu et al. [23] investigated the impact of prompt format and example selection in few-shot learning, highlighting the importance of careful prompt design in maximizing model performance. These aspects reflect the critical role that well-crafted prompts play in unlocking the potential of large language models for tasks with limited or no task-specific training data.</p>
        </sec>
    </sec>
      <sec id="sec-2-3">
        <title>3.3. Prompt design</title>
        <p>Prompt Optimization</p>
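        <p>As a minimal sketch (not the authors' released code), this configuration can be reproduced with the Hugging Face transformers and bitsandbytes libraries; the model identifier and the chat-template call are our assumptions about the setup, while the 4-bit QLoRA-style loading, the temperature of 0.1, and the one-token output come from the paper.</p>
        <preformat>
# Sketch of the inference setup: Llama3-8b-Instruct in 4-bit (NF4, as in QLoRA),
# sampling with temperature 0.1 and a single generated token used as the label.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint name

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights, as reported
    bnb_4bit_quant_type="nf4",              # the NF4 data type QLoRA introduces
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=quant_config, device_map="auto"
)

def classify(prompt: str) -> str:
    """Generate exactly one token and read it back as the predicted label.

    Note: with max_new_tokens=1 the label must map to a single token
    (e.g. a short label word), mirroring the paper's max_tokens=1 setting.
    """
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(input_ids, max_new_tokens=1, do_sample=True, temperature=0.1)
    return tokenizer.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)
        </preformat>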
        <sec id="sec-2-3-1">
          <title>Earlier studies like Wei et al. [16], Jung et al. [26], Mishra</title>
          <p>Figure 1: Pipeline of the experiment (hyperpartisan classification prompting via zero-shot, few-shot, and Chain-of-Thought strategies).</p>
          <p>Earlier studies like Wei et al. [16], Jung et al. [26], and Mishra et al. [14] have demonstrated the effectiveness of using task-specific prompts. Therefore, following Edwards and Camacho-Collados [27] and Labrak et al. [7], we structured the prompts by concatenating the following elements: 1) an instruction detailing the task and describing the label; 2) the input argument, supplying essential information for the task; 3) the constraints on the output space, namely inserting special symbols as placeholders for the label, guiding the model during output generation (a sketch of this template is given below). To improve the coherence and specificity of the prompt, and the fine-grained reasoning in CoT for the political domain, we collaborated with a Ph.D. student in Political Science.</p>
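          <p>A hypothetical rendering of this three-part template follows; the exact wording of the instruction and of the label placeholder are illustrative, not the prompt used in the experiments.</p>
          <preformat>
# Sketch of the prompt structure: 1) instruction, 2) input argument,
# 3) output-space constraint with a placeholder guiding generation.
def build_prompt(text: str, definition: str = "") -> str:
    # 1) instruction detailing the task and describing the label
    instruction = (
        "Classify the following news text as hyperpartisan or neutral. "
        + definition  # the 0-shot Specific Prompt adds a political definition here
    )
    # 3) constraint on the output space: a quoted placeholder for the label
    constraint = 'Reply with the label only, between quotation marks: "___"'
    # 2) the input argument supplies the essential information for the task
    return f"{instruction}\n\nText: {text}\n\n{constraint}"
          </preformat>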
          <p>For this purpose, we designed the experimental pipeline depicted in Figure 1. We test different prompting strategies: zero-shot, few-shot with n examples (1, 2, 3, 5, 10), and a variant of guided CoT [<xref ref-type="bibr" rid="ref13">28</xref>], namely Multi-task Guided CoT. We compare the results obtained by prompting the model with instructions of different levels of complexity: general instructions and specialized instructions with more context provided.</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>3.4. Method</title>
          <p>To investigate the ability of LLMs on hyperpartisan detection, we audit Llama3-8b-Instruct by prompting it. In the n-shot configuration, we adopted the General Prompt along with examples and labels from the dataset. Examples of these prompts can be found in the Appendices.</p>
          <p>0-shot:</p>
          <p>• 0-shot General Prompt: in this setting, we provided the hyperpartisan article or headline as context and asked the model to classify the text with the correct label. With this configuration, we leverage the internal knowledge of the model to predict the answer, aware that it can suffer from political bias [<xref ref-type="bibr" rid="ref14">29</xref>].</p>
          <p>• 0-shot Specific Prompt: in this case, we provided the article or the headline as context. In the instruction, we introduced a political definition of the phenomenon analyzed and some knowledge regarding the biases in partisan texts, and asked the model to classify the text with the correct label. With this, we insert external knowledge and introduce a political definition to narrow the task and improve the output.</p>
          <p>Few-shot: in this circumstance, we evaluated the few-shot learning capabilities of LLMs across five k-shot settings, using the 0-shot General Prompt instruction: 1-shot, 2-shot, 3-shot, 5-shot, and 10-shot. In each setting, we sampled K examples from the dataset, balancing the classes (a minimal sampling sketch is given below). Additionally, when an odd number of examples was provided, the hyperpartisan class was more represented.</p>
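          <p>The sampling rule can be sketched as follows; the dataset fields and label names are assumptions for illustration.</p>
          <preformat>
# Sketch: K balanced demonstration examples, with the hyperpartisan class
# receiving the extra example when K is odd, as described above.
import random

def sample_shots(dataset, k, seed=0):
    rng = random.Random(seed)
    hyper = [ex for ex in dataset if ex["label"] == "hyperpartisan"]
    neutral = [ex for ex in dataset if ex["label"] == "neutral"]
    n_hyper = k // 2 + k % 2        # odd K: one more hyperpartisan example
    shots = rng.sample(hyper, n_hyper) + rng.sample(neutral, k // 2)
    rng.shuffle(shots)              # mix the classes within the prompt
    return shots
          </preformat>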
          <p>Multi-task Guided Chain-of-Thought: in this approach, we prompted the model to break down its reasoning process step by step before arriving at a final classification [<xref ref-type="bibr" rid="ref15">30</xref>]. Each step corresponds to a classification task. Previous works have treated hyperpartisan detection as a binary classification task [24, 12]. However, hyperpartisan detection can also be approached through methodologies that focus on distinct parts of the text [<xref ref-type="bibr" rid="ref16">31</xref>]. Thus, while we frame the macro-task as binary classification, our goal is to investigate whether the model could benefit from incorporating reasoning steps into its process. These reasoning steps align with various NLP tasks that have been used to tackle hyperpartisan detection. The subtasks we focused on include sentiment analysis [<xref ref-type="bibr" rid="ref17">32</xref>], rhetorical bias, framing bias [33], ideology detection [<xref ref-type="bibr" rid="ref2">2</xref>], and political positioning.</p>
          <p>By introducing complexity and dividing hyperpartisan detection into these related subtasks, we aim to enhance explainability, as the final output, namely the step-by-step generated explanation, is based on previously generated tokens. We provided the article or headline as context, along with instructions to analyze various aspects of the text, ranging from word-level features to meta-semantic reasoning, that could indicate partisan content. This method encourages the model to consider multiple factors and explicitly articulate its thought process, potentially leading to more robust and explainable classifications.</p>
          <p>By guiding the model through a structured reasoning path, we aim to mitigate hasty judgments and foster a more nuanced analysis of the content. This technique allows us to observe how the model weighs different textual elements in its decision-making process, that is, how it uses its existing internal knowledge [34], and it also provides the opportunity to identify any biases or limitations in the model's reasoning.</p>
          <p>To develop the step-by-step chain-of-thought (CoT) reasoning and the specific prompt, we collaborated with a third-year Ph.D. student in Political Science. We preliminarily tested various prompts and configurations to craft the one used in this experiment, which led to the best results. Notably, the prompt optimization process was manual rather than automated. A simplified illustration of such a structured prompt follows.</p>
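          <p>In this hypothetical sketch, only the subtasks themselves (sentiment analysis, rhetorical bias, framing bias, ideology detection, political positioning) come from the paper; the step wording is ours.</p>
          <preformat>
# Sketch of a Multi-task Guided CoT prompt: the model is walked through the
# subtasks before committing to the binary label.
COT_STEPS = [
    "Step 1 - Sentiment analysis: describe the overall sentiment of the text.",
    "Step 2 - Rhetorical bias: point out loaded or emotional wording.",
    "Step 3 - Framing bias: state how the issue is framed and what is omitted.",
    "Step 4 - Ideology detection: infer the ideological leaning, if any.",
    "Step 5 - Political positioning: relate the text to a political side.",
]

def build_cot_prompt(text: str) -> str:
    steps = "\n".join(COT_STEPS)
    return (
        "Analyze the following news text step by step.\n"
        + steps
        + '\nFinally, answer with "hyperpartisan" or "neutral".\n\n'
        + "Text: " + text
    )
          </preformat>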
        </sec>
      </sec>
    <sec id="sec-3">
      <title>4. Results and Discussions</title>
      <sec id="sec-3-1">
        <title>The results shown in Table 2ofer valuable insights into the performance of Llama3-8b-Instruct on the hyperpartisan classification task using various ICL techniques and few-shot learning approaches.</title>
        <p>Table 2 compares the model's performance using 0-shot techniques with General (G) and Specific (S) prompts, as well as few-shot and guided CoT prompting. On the Hyperpartisan news dataset [<xref ref-type="bibr" rid="ref1">1</xref>], 0-shot with the general prompt slightly outperforms the other techniques, achieving an accuracy of 0.756 and an F1 score of 0.758.</p>
        <p>The 0-shot with Specific prompts follows closely, with an
accuracy of 0.733 and an F1 score of 0.734. The CoT
approach shows a slight decrease in performance, with an
accuracy of 0.712 and an F1 score of 0.704. These findings
suggest that for the Hyperpartisan news dataset, simpler
prompting techniques may be more effective than more
complex ones like CoT. This could indicate that the model
already has a good grasp of the task without requiring
additional reasoning steps.</p>
        <p>With regard to the SemEval-2019 dataset [24], we observe low performance across all techniques, with the best results achieved by CoT (Acc: 0.647, F1: 0.696). This discrepancy between datasets highlights the importance of dataset characteristics in model performance.</p>
        <p>Table 2 also presents the results of the few-shot learning experiments, ranging from 1-shot to 10-shot. For the Hyperpartisan news dataset, we observe an unstable performance as the number of shots increases, with the best results achieved at 1-shot (Acc: 0.752, F1: 0.742). The performance increase is not linear, with some fluctuations observed, such as a slight increase at 3-shot. For the SemEval-2019 dataset, we see a general trend of decreasing performance as the number of shots increases, with the best results at 1-shot (Acc: 0.639, F1: 0.614).</p>
        <p>Taking this into account, with the Hyperpartisan news dataset the model does not always benefit from additional examples, suggesting that it can rarely leverage this information to improve its understanding of the task. Furthermore, additional examples and context do not improve the performance over the 0-shot (G) prompt configuration. Conversely, for SemEval-2019, the performance degradation with increased shots could indicate potential overfitting or confusion introduced by the additional examples.</p>
        <p>We hypothesize that the ineffectiveness of introducing external knowledge and additional context stems from the Llama3-8b-Instruct model's optimization for dialogue and instruction-following tasks. This specialization enables the model to excel in zero-shot scenarios. Consequently, the few-shot setting may introduce complexity that exceeds the model's current capabilities, potentially interfering with its performance rather than enhancing it.</p>
        <p>These findings underscore the complexity of ICL in the context of hyperpartisan classification. The results suggest that the optimal approach may vary depending on the specific dataset, the length of the input tokens, the complexity of the instructions, and the task characteristics.</p>
      <sec id="sec-3-1">
        <title>4.1. Limitations</title>
        <p>Outputs' inconsistency. We observed unexpected behaviors from the model despite providing clear instructions and a specific output template. The model generated extra text that was not requested in the instructions; we tackled this by specifying a placeholder for the label. Additionally, it misspelled output labels, deviating from the format specified in the prompt. These issues highlight the challenges of controlling language model outputs, even with explicit guidelines. When the output did not correspond to our instructions, we considered it misclassified.</p>
        <p>Order of examples. During the few-shot learning experiments, we noticed that model performance was sensitive to the order of the examples [35, 23]. This raises concerns about the stability and reproducibility of few-shot learning techniques with LLMs. To quantify this effect, we conducted controlled experiments with systematically permuted example orders (a minimal sketch of this check closes this subsection). Results revealed substantial fluctuations in performance metrics, with variations in accuracy and F1 scores exceeding 5-6% in some cases. This variability underscores the need for careful consideration of example selection and ordering in few-shot prompting strategies, highlighting a critical area for future research.</p>
        <p>Limited context window. Llama3-8b-Instruct has a context window of 8,192 tokens. This limitation prevented us from performing 10-shot learning with the SemEval-2019 dataset due to the length of the articles: the combined size of the articles and the necessary instructions exceeded the model's maximum context capacity.</p>
        <p>Quantized model. In this study, we exclusively employed 4-bit quantized models to optimize computational efficiency. While this approach significantly reduced memory requirements and inference time, we acknowledge its potential impact on model performance. Quantization, particularly at the 4-bit level, compresses the model's parameters, potentially resulting in a trade-off between efficiency and accuracy.</p>
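        <p>As a minimal sketch of the permutation check described under "Order of examples", assuming the classify and build_prompt helpers from the earlier sketches and examples stored as dictionaries with text and label fields:</p>
        <preformat>
# Present the same K shots in several permuted orders and measure the spread
# of the resulting accuracy on a small test set.
from itertools import permutations

def order_sensitivity(shots, test_set, max_orders=6):
    accuracies = []
    for order in list(permutations(shots))[:max_orders]:
        demos = "\n\n".join(
            f'Text: {ex["text"]}\nLabel: "{ex["label"]}"' for ex in order
        )
        correct = sum(
            classify(demos + "\n\n" + build_prompt(ex["text"])) == ex["label"]
            for ex in test_set
        )
        accuracies.append(correct / len(test_set))
    return max(accuracies) - min(accuracies)   # accuracy spread across orders
        </preformat>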
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <sec id="sec-4-1">
        <title>In this paper, we study the reliability of a SOTA model like</title>
      <p>In this paper, we studied the reliability of a SOTA model, Llama3-8b-Instruct, for classification tasks in the political domain, namely detecting hyperpartisan articles and headlines, comparing different prompting techniques. We cast the classification problem as a task for the generative capabilities of LLMs. The experimental results contradict the hypothesis that feeding the model with more context leads to better performance [16]. Indeed, in our case, the 0-shot approach was the most efficient. An interesting future direction would be building a new dataset of instructions to improve models' zero-shot capability [36] in identifying hyperpartisan news, inspired by datasets used for false information detection, such as Truthful-QA [37]. Indeed, this dataset could be used to fine-tune generative models to enhance their performance. Additionally, we plan to explore more sophisticated prompting techniques in zero-shot and few-shot settings, such as prompt tuning in the political domain [38]. Finally, we would like to investigate Retrieval-Augmented Generation (RAG) and implement neuro-symbolic strategies, to incorporate retrieved documents or knowledge bases into the process. By pursuing these research directions, we aim to develop more effective and reliable systems for detecting hyperpartisan news and promoting media literacy.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is supported by the EUHORIZON2021
European Union’s Horizon Europe research and
innovation programme (https://cordis.europa.eu/project/id/
101073351/es) the Marie Skłodowska-Curie Grant No.:
101073351. Views and opinions expressed are however
those of the author(s) only and do not necessarily reflect
those of the European Union or the European Research
Executive Agency (REA). Neither the European Union
nor the granting authority can be held responsible for
them. The authors have no relevant financial or
nonifnancial interests to disclose.
guistic features for hyperpartisan news detection, soners, ArXiv abs/2205.11916 (2022). URL: https:
in: Proceedings of the 13th International Workshop //api.semanticscholar.org/CorpusID:249017743.
on Semantic Evaluation, Association for Compu- [19] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace,
tational Linguistics, 2019, pp. 929–933. doi:https: S. Singh, AutoPrompt: Eliciting Knowledge from
//doi.org/10.18653/v1/S19-2158. Language Models with Automatically Generated
[11] D.-V. Nguyen, T. Dang, N. Nguyen, NLP@UIT at Prompts, in: B. Webber, T. Cohn, Y. He, Y. Liu
SemEval-2019 task 4: The paparazzo hyperparti- (Eds.), Proceedings of the 2020 Conference on
san news detector, in: Proceedings of the 13th Empirical Methods in Natural Language
ProcessInternational Workshop on Semantic Evaluation, ing (EMNLP), Association for Computational
LinAssociation for Computational Linguistics, 2019, pp. guistics, Online, 2020, pp. 4222–4235. URL: https:
971–975. doi:https://doi.org/10.18653/v1/ //aclanthology.org/2020.emnlp-main.346. doi:10.</p>
      <p>S19-2167. 18653/v1/2020.emnlp-main.346.
[12] K. M. Alzhrani, Political ideology detec- [20] X. L. Li, P. Liang, Prefix-tuning: Optimizing
contintion of news articles using deep neural net- uous prompts for generation, in: C. Zong, F. Xia,
works, Intelligent Automation &amp; Soft Comput- W. Li, R. Navigli (Eds.), Proceedings of the 59th
ing 33 (2022) 483–500. doi:https://doi.org/10. Annual Meeting of the Association for
Computa32604/iasc.2022.023914. tional Linguistics and the 11th International Joint
[13] A. Hrckova, R. Moro, I. Srba, M. Bielikova, Quan- Conference on Natural Language Processing
(Voltitative and qualitative analysis of linking pat- ume 1: Long Papers), Association for
Computaterns of mainstream and partisan online news me- tional Linguistics, Online, 2021, pp. 4582–4597. URL:
dia in central europe, Online Information Re- https://aclanthology.org/2021.acl-long.353. doi:10.
view 46 (2021) 954–973. doi:https://doi.org/ 18653/v1/2021.acl-long.353.
10.1108/OIR-10-2020-0441, publisher: Emer- [21] B. Lester, R. Al-Rfou, N. Constant, The power
ald Publishing Limited. of scale for parameter-eficient prompt tuning,
[14] S. Mishra, D. Khashabi, C. Baral, H. Hajishirzi, in: M.-F. Moens, X. Huang, L. Specia, S.
W.Cross-task generalization via natural language t. Yih (Eds.), Proceedings of the 2021
Concrowdsourcing instructions, in: S. Muresan, ference on Empirical Methods in Natural
LanP. Nakov, A. Villavicencio (Eds.), Proceedings guage Processing, Association for Computational
of the 60th Annual Meeting of the Association Linguistics, Online and Punta Cana,
Dominifor Computational Linguistics (Volume 1: Long can Republic, 2021, pp. 3045–3059. URL: https:
Papers), Association for Computational Linguis- //aclanthology.org/2021.emnlp-main.243. doi:10.
tics, Dublin, Ireland, 2022, pp. 3470–3487. URL: 18653/v1/2021.emnlp-main.243.
https://aclanthology.org/2022.acl-long.244. doi:10. [22] C. Rafel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
18653/v1/2022.acl-long.244. M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the
[15] A. Efrat, O. Levy, The turking test: Can limits of transfer learning with a unified
text-tolanguage models understand instructions?, text transformer, Journal of Machine Learning
ReArXiv abs/2010.11982 (2020). URL: https: search 21 (2020) 1–67. URL: http://jmlr.org/papers/
//api.semanticscholar.org/CorpusID:225062157. v21/20-074.html.
[16] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. [23] Y. Lu, M. Bartolo, A. Moore, S. Riedel, P. Stenetorp,
hsin Chi, F. Xia, Q. Le, D. Zhou, Chain of thought Fantastically ordered prompts and where to find
prompting elicits reasoning in large language mod- them: Overcoming few-shot prompt order
sensitivels, ArXiv abs/2201.11903 (2022). URL: https://api. ity, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.),
semanticscholar.org/CorpusID:246411621. Proceedings of the 60th Annual Meeting of the
As[17] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, sociation for Computational Linguistics (Volume 1:
J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, Long Papers), Association for Computational
LinG. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, guistics, Dublin, Ireland, 2022, pp. 8086–8098. URL:
G. Krueger, T. Henighan, R. Child, A. Ramesh, https://aclanthology.org/2022.acl-long.556. doi:10.
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, 18653/v1/2022.acl-long.556.
E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, [24] J. Kiesel, M. Mestre, R. Shukla, E. Vincent,
C. Berner, S. McCandlish, A. Radford, I. Sutskever, P. Adineh, D. Corney, B. Stein, M. Potthast,
D. Amodei, Language models are few-shot learn- Semeval-2019 task 4: Hyperpartisan news
detecers, ArXiv abs/2005.14165 (2020). URL: https://api. tion, in: International Workshop on Semantic
Evalsemanticscholar.org/CorpusID:218971783. uation, 2019. URL: https://api.semanticscholar.org/
[18] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwa- CorpusID:120224153.</p>
      <p>sawa, Large language models are zero-shot rea- [25] T. Dettmers, A. Pagnoni, A. Holtzman, L.
Zettle</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>Computational assessment of hyperpartisanship in news titles</article-title>
          ,
          <year>2023</year>
          . doi:https://doi.org/10.48550/arXiv. 2301.06270.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Reinartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>A stylometric inquiry into hyperpartisan and fake news</article-title>
          , in: I. Gurevych, Y. Miyao (Eds.),
          <source>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>231</fpage>
          -
          <lpage>240</lpage>
          . doi:https: //doi.org/10.18653/v1/
          <fpage>P18</fpage>
          -1022.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pierri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Artoni</surname>
          </string-name>
          , S. Ceri,
          <article-title>HoaxItaly: a collection of italian disinformation and fact-checking stories shared on twitter in</article-title>
          <year>2019</year>
          ,
          <year>2020</year>
          . doi:https://doi. org/10.48550/arXiv.
          <year>2001</year>
          .
          <volume>10926</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G. K. W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Hyperpartisan news and articles detection using BERT and ELMo</article-title>
          , in: 2019 International Conference on Computer and Drone Applications (IConDA), IEEE,
          <year>2019</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>32</lpage>
          . doi:https://doi.org/10. 1109/IConDA47345.
          <year>2019</year>
          .
          <volume>9034917</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N. R.</given-names>
            <surname>Naredla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. F.</given-names>
            <surname>Adedoyin</surname>
          </string-name>
          ,
          <article-title>Detection of hyperpartisan news articles using natural language processing technique</article-title>
          ,
          <source>International Journal of Information Management Data Insights</source>
          <volume>2</volume>
          (
          <year>2022</year>
          )
          <article-title>100064</article-title>
          . doi:https://doi.org/10.1016/ j.jjimei.
          <year>2022</year>
          .
          <volume>100064</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          , M.
          <article-title>-</article-title>
          <string-name>
            <surname>A. Lachaux</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lacroix</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Rozière</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Hambro</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Azhar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave, G. Lample,
          <article-title>Llama: Open and eficient foundation language models</article-title>
          ,
          <source>ArXiv abs/2302</source>
          .13971 (
          <year>2023</year>
          ). URL: https://api. semanticscholar.org/CorpusID:257219404.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Labrak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dufour</surname>
          </string-name>
          ,
          <article-title>A zero-shot and few-shot study of instruction-finetuned large language models applied to clinical and biomedical tasks</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            , M.-
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Hoste</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Sakti</surname>
          </string-name>
          , N. Xue (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics</source>
          ,
          <article-title>Language Resources and Evaluation (LREC-COLING 2024), ELRA</article-title>
          and
          <string-name>
            <given-names>ICCL</given-names>
            ,
            <surname>Torino</surname>
          </string-name>
          , Italia,
          <year>2024</year>
          , pp.
          <fpage>2049</fpage>
          -
          <lpage>2066</lpage>
          . URL: https: //aclanthology.org/
          <year>2024</year>
          .lrec-main.
          <volume>185</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wegsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Beauchamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>POLITICS</surname>
          </string-name>
          :
          <article-title>Pretraining with same-story article comparison for ideology prediction and stance detection</article-title>
          , in: M.
          <string-name>
            <surname>Carpuat</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-C. de Marnefe</surname>
            ,
            <given-names>I. V.</given-names>
          </string-name>
          <string-name>
            <surname>Meza Ruiz</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: NAACL</source>
          <year>2022</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          ,
          <year>2022</year>
          , pp.
          <fpage>1354</fpage>
          -
          <lpage>1374</lpage>
          . doi:https://doi.org/10. 18653/v1/
          <year>2022</year>
          .findings-naacl.
          <volume>101</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sánchez-Junquera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ponzetto</surname>
          </string-name>
          ,
          <article-title>Masking and transformerbased models for hyperpartisanship detection in news,</article-title>
          <year>2021</year>
          , pp.
          <fpage>1244</fpage>
          -
          <lpage>1251</lpage>
          . doi:
          <volume>10</volume>
          .26615/
          <fpage>978</fpage>
          -954-452-072-4_
          <fpage>140</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Pérez-Almendros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          , Cardif university at SemEval-2019 task
          <article-title>4: Linmoyer, Qlora: Eficient finetuning of quantized est model</article-title>
          , in: 2019 2nd international conference llms,
          <source>Advances in Neural Information Processing on intelligent communication and computational Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
          <source>techniques (ICCT)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Welleck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brahman</surname>
          </string-name>
          , C. Bhaga- [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Goldwasser</surname>
          </string-name>
          , Weakly supervised learnvatula, R. Le Bras,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Maieutic prompting: ing of nuanced frames for analyzing polarization Logically consistent reasoning with recursive ex- in news media</article-title>
          , in: B.
          <string-name>
            <surname>Webber</surname>
          </string-name>
          , T. Cohn, Y. He, planations, in: Y.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kozareva</surname>
          </string-name>
          , Y. Zhang Y. Liu (Eds.),
          <source>Proceedings of the 2020 Conference (Eds.)</source>
          ,
          <source>Proceedings of the 2022 Conference on Em- on Empirical Methods in Natural Language Propirical Methods in Natural Language Processing</source>
          , cessing (EMNLP),
          <article-title>Association for Computational Association for Computational Linguistics</article-title>
          , Abu Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>7698</fpage>
          -
          <lpage>7716</lpage>
          . URL: https: Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>1266</fpage>
          -
          <lpage>1279</lpage>
          . //aclanthology.org/
          <year>2020</year>
          .emnlp-main.
          <volume>620</volume>
          . doi:10. URL: https://aclanthology.org/
          <year>2022</year>
          .emnlp-main.
          <volume>82</volume>
          . 18653/v1/
          <year>2020</year>
          .emnlp-main.
          <volume>620</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .emnlp-main.
          <volume>82</volume>
          . [34]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          , X. Han,
          <string-name>
            <surname>K</surname>
          </string-name>
          . Zeng,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Edwards</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          , Language mod- W. Guanglu,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Learning or selfels for text classification: Is in-context learning aligning? rethinking instruction fine-tuning</article-title>
          , in: enough?, in: N.
          <string-name>
            <surname>Calzolari</surname>
            , M.-
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Hoste</surname>
            ,
            <given-names>L.-W.</given-names>
          </string-name>
          <string-name>
            <surname>Ku</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Martins</surname>
            , V. Srikumar (Eds.), ProceedA. Lenci,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Sakti</surname>
          </string-name>
          , N. Xue (Eds.),
          <source>Proceedings of ings of the 62nd Annual Meeting of the Associathe 2024 Joint International Conference on Com- tion for Computational Linguistics</source>
          (
          <article-title>Volume 1: Long putational Linguistics, Language Resources</article-title>
          and Papers),
          <article-title>Association for Computational LinguisEvaluation (LREC-COLING 2024), ELRA and ICCL, tics</article-title>
          , Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>6090</fpage>
          -
          <lpage>6105</lpage>
          . URL: Torino, Italia,
          <year>2024</year>
          , pp.
          <fpage>10058</fpage>
          -
          <lpage>10072</lpage>
          . URL: https: https://aclanthology.org/
          <year>2024</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>330</volume>
          . doi:10. //aclanthology.org/
          <year>2024</year>
          .lrec-main.
          <volume>879</volume>
          . 18653/v1/
          <year>2024</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>330</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Barut</surname>
          </string-name>
          , K.-W. [35]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chang,</surname>
          </string-name>
          <article-title>Can small language models help large Calibrate before use: Improving few-shot perforlanguage models reason better?: LM-guided chain- mance of language models</article-title>
          , in: M.
          <string-name>
            <surname>Meila</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          of-thought, in: N.
          <string-name>
            <surname>Calzolari</surname>
            , M.-
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Hoste</surname>
          </string-name>
          , (Eds.),
          <source>Proceedings of the 38th International ConferA</source>
          . Lenci,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sakti</surname>
          </string-name>
          , N. Xue (Eds.),
          <source>Proceedings of ence on Machine Learning</source>
          , volume
          <volume>139</volume>
          of Proceedthe 2024 Joint International Conference on Com- ings
          <source>of Machine Learning Research, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <source>putational Linguistics</source>
          ,
          <source>Language Resources and 12697-12706</source>
          . URL: https://proceedings.mlr.press/ Evaluation (LREC-COLING
          <year>2024</year>
          ),
          <article-title>ELRA and ICCL, v139/zhao21c</article-title>
          .html. Torino, Italia,
          <year>2024</year>
          , pp.
          <fpage>2835</fpage>
          -
          <lpage>2843</lpage>
          . URL: https: [36]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Yu</surname>
          </string-name>
          , //aclanthology.org/
          <year>2024</year>
          .lrec-main.252. B.
          <string-name>
            <surname>Lester</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>Q. V.</given-names>
          </string-name>
          <string-name>
            <surname>Le</surname>
          </string-name>
          , Fine-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [29]
          Y. Bang, D. Chen, N. Lee, P. Fung,
          <article-title>Measuring political bias in large language models: What is said and how it is said</article-title>,
          <source>ArXiv abs/2403.18932</source> (<year>2024</year>).
          URL: https://api.semanticscholar.org/CorpusID:268732713.
          [36]
          J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le,
          <article-title>Finetuned language models are zero-shot learners</article-title>,
          <source>ArXiv abs/2109.01652</source> (<year>2021</year>).
          URL: https://api.semanticscholar.org/CorpusID:237416585.
          [37]
          O. Tafjord, B. Dalvi, P. Clark,
          <article-title>Entailer: Answering questions with faithful and truthful chains of reasoning</article-title>,
          in: <source>Conference on Empirical Methods in Natural Language Processing</source>,
          <year>2022</year>.
          URL: https://api.semanticscholar.org/CorpusID:253097865.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [30]
          Y. Yang, J. Kim, Y. Kim, N. Ho, J. Thorne, S.-Y. Yun,
          <article-title>HARE: Explainable hate speech detection with step-by-step reasoning</article-title>,
          in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2023</source>,
          Association for Computational Linguistics, Singapore,
          <year>2023</year>, pp. <fpage>5490</fpage>-<lpage>5505</lpage>.
          URL: https://aclanthology.org/2023.findings-emnlp.365.
          doi:10.18653/v1/2023.findings-emnlp.365.
          [38]
          K.-M. Kim, M. Lee, H.-S. Won, M.-J. Kim, Y. Kim, S. Lee,
          <article-title>Multi-stage prompt tuning for political perspective detection in low-resource settings</article-title>,
          <source>Applied Sciences</source> <volume>13</volume> (<year>2023</year>) 6252.
          doi:https://doi.org/10.3390/app13106252.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [31]
          M. J. Maggini, D. Bassi, P. Piot, G. Dias, P. Gamallo Otero,
          <article-title>A systematic review of hyperpartisan news detection: A comprehensive framework for definition, detection, and evaluation</article-title>,
          <year>2024</year>.
          doi:https://doi.org/10.21203/rs.3.rs-3893574/v1.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [32]
          M. Hitesh, V. Vaibhav, Y. A. Kalki, S. H. Kamtam, S. Kumari,
          <article-title>Real-time sentiment analysis of 2019 General election tweets using word2vec and random forest</article-title>.
        </mixed-citation>
      </ref>
    </ref-list>
    <app-group>
      <app id="app-1">
        <title>6. Appendices</title>
        <sec id="sec-6-1">
          <title>Prompt templates</title>
          <p>In this section we show the prompts used in the different tasks.</p>
          <p>General prompt: System message: "role": "system", "content": "You have been provided with an instruction</p>
        </sec>
      </app>
    </app-group>
  </back>
</article>