<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">On the Limitations of Zero-Shot Classification of Causal Relations by LLMs (Work in Progress)</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Vani</forename><surname>Kanjirangat</surname></persName>
							<email>vani.kanjirangat@idsia.ch</email>
							<affiliation key="aff0">
								<orgName type="department">Istituto Dalle Molle di Studi sull&apos;Intelligenza Artificiale (IDSIA)</orgName>
								<orgName type="institution">USI-SUPSI</orgName>
								<address>
									<settlement>Lugano</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alessandro</forename><surname>Antonucci</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Istituto Dalle Molle di Studi sull&apos;Intelligenza Artificiale (IDSIA)</orgName>
								<orgName type="institution">USI-SUPSI</orgName>
								<address>
									<settlement>Lugano</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marco</forename><surname>Zaffalon</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Istituto Dalle Molle di Studi sull&apos;Intelligenza Artificiale (IDSIA)</orgName>
								<orgName type="institution">USI-SUPSI</orgName>
								<address>
									<settlement>Lugano</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">On the Limitations of Zero-Shot Classification of Causal Relations by LLMs (Work in Progress)</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">F653647F1CE1B2553F148656ADB326FF</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:31+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large language models</term>
					<term>zero-shot classification</term>
					<term>few-shot classification</term>
					<term>causal inference</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We aim to explore and analyze the capabilities and limitations of large language models in understanding and distinguishing causal sentences under a zero-shot setting. We experiment on a multi-class dataset of direct causal, conditional causal, and correlational sentences. In the experiments, the GPT and Falcon models are validated against a fine-tuned BERT model under different settings to explore zero-shot capabilities in causality detection. Zero-shot approaches exhibit good performance in other classification tasks, such as sentiment analysis or question answering. Yet, for this task, the fine-tuned approach seems superior, and the situation does not change if language cues are added or a few-shot setting is considered. This is a preliminary analysis of a work in progress. Still, the results suggest that identifying causal relations is a particularly challenging task that is hard to address in a zero-shot setup.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The adoption of large language models (LLMs) <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> is rapidly growing, primarily because of the zero-shot capabilities exhibited by these tools in a wide range of natural language processing tasks, such as sentiment analysis or recommendations, and knowledge-intensive tasks, such as question answering and domain-specific entity recognition <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>. Despite such popularity, it is essential to understand the limitations of these models and to address questions such as: where do they fall short? What are the causes of such failures? How can we improve their performance beyond prompt engineering <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref>?</p><p>This paper is a preliminary report on our work (in progress) on evaluating the potential of state-of-the-art LLMs in the field of causal inference. More specifically, we investigate the performance of LLMs in a classification task with sentences possibly involving causal relations. Our analysis focuses on the zero- and few-shot capabilities of LLMs compared against a fine-tuning setting with encoder-based BERT models, which are nowadays the most common choice for classification tasks <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. Our tests show some limitations of LLM approaches in the causal domain, both in zero-shot and few-shot setups. Notably, the situation remains the same even if language cues are provided. Such negative results are in line with some recent works presenting LLMs as causal parrots <ref type="bibr" target="#b13">[14]</ref>, not yet capable of genuine causal reasoning <ref type="bibr" target="#b14">[15]</ref>, beyond just distinguishing between causes and effects <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Recently, a plethora of research has been going on in the direction of exploiting the zero-shot and few-shot capabilities of LLMs. Because of the vast amount of pre-training data they have been exposed to, large (&gt;10B parameters) language models are considered to have an inherent ability to generalise across unseen tasks <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20]</ref>. For instance, the number of parameters of the recent GPT-3 and GPT-4 models is about, respectively, 175B and 1.76T. Zero-shot and few-shot techniques have been tried with different prompting strategies (e.g., chain of thought) for both classification and generation tasks. In many knowledge-intensive tasks (e.g., question answering), translations, classification tasks (e.g., sentiment analysis) and recommendations, those approaches seem compelling, provided that an adequate prompt-engineering effort is made <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b23">24]</ref>. These techniques may be inaccurate for many other tasks, especially when the complexity increases, such as in multi-task classification and hard sequence-labelling tasks, particularly in domain-specific problems <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b7">8]</ref>. Researchers have come up with soft-prompting and parameter-efficient fine-tuning (PEFT) <ref type="bibr" target="#b25">[26]</ref> approaches, such as P-tuning <ref type="bibr" target="#b26">[27]</ref>, prompt-tuning <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b28">29]</ref> and variations of prompt infusion, to overcome these problems while trying to match fine-tuning-based performance. 
The causal reasoning ability of LLMs was initially investigated in <ref type="bibr" target="#b29">[30]</ref>. The authors observe good performance on a pairwise causal discovery task, a counterfactual reasoning task and actual causality, conducting experiments on datasets of cause-effect pairs. A critical review of causal inference and reasoning with LLMs on benchmark datasets is reported in <ref type="bibr" target="#b30">[31]</ref>. The authors specify the requirements of causal datasets and the problems of evaluations with LLMs, such as memorisation (the dataset could be part of the LLM pre-training data). They also indicate that LLMs can answer many datasets by simply computing similarities between options and questions in a vector space. Further, they indicate that the good performance of LLMs can sometimes be due to spurious language cues in the datasets. In the rest of the paper, we explore the capability of LLMs with simple prompt-based approaches in identifying causality on a multi-class dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Dataset Settings</head><p>For our analysis, we focus on the dataset from <ref type="bibr" target="#b31">[32]</ref>, developed to automate the identification of causal language use in the scientific literature. The data source was a collection of PubMed<ref type="foot" target="#foot_0">1</ref> abstracts on five main health topics: nutrition, diabetes, obesity, breast cancer, and cholesterol. Two domain experts were asked to annotate the sentences manually. A good agreement (Cohen's kappa = 0.98) was reported. The original dataset refers to a multi-class setup with four options: correlational, direct causal, conditional causal, and one without any relations <ref type="bibr" target="#b32">[33]</ref>. The entities possibly involved in the causal and correlational relations are not provided. Thus, the identification depends on the specific language patterns used in the input sentences. In the correlational case, the sentence describes some association between variables. With direct causal sentences, the cause and effect are directly mentioned, while in the conditional case, the description of the relation carries an element of doubt. Finally, there are sentences with neither causation nor correlation.</p><p>We use the original dataset both in the native multi-class setting and in a binary classification task. For the binary class, we drop the correlational sentences and combine the direct and conditional causal sentences, thus having only two classes, one with no relations and the other with causal relations. This is intended to allow for a focus on causal relation discrimination. The multi-class dataset includes 1356 no-relation, 494 direct causal, 213 conditional and 998 correlational cases, which makes up 3061 cases. In the binary setting, we have 1356 no-relation cases and 707 cases of causal relations (combining the direct and conditional cases), with 2063 cases overall.</p></div>
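The binary-class conversion described above can be sketched as follows. This is a minimal illustration; the label strings are assumptions for the sketch, not the identifiers used in the released dataset files.

```python
# Collapse the multi-class labels into the binary setting: drop
# correlational sentences, merge direct and conditional causal.
CAUSAL = {"direct_causal", "conditional_causal"}

def to_binary(label: str):
    """Return the binary label, or None for dropped (correlational) cases."""
    if label == "correlational":
        return None
    return "causal" if label in CAUSAL else "no_relation"

# Class counts reported in the text: 1356 / 494 / 213 / 998 (3061 total).
multi = (["no_relation"] * 1356 + ["direct_causal"] * 494
         + ["conditional_causal"] * 213 + ["correlational"] * 998)
binary = [b for lbl in multi if (b := to_binary(lbl)) is not None]
# binary now holds 1356 no-relation and 707 causal cases, 2063 overall.
```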
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Methodology</head><p>To test the capability of LLMs in classifying causal and non-causal sentences under zero-shot settings, we initially design suitable prompts to tackle the task. We use both binary and multi-class settings, with the prompt including the text input in Fig. <ref type="figure" target="#fig_1">1</ref> and some variations. For the binary setting, we just need to change the classes in the prompt. Following the indications from <ref type="bibr" target="#b31">[32]</ref> and the findings from <ref type="bibr" target="#b30">[31]</ref>, we create another prompt including language cues intended to help the LLM provide more accurate classifications. In the fine-tuning approach, we assume the model automatically captures these patterns from the training data. In a zero-shot setting, in the absence of such training information, we want to see the impact on model performance when some explicit domain knowledge is available. We added the following cues: association, associated with, and predictor for the correlational class; increase, decrease, lead to, effective in, contribute to, and reduce for the causal class; and may, might, appear to, and probably for the conditional causal class. These cues were then added to the zero-shot prompt (ZS-Cues). Further, we tried them in a few-shot setup (FS-Cues) with some examples from each class (e.g., two samples per class). Finally, we also consider a 500-shot experiment with labelled samples, used also to train the BERT model under the same settings (500 samples for training).</p></div>
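The prompt construction above, including the optional cue lists, can be sketched as below. The wording paraphrases Fig. 1 and the cue lists from the text; the exact phrasing used in the experiments may differ, and the function name is ours.

```python
# Cue lists from the text, one per class (zero-shot with cues: ZS-Cues).
CUES = {
    "correlational": ["association", "associated with", "predictor"],
    "causal": ["increase", "decrease", "lead to", "effective in",
               "contribute to", "reduce"],
    "conditional causal": ["may", "might", "appear to", "probably"],
}

LABELS = "0. no relation 1. direct causal 2. conditional causal 3. correlational"

def build_prompt(text: str, with_cues: bool = False) -> str:
    """Assemble a zero-shot classification prompt in the style of Fig. 1."""
    parts = [f"Text: <Text>{text}</Text>",
             f"Based only on the information in the text, categorize the "
             f"causal relation as {LABELS}."]
    if with_cues:
        for cls, cues in CUES.items():
            parts.append(f"Cues for {cls}: {', '.join(cues)}.")
    parts.append("Provide your final answer within the tags "
                 "<Answer>[answer]</Answer>.")
    return "\n".join(parts)
```

For the binary setting, only the label list in `LABELS` would change, as noted in the text.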
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Discussion</head><p>For the fine-tuning approach, we use the bert-base-cased model <ref type="bibr" target="#b33">[34]</ref> in both binary and multi-class settings with k-fold cross-validation. We use SimpleTransformers<ref type="foot" target="#foot_1">2</ref> with four epochs and a learning rate of 2E-5. We experiment with GPT (3.5 Turbo) and the open-source Falcon models (falcon-7b-instruct<ref type="foot" target="#foot_2">3</ref> and falcon-40b-instruct) in zero-shot settings. Falcon-7b-instruct is a 7B-parameter causal decoder-only model fine-tuned on a mixture of chats and instructions, while falcon-40b-instruct is a bigger model with 40B parameters. From Tab. 1, it can be observed that Falcon-7b and 40b give inferior performance compared to GPT. Comparing the two Falcon models, the 40b outperformed the 7b model in both the multi-class and binary settings. This expected result motivates us to focus our subsequent experiments on GPT models only. In these further experiments, we use GPT to analyse the performance under different prompt settings and compare it with fine-tuned BERT-based models. Tab. 2 shows that, in both settings, the performance of GPT under zero-shot settings is poor and the fine-tuned BERT model performs better. In the multi-class case, many conditional causal relations are misclassified as direct causal. Yet, the accuracy does not improve significantly in the binary setting. The addition of cues (ZS-Cues) improves the performance, showing the importance of specific patterns that help in classification, especially in multi-class settings. In both cases, ZS-Cues performed better than FS-Cues. This could be because the sentences in this dataset are quite varied (extracted from the scientific literature) and we cannot assume that the sentences selected for the few-shot experiments are the best representatives of their classes. Tab. 3 reports more details on the zero-shot results. 
The relatively high recall values for causal sentences denote a good ability of the model to detect direct causal relations. Yet, the same does not happen with conditional causal relations, which are typically misclassified as direct ones. This also explains the higher performance in the binary-class case. For a deeper comparison against the FFT model, we prompt the GPT model with more examples. An option would be fine-tuning the GPT model, but we keep this for a future study, as here the focus is on prompting approaches. Further, there are restrictions on the number of prompt tokens processed by the GPT 3.5 model. As a reasonable prompting solution, we use 500 samples (corresponding to a 1:4 train-test split). The same samples are used to train BERT under multi-class settings. This proportion makes the BERT performance comparable with the one obtained with k-fold cross-validation (F1=0.81), while a drastic drop is obtained with a 1:9 split (F1=0.36). For GPT, this setup requires a slight change in the prompt (Fig. <ref type="figure" target="#fig_1">1</ref>), to include a list of input texts and return the corresponding predictions as a list. We then split the remaining 2449 test samples into chunks of ten samples each, to be passed to the prompt. These steps are intended to optimise prompt efficiency in terms of cost and time. The results are in Tab. 4. It can be observed that, with 500 samples, the performance of the GPT model was better than its zero-shot counterpart, but is not comparable with the BERT model fine-tuned with the same 500 training samples. This seems to confirm, in the causal domain, the general findings discussed in <ref type="bibr" target="#b34">[35]</ref>. At the same time, it is also notable that simply adding pattern information, as in ZS-Cues and FS-Cues, makes the LLM performance better than the 500-shot model.</p><p>Moreover, in some cases, GPT gives predictions not explicitly mentioned in the prompts. 
For instance, the prediction was multi-label (neither of the labels included the true prediction) or the prediction was assigned to a new class (not described in the prompt). For the evaluations, we had to remove such samples. In the end, we had 3055 instances in the zero-shot multi-class setting and 2016 instances in the binary setting. Some explicit results from the zero-shot experiments are in Figs. 2, 3 and 4. Tab. 5 reports examples of misclassification from the zero-shot prompting of GPT. Some of these instances are hard to classify even for human experts. GPT almost always classifies a negative causality as a no-relation case (I2, I3, and I4). Some other misclassifications, such as I8, predicted as causal while conditional in practice, are quite evident. Similar patterns are observed in correlational cases (I10). It can also be observed that the addition of cues helped in the correct classification of some instances. The correct predictions are in bold. E.g., I7 and I9 show that the cue may could have helped in the correct classification as conditional causal, and for I12 the cue associated with.</p><p>text: However, obesity seems to be associated with more wound complications.</p><p>The text states that "obesity seems to be associated with more wound complications."</p><p>This suggests a potential causal relationship between obesity and wound complications. To determine the type of causal relationship, we need to analyze the information provided.</p><p>1. Direct Causal Relation: This type of causal relation implies that one event directly causes another event. In this case, if obesity directly causes wound complications, we would expect to see a clear cause-effect relationship between the two.</p><p>2. Conditional Causal Relation: This type of causal relation implies that one event causes another event under certain conditions or circumstances. 
In this case, if obesity causes wound complications only under specific conditions, we would need more information to determine the specific conditions.</p><p>3. Correlational Relation: This type of relation implies that two events are statistically related or occur together, but one does not necessarily cause the other. In this case, if obesity and wound complications are correlated but not causally related, we would expect to see a statistical association between the two.</p><p>Based on the information provided, it is not clear whether the relationship between obesity and wound complications is a direct causal relation, a conditional causal relation, or a correlational relation.</p><p>The text only states that obesity is associated with more wound complications, but it does not provide information on the specific nature of the relationship or any conditions under which the relationship occurs.</p><p>Therefore, the appropriate categorization for the causal relationship in this text is: &lt;Answer&gt;3&lt;/Answer&gt; (correlational) text: This study is one of the first proofs of concept showing that a nutrition-sensitive agriculture value chain approach can contribute to improved child nutrition in a remote pastoralist population.</p><p>To analyze the causal relationship in the provided text, we need to identify the entities involved and understand their interactions.</p><p>Entities: 1. Study 2. Nutrition-sensitive agriculture value chain approach 3. Improved child nutrition 4. Remote pastoralist population Based on the information in the text, let's analyze the causal relationship: 1. The study is one of the first proofs of concept. 2. The study shows that a nutrition-sensitive agriculture value chain approach can contribute to improved child nutrition. 3. 
The improved child nutrition is in a remote pastoralist population.</p><p>From the above analysis, we can conclude that there is a direct causal relationship between the nutrition-sensitive agriculture value chain approach and improved child nutrition in a remote pastoralist population. Therefore, the causal relationship in the text can be categorized as a direct causal relation. &lt;Answer&gt;1&lt;/Answer&gt; The provided text states that there are no indications that endogenous and exogenous gonadal hormones affect the radiation dose-response relationship. Based on this information, we can conclude that there is no clear causal relationship between endogenous and exogenous gonadal hormones and the radiation dose-response relationship.</p><p>Therefore, the answer is &lt;Answer&gt;0&lt;/Answer&gt; (no relation). </p></div>
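The batching step in the 500-shot evaluation (splitting the 2449 remaining test samples into chunks of ten sentences per prompt) can be sketched as follows; the helper name is ours.

```python
def chunked(samples, size=10):
    """Yield consecutive fixed-size chunks of a sample list, as used to
    pack ten test sentences into each GPT prompt."""
    for i in range(0, len(samples), size):
        yield samples[i:i + size]

# 2449 test sentences (placeholders here) -> prompt batches of ten.
test_samples = [f"sentence {i}" for i in range(2449)]
batches = list(chunked(test_samples))
# The final batch is smaller, holding the 9 leftover sentences.
```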
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Limitations</head><p>Finally, as our paper presents the results of a work in progress, let us discuss the limitations of the present work and the possible enhancements we might consider for the ongoing work.</p><p>We have used only one dataset; hence, whether our findings remain valid in the general case might be questionable. The dataset facilitates understanding how well LLMs identify the causal descriptions embedded in the scientific literature under a more challenging multi-class setting, including correlational and causal relations. Distinguishing between direct and conditional causation is especially difficult. To the best of our knowledge, there are no datasets with analogous characteristics, at least for multi-class settings. Yet, manually annotating scientific abstracts and creating new benchmarks for deeper validation is a realistic and necessary effort. Moreover, in the current paper, we have focused on the GPT model and compared it with the open-source Falcon and BERT-based models. This can be enhanced by comparing with different LLMs. Further, the focus was on prompt-based techniques, which leave a broad scope for exploration. Based on the findings from <ref type="bibr" target="#b30">[31]</ref>, we investigated techniques such as incorporating language cues while prompting LLMs. One major problem is that LLMs can be sensitive to manually engineered prompt designs; hence, automating prompts and using soft-prompt techniques would be the way forward.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions and Outlooks</head><p>This work is a preliminary exploration to understand the capabilities and limitations of the GPT model in causality identification, specifically in multi-class settings. The experiments show that GPT has limited zero-shot and few-shot capabilities in capturing such causal relations, subject to the data under consideration. Focusing on the limitations, in the future, we would like to extend our experiments to a range of causal datasets to draw conclusive generalisations about the studied facts. Prompt engineering as such has a lot of potential to be explored, while hard-core engineering of prompts may not always be beneficial. Hence, we also plan to explore PEFT techniques, such as soft prompting, for causal detection and the further extraction of causal graphs.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>system_msg = You are a helpful assistant for causal reasoning and cause-and-effect relationship discovery. Your aim is to identify the entities and to categorize the input sentences into either direct causal relation or conditional causal relation or correlational relation or no relationship exist intro_msg = You will be provided with a text. Text: &lt;Text&gt;{text}&lt;/Text&gt; instructions_msg = Please read the provided text carefully to comprehend the context and content. Examine the roles, interactions, and details surrounding the entities within the text. Based only on the information in the text, categorize the causal relation as 0. no relation 1. direct causal 2. conditional causal 3. correlational Your response should analyze the situation in a step-by-step manner, ensuring the correctness of the ultimate conclusion, which should accurately reflect the likely causal connection based on the information presented in the text. 
If no clear causal relationship is apparent, select the appropriate option accordingly, i.e., 'no relation'. option_choice_msg = Your response should analyze the situation in a step-by-step manner, ensuring the correctness of the ultimate conclusion, which should accurately reflect the likely causal connection between the two entities based on the information presented in the text. If no clear causal relationship is apparent, select the appropriate option accordingly. Then provide your final answer within the tags &lt;Answer&gt;[answer]&lt;/Answer&gt;, (e.g. &lt;Answer&gt;1&lt;/Answer&gt;).</figDesc></figure>
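Extracting the final label from the model's free-text response, and discarding answers outside the defined label set (as done for the samples removed from the evaluation), can be sketched as below; the function name is ours.

```python
import re

VALID = {0, 1, 2, 3}  # the four labels defined in the prompt of Fig. 1

def parse_answer(response: str):
    """Pull the label out of an <Answer>...</Answer> tag; return None when
    the tag is missing or the model invents a label outside the defined
    set, so such responses can be filtered out before scoring."""
    m = re.search(r"<Answer>\s*(\d+)\s*</Answer>", response)
    if not m:
        return None
    label = int(m.group(1))
    return label if label in VALID else None
```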
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: A zero-shot prompt for a causal recognition task.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: An example of correct correlational classification.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: An example of correct classification of a direct causal relation.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: An example of a direct causal relation misclassified as no relation.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Zero-shot (F1) performance of GPT and Falcon LLMs.</figDesc><table><row><cell>Model</cell><cell>Approach</cell><cell cols="2">Binary class Multi-class</cell></row><row><cell>GPT 3.5 turbo</cell><cell>Zero shot (ZS)</cell><cell>0.59</cell><cell>0.37</cell></row><row><cell>Falcon-7b-instruct</cell><cell>Zero shot (ZS)</cell><cell>0.19</cell><cell>0.27</cell></row><row><cell cols="2">Falcon-40b-instruct Zero shot (ZS)</cell><cell>0.26</cell><cell>0.38</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Comparison of zero-shot, with and without cues, and few-shot LLMs against fine-tuned BERT.</figDesc><table><row><cell>Model</cell><cell>Approach</cell><cell cols="2">Binary class Multi-class</cell></row><row><cell>GPT 3.5 turbo</cell><cell>Zero shot (ZS)</cell><cell>0.59</cell><cell>0.37</cell></row><row><cell>GPT 3.5 turbo</cell><cell>Zero shot with Cues (ZS-Cues)</cell><cell>0.66</cell><cell>0.51</cell></row><row><cell>GPT 3.5 turbo</cell><cell>Few shot with Cues (FS-Cues)</cell><cell>0.62</cell><cell>0.50</cell></row><row><cell cols="2">BERT-base-cased Full Fine Tuning (FFT)</cell><cell>0.92</cell><cell>0.87</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Results on multi-class and binary settings for zero-shot classifications.</figDesc><table><row><cell></cell><cell></cell><cell cols="2">Multi-class</cell><cell></cell><cell cols="2">Binary</cell></row><row><cell></cell><cell cols="6">No Rel. Causal Cond. Causal Corr. No Rel. Causal</cell></row><row><cell>F1-score</cell><cell>0.45</cell><cell>0.39</cell><cell>0.12</cell><cell>0.54</cell><cell>0.59</cell><cell>0.58</cell></row><row><cell>Precision</cell><cell>0.68</cell><cell>0.27</cell><cell>0.10</cell><cell>0.60</cell><cell>0.85</cell><cell>0.45</cell></row><row><cell>Recall</cell><cell>0.34</cell><cell>0.70</cell><cell>0.14</cell><cell>0.48</cell><cell>0.45</cell><cell>0.85</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>BERT-Based FFT vs. GPT models with 500-shot training samples.</figDesc><table><row><cell>Model</cell><cell cols="4">No Rel. Causal Cond. Causal Corr. Avg.</cell></row><row><cell cols="2">BERT-base-cased 0.86</cell><cell>0.77</cell><cell>0.74</cell><cell>0.86 0.81</cell></row><row><cell>GPT 3.5 turbo</cell><cell>0.61</cell><cell>0.42</cell><cell>0.12</cell><cell>0.55 0.43</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Sample sentence predictions from different prompts and ground truth (GT) values.</figDesc><table><row><cell>I</cell><cell>Samples</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://pubmed.ncbi.nlm.nih.gov.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://simpletransformers.ai.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://huggingface.co/tiiuae/falcon-7b-instruct.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Improving Language Understanding by Generative Pre-Training</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Narasimhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Salimans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>OpenAI</publisher>
		</imprint>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Large language models are zero-shot reasoners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kojima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Reid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matsuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Iwasawa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="22199" to="22213" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Shu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2202.01924</idno>
		<title level="m">Zero-shot aspect-based sentiment analysis</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tripathi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Singh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1906.04914</idno>
		<title level="m">From fully supervised to zero shot settings for twitter hashtag recommendation</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Teney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Van Den Hengel</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1611.05546</idno>
		<title level="m">Zero-shot visual question answering</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Zero-shot commonsense question answering with cloze translation and consistency optimization</title>
		<author>
			<persName><forename type="first">Z.-Y</forename><surname>Dou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Peng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="10572" to="10580" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Katiyar</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.02405</idno>
		<title level="m">Simple and effective few-shot named entity recognition with structured nearest neighbor learning</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">GPT-3: Its nature, scope, limits, and consequences</title>
		<author>
			<persName><forename type="first">L</forename><surname>Floridi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chiriatti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Minds and Machines</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="681" to="694" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Can GPT-3 pass a writer&apos;s Turing test?</title>
		<author>
			<persName><forename type="first">K</forename><surname>Elkins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Cultural Analytics</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A survey of zero-shot learning: Settings, methods, and applications</title>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">W</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Miao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Intelligent Systems and Technology (TIST)</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="1" to="37" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Large-scale multi-modal pre-trained models: A comprehensive survey</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X.-Y</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Gao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="447" to="482" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Causal BERT: Language models for causality detection between events expressed in text</title>
		<author>
			<persName><forename type="first">V</forename><surname>Khetan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ramnani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Anand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sengupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">E</forename><surname>Fano</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Intelligent Computing: Proceedings of the 2021 Computing Conference</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="965" to="980" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A survey on BERT and its applications</title>
		<author>
			<persName><forename type="first">S</forename><surname>Aftan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Shah</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2023 20th Learning and Technology Conference (L&amp;T)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="161" to="166" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Causal parrots: Large language models may talk causality but are not causal</title>
		<author>
			<persName><forename type="first">M</forename><surname>Zečević</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Willig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Dhami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kersting</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions on Machine Learning Research</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Causality in the time of LLMs: Round table discussion results of CLeaR</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Janzing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Der Schaar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Locatello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Spirtes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page">7</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Can large language models distinguish cause from effect?</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Lyu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mihalcea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sachan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schölkopf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">UAI 2022 Workshop on Causal Representation Learning</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Antonucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Piqué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zaffalon</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.14670</idno>
		<title level="m">Zero-shot causal graph extrapolation from text via LLMs</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Almazrouei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Alobeidli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alshamsi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cappelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cojocaru</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Debbah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">É</forename><surname>Goffinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hesslow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Launay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Malartic</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.16867</idno>
		<title level="m">The Falcon series of open language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Koopman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zuccon</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.13243</idno>
		<title level="m">Open-source large language models are strong zero-shot query likelihood models for document ranking</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.01044</idno>
		<title level="m">Large language models are zero-shot text classifiers</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Joo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Seo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.14045</idno>
		<title level="m">The CoT collection: Improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.10703</idno>
		<title level="m">ReGen: Zero-shot text classification via training data generation with progressive dense retrieval</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">An empirical study of GPT-3 for few-shot knowledge-based VQA</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Gan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="3081" to="3089" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Moradi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Blagec</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Haberl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Samwald</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2109.02555</idno>
		<title level="m">GPT-3 models are poor few-shot learners in the biomedical domain</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">On the effectiveness of parameter-efficient fine-tuning</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M.-C</forename><surname>So</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Collier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="12799" to="12807" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks</title>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Tam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Short Papers</title>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="61" to="68" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2101.00190</idno>
		<title level="m">Prefix-tuning: Optimizing continuous prompts for generation</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<title level="m">GPT understands, too</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>AI Open</note>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Kıcıman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ness</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.00050</idno>
		<title level="m">Causal reasoning and large language models: Opening a new frontier for causality</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">A critical review of causal inference benchmarks for large language models</title>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Clivio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Shirvaikar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Falck</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AAAI 2024 Workshop on &quot;Are Large Language Models Simply Causal Parrots?&quot;</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Detecting causal language use in science findings</title>
		<author>
			<persName><forename type="first">B</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4664" to="4674" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">The association between exaggeration in health related science news and academic press releases: retrospective observational study</title>
		<author>
			<persName><forename type="first">P</forename><surname>Sumner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vivian-Griffiths</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Boivin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Venetis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Davies</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ogden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Whelan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hughes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dalton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMJ</title>
		<imprint>
			<biblScope unit="volume">349</biblScope>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">It&apos;s not just size that matters: Small language models are also few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Schick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2339" to="2352" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
