1. Introduction

Semantic Priming in GPT: Investigating LLMs Through a Cognitive Psychology Lens

Filippo Colombi

Carlo Strapparava

0 0 Fondazione Bruno Kessler , Via Sommarive, 18, 38123 TN, Trento , Italy 1 University of Trento , Via Calepina, 14, 38122 TN, Trento , Italy

2025

Understanding whether large language models (LLMs) capture human-like semantic associations remains an open challenge. This study investigates semantic priming within GPT-4o Mini by analyzing probabilistic responses to psycholinguistically validated prime-target pairs. Prime-target stimuli were extracted from the Semantic Priming Project database, embedding target words within masked sentence contexts preceded by semantically related or unrelated primes. Model responses were quantified using log-probabilities associated with predicted tokens, allowing comparative evaluation of semantic priming efects. Results reveal that the model's predictive outputs reflect priming efects when analysis is restricted to fully reconstructed data, yet these efects diminish significantly under data imputation strategies addressing extensive missingness. This discrepancy highlights critical issues regarding data preprocessing, tokenization, and the management of missing values in computational semantic experiments. Implications for future research in cognitive modeling and the refinement of LLM architectures to better approximate human semantic processing are discussed.

eol>semantic priming large language models GPT-4o language modelling experimental psycholinguistics

1. Introduction

stronger priming efects. Furthermore, Neely [ 3] differentiated between automatic and controlled semantic Semantic priming, a fundamental phenomenon in psy- priming processes in 1977. Automatic priming occurs cholinguistics and cognitive neuroscience, provides crit- rapidly and unconsciously at short stimulus onset asynical insights into how the human brain organizes and chronies (SOAs), reflecting the passive spread of actiretrieves semantic knowledge. It refers to the facilitation vation within the semantic network. In contrast, conof a target word’s recognition or processing when it is trolled priming involves conscious, strategic processes preceded by a semantically related prime. This efect that emerge at longer SOAs, where participants anticwas first empirically demonstrated by Meyer and Schvan- ipate certain responses based on contextual cues. The eveldt in 1971 [1] using the lexical decision task where neural correlate of semantic priming was clarified by participants identified words more quickly when pro- with the discovery of the N400 event-related potential ceeded by related primes (e.g., bread-butter) compared (ERP) component [4]. It is a negative deflection of the to unrelated pairs (e.g., guitar-butter). This finding sug- brain electrical activity that peaks approximately 400 gested that related concepts in the mental lexicon are ms after the presentation of a semantically incongruent interconnected, enabling more eficient retrieval. Build- stimulus. In their study, unexpected sentence endings ing on this, Collins and Loftus [2] proposed the spreading elicited larger N400 responses compared to congruent activation model of semantic memory in 1975. Accord- completions, providing neurophysiological evidence that ing to this model, the mental lexicon is structured as a semantic priming modulates brain activity during lannetwork of interconnected nodes representing concepts. guage comprehension. Recent work has started to inWhen a prime word is processed, activation spreads vestigate priming phenomena in large language models, to related nodes, reducing the activation threshold re- showing parallels with human language processing. For quired to recognize semantically connected targets. This structural priming, Michaelov et al. [5] demonstrate that framework accounts for the graded nature of semantic LLMs exhibit human-like inverse frequency efects and priming, where more closely related concepts exhibit that prime-target dependencies influence prediction preferences, revealing systematic parallels with production tCicLsi,CS-eitpt2e0m25b:eErl2e4ve—nt2h6I,t2a0li2a5n, CCaognlfiearrein,cIteaolyn Computational Linguis- preferences in humans. Similarly, semantic activation * Corresponding author. patterns—akin to classical semantic priming in psycholin† These authors contributed equally. guistics—have been explored both in humans and LLMs, $ filippo.colombi@studenti.unitn.it (F. Colombi); strappa@fbk.eu highlighting ways in which contextual cues modulate (C. Strapparava) internal representations. These findings motivate situat0009-0000-1307-7857 (F. Colombi); 0000-0002-9365-0242 ing our methodology within this emerging line of work (C. Strapparava) © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License and clarifying how our approach compares and contrasts Attribution 4.0 International (CC BY 4.0). with prior operationalizations. toregressive nature, where each word is predicted based on the preceding context. This mechanism inherently mirrors aspects of the human predictive processing in language comprehension, making it a suitable ground for examining whether priming emerges from the model’s output.

Motivations. This foundational framework informs the present study, which investigates whether similar semantic priming efects manifest in large language models (LLMs) like GPT-4o. By comparing the probabilistic output of the model in related and unrelated prime-target conditions, this research explores whether LLMs exhibit Research Question and Hypotheses. The present cognitive-like patterns of semantic association, bridging work proposes to investigate whether LLMs, such as GPTcomputational modelling with traditional psycholinguis- 4o1, exhibit semantic priming efects similar to those tic paradigms. The motivation behind this study stems observed in human cognition, exploring if semantic asfrom a broader interest in cognitive modelling using AI. sociations emerging from their probabilistic outputs reThese systems ofer a convenient starting point for mod- lfect transferable cognitive mechanisms. This research is elling and exploring human language processing due to situated within a growing field that compares AI to hutheir architecture and training on vast amounts of linguis- man cognition, exploring parallels and divergences. The tic data. A critical question is whether the behaviours aim is to assess whether the model not only reflects simthey exhibit are unique to their training processes or if ple statistical learning but also develops semantic structhey mirror transferable cognitive mechanisms inherent tures resembling human semantic networks. In other to human language processing. Understanding this could words, the goal is to determine whether the autoregrescontribute to the debate of whether LLMs merely reflect sive behaviour of the model generates priming efects statistical learning or if they approximate the cognitive comparable to those observed in traditional psychological structure that governs human semantic memory. Neural paradigms. Therefore, the research question we propose networks like GPT are trained on massive datasets, cap- is the following: Does GPT-4o mini model exhibit a signifturing statistical regularities, co-occurrence patterns and icant diference in the probability values of target words semantic relationships present in human language. While when presented in related priming conditions compared these models are not biological in nature, the structured to unrelated conditions? statistical patterns they learn often mimic human-like associations. This raises intriguing questions: do these Expected Outcomes. It is hypothesized that targets models, through exposure to language data, develop se- will exhibit higher probabilities values in the related conmantic networks akin to those observed in the human dition compared to those presented in unrelated conbrain? And if so, can they serve as valid proxies for study- ditions. This structure allows for the investigation of ing cognitive processes like semantic priming? Beyond whether the emergent cognitive traits of LLMs can be theoretical interests, there are significant practical ap- considered analogous to the dynamics of human semanplications to this line of inquiry. These systems could tic memory and whether traditional psycholinguistic be employed to predict and model human behaviours paradigms can be employed to evaluate the validity of in various linguistic tasks, providing a new tool for psy- these models as devices for cognitive research. cholinguistic research. Moreover, understanding how closely they align with human cognitive processes could inform the refinement of AI architectures, enabling the 2. Methodology development of models that better capture human-like semantic organization. GPT-4o is a state-of-the-art (SOTA) In autoregressive systems as GPT-4o, text generation model in numerous linguistic domains, including natural is fundamentally modelled as a conditional probability language understanding, text generation, translation and problem. The model predicts the next word in a sequence dialogue systems. Its ability to produce highly coherent, based on the preceding context, represented mathematihuman-like linguistic artifacts makes it an ideal candidate cally as for investigating semantic priming efects. Beyond the (|1, 2...−1 ) (1) mere scarcity of experiments on priming, there remains where () is the probability of generating a word a broader and more fundamental question: To what ex- given the previous ones. This probabilistic framework tent do LLMs, particularly closed-source models, exhibit underpins how the model processes language and genersemantic processing mechanisms that align with human ates outputs, making it a suitable foundation for invespsycholinguistic assessments? While extensive research tigating semantic priming efects. In the context of this has been conducted on model performance and generative capabilities, little is known about whether their 1The experiment was run with GPT-4o mini. However, we will often response to such assessments parallel those reported in refer to it as GPT-4o or GPT throughout the text. This is just to human. This is particularly relevant given GPT-4o’s au- make reading as smooth as possible. experiment, the target word is presented after a prime that is either semantically related or unrelated. To assess whether GPT-4o exhibits priming efects, the following contrast was applied (|_)

In this experiment, GPT-4o mini was presented with prime-target pairs, where the prime word was either semantically related or unrelated to the masked target word embedded within a sentence. For each trial, the model received a prompt consisting of the prime followed by a sentence with the target word omitted and was instructed to generate a single word to fill the blank. (|_) (2)

Stimuli Presentation. The stimuli were presented to

If semantic priming is present, the model should as- GPT through 500 structured API calls designed to simusign a higher probability to the target word in the related late an experimental paradigm of cognitive psychology. condition, reflecting an internal representation of seman- Each stimulus consisted of a prime word (semantically tic association similar to those of humans. GPT models related or unrelated to the target) and a sentence conoutput not only the predicted tokens but also the log- taining a masked target word. The API was configured probabilities (log-probs) associated with each token to prompt the model with both the prime and the incomplete sentence as input text: [Prime Word]. [Sentence with the target masked as ". . . "]. () = [ (|1, 2, ..., −1 )]

(3)

A log-prob closer to 0 indicates a higher predicted probability, while more negative values indicate lower confidence in the prediction. In this experiment, we use log-probs to quantify the model’s confidence in predicting the target word. Thus, semantic priming is operationalized as () > ()

(4)

2.1. The Experiment

Our operationalization of priming diverges from the maybe more familiar formulation of computing prim- For example, in a related condition, the prime “being as the diference in the log probability of a fixed low” may precede the sentence “The Ferrari finished six target given congruent versus incongruent primes [6, 7] places . . . the Mercedes”, where the target is ”above”. In because we aim to isolate semantic activation in contexts the unrelated condition, the same sentence would be where the is not trivially predicable and to control for preceded by an unrelated prime such as “postage”. This context-dependent insertion efects. In particular, the structure allowed for direct comparison of the model’s ifll-in-the-gap setup we use allows us to: (i) position predictions across priming conditions. To ensure conthe target in a controlled environment so that its activa- trolled responses, the model was provided with a system tion can be assessed relative to a specific semantic cue instruction to return a single-word completion for the (the prime), and (ii) avoid conflating efects due to target masked portion of the sentence. The temperature was set salience or surface-form predictability that a straight- to zero to minimize randomness and enforce determinisforward target-diference formulation might implicitly tic outputs, and finally log-probs were requested for the include. We evaluated the design quantitatively and en- predicted token, together with the top 15 alternatives. sured that it produces a signal consistent with priming as a contextual modulation of likelihood, without relying on Retrieval of Log-Probabilities. Log-probs provide the assumption that the target sentence is equally well- an exhaustive measure of the model’s confidence in preformed or equally predictable across conditions. Concep- dicting a given token because they reflect the probatual comparisons suggest that our pipeline captures the bility distribution over multiple possible continuations, same directional priming influence while ofering control rather than just the most likely one. They allow for a over the insertion context and over cases where native nuanced comparison of how strongly the model favours target continuity would otherwise introduce ambiguity. certain predictions, making them particularly useful for A schematic of the pipeline and an illustrative example assessing semantic priming efects. However, retrieving are provided below. log-probs for the intended target posed a computational challenge due to the tokenization structure of GPT outputs, requiring a sophisticated reconstruction algorithm.

When GPT generates a response, it predicts the single most likely token (i.e., the actual completion), but it can also return log-prob values for multiple alternative predictions—if explicitly requested in the API call. These values are stored in a structure that contains the predicted token along with a ranked set of alternatives, each associated with its probability. An additional complication arose because GPT often predict sub-word units, meaning that a target word might be split into multiple tokens2. Such level of complexity necessitated a reconstruction system capable of piecing together each “brick” to retrieve the log-probability of the intended word. The retrieval system operated by matching the original target word against the set of alternative completions of the model. If the target appeared in its entirety among the predictions, its associated log-prob was directly extracted. Conversely, when the model provided sub-word tokens, a beam search strategy was employed to reconstruct the word step-by-step. At each stage, candidate sequences were expanded by adding predicted tokens, ensuring that only those maintaining a valid morphological match with the target were retained. Once a valid reconstruction was found, the sum of the probabilities of constituent tokens was computed, and the least negative candidate (i.e., the most probable one) was selected as the best match. Where no reconstruction matched the original target, no log-prob was assigned (NaN), leaving its interpretation for later stages of analysis.

Data Construction. The stimuli set was built following previous research [8] and was designed to ensure that semantic associations were robustly controlled. A total of 250 triplets (target, related prime, unrelated prime) were selected from the Semantic Priming Project (SPP), a widely used database containing highly validated prime-target association from human behavioural studies. The rationale behind using SPP was its empirical grounding—these prime-target pairs have been extensively tested in psycholinguistic experiments, making them an ideal starting point for evaluating whether LLMs, like GPT, exhibit cognitive processes akin to those observed in human behavioural tests. Given that GPT is trained on massive linguistic corpora, it has probably internalized complex semantic structures, making it a suitable model for priming-based investigations. To construct the experimental dataset, the following procedure was applied:

1. Selection of prime-target pairs: 2All GPT models leverage a Byte Pair Encoding (BPE) tokenizer, which allows for flexible and semantically complete processing of linguistic data 2. Pairing process: 3. Contextual sentence construction: • A randomly chosen prime-target pair was

selected from SPP in the related condition. • The corresponding prime-target pair was selected to contrast with the related condition. • Only first-associate (most common) target was considered, ensuring strong semantic links for the related condition. • Each related and unrelated prime was paired with the same target word, creating a contrastive pair. • A sentence was invented to serve as a con

textual frame for the target word. • The target word was removed from the sentence and replaced with a placeholder ("...") creating a fill-in-the-blank format for the model. • The entire dataset was stored in a structured tabular format, with each stimulus set organized as follow.

4. Tabular data representation: 2.2. Statistical Testing

To determine whether GPT-4o exhibits semantic priming efects, a statistical approach was designed to compare the log-probabilities of target words across related vs. unrelated priming conditions. Since log-probs are continuous numerical values, they provide a measure of the model’s confidence in predicting a given word, making them suitable for inferential statistical analysis. The key objective of this analysis was to assess whether log-probs were significantly higher (closer to 0) in the related condition compared to the unrelated condition, mirroring the facilitatory mechanism observed in human priming studies. Given the paired nature of the data—where each target word appears in both conditions with the same sentence context—the statistical analysis was designed to compare log-probs at the within-item level. Statistical tests often require that data distribution meets certain Multiple Imputation Approach. The first strategy involved multiple imputation, a statistical technique that estimates missing log-probs based on the distribution of observed data. Imputation is considered a reasonable approach to retain a larger dataset while minimizing bias.

Here, an assumption of near-random data missingness had been adopted, although similar hypotheses are often dificult to verify. assumptions. Specifically, normality was a key consider- Multiple Imputation Results. Before conducting hyation: if the distribution of log-probs followed a normal pothesis testing, missing values in log-probs were adpattern, a paired t-test would be appropriate; if not, a dressed using multiple imputation (MI). Out of 500 total Wilcoxon signed-rank test, a popular non-parametric al- observations, 201 (40%) were missing, requiring imputernative, would be used instead. Following this strategy, tation to allow for a complete dataset. Five imputed an initial assessment of normality was planned, ensur- datasets were generated using a multivariate imputer ing that the choice of statistical test was applied ad-hoc, that estimates each value from all the others. Pooled rather than arbitrary. This decision was crucial because estimates were finally derived. To assess how impulog-probs are inherently skewed measures, often concen- tation afected the distribution of log-probs, summary trated around certain thresholds, and the dataset was statistics were calculated before and after imputation. expected to contain NaN values where the model failed The only relevant variation is over standard deviation to predict (or the retrieval algorithm failed to recompose) (std). To determine whether a parametric test or a nonthe target word. To maintain statistical rigor, missing parametric alternative was appropriate, normality of the values would be handled through imputation, but this imputed log-probs was assessed using the Shapiro-Wilk step also had the potential to afect normality, requiring test. This evidenced a significant departure from nora flexible approach. mality ( = 0.891, < 0.05) indicating that a nonparametric test was required for hypothesis testing. A Wilcoxon signed-rank test showed that there is no strong evidence that GPT-4o mini assigned significantly higher log-probs to targets in the related condition vs. the unrelated condition ( = 441.0, = 0.088). This contrasts with expectations, as human studies typically show a clear priming efect in reaction times and lexical decision tasks.

Complete-Case Results. The complete-case analy

Complete Case Analysis. Precisely because it is difi- sis was conducted using only full retrieved prime-target cult to determine with certainty whether the data is miss- pairs, ensuring that all statistical comparisons were based ing for largely random reasons, it is also useful to perform on directly observed data. Out of 500 total trials, 298 logthe test on the dataset without imputation. Therefore, prob values were successfully retrieved, but only 127 the second approach involved analysing the subset of contrastive pairs could be reconstructed for direct comthe results where log-probabilities for each condition parison. This represents a substantial reduction in sample were reconstructed. Both approaches were then tested size, which afects statistical power but ensures that no following the statistical decision tree: if normality was assumptions were made about missing values. Congrupreserved, a paired t-test would be applied; if not, the ently to what was done with imputed data, a normality Wilcoxon signed-rank would be used instead. assessment was conducted to confirm a strong deviation from normality ( = 0.789, < 0.05). Since normality assumption was violated, a Wilcoxon signed-rank 3. Results test was conducted to compare the survived log-probs. Unlike multiple imputation, the complete-case yielded a significant result ( = 1793.0, < 0.05). This provides evidence that GPT-4o mini exhibits a semantic priming effect, with significantly higher log-probabilities for target words in related conditions than in unrelated conditions.

The aim of this results section is to determine whether

GPT-4o mini exhibits semantic priming efects, measured as diferences in log-probabilities of target words in related vs. unrelated priming conditions. Given the presence of missing—cases where the experiment failed to generate the expected target word—two complementary analytical approaches were adopted. Summarizing from 4. Discussion the previous section: (a) Multiple Imputation, which estimates missing values to maintain the statistical power, The findings of this study ofer an interesting perspecand (b) Complete-Case Analysis, which restricts the tive on the challenges of using LLMs in cognitive moddataset to instances where log-probs were successfully elling. While complete-case analysis detected a signifiretrieved in both conditions, ensuring pairwise compar- cant priming efect, the multiple imputation approach did isons. not, raising important methodological and conceptual inquiries. The discussion is divided into two sections: (a) methodological considerations, focusing on missing data challenges, tokenization artifacts, statistical sensi- retrieval was the format of the model’s output, which tivity, and potential imputation biases that may have returns a ranked list of predicted tokens along with their influenced the results and (b) conceptual implications, log-probs. In cases where the model generated the target addressing whether LLMs exhibit cognitive-like prim- as a single token extraction was straightforward. Howing, how predictive mechanisms compare to biological ever, when the model split the target across multiple tosemantic encoding and retrieval and what these findings kens, its overall log-prob had to be reconstructed from its mean for cognitive modelling. individual components—a process that introduces uncertainty. To tackle this challenge, a beam search algorithm 4.1. Methodological Considerations was implemented to iteratively reconstruct multi-token targets from the list of predicted sub-word tokens. While Handling Missing Data beam search improved reconstruction, it also introduced potential artifacts: (a) some reconstructions may not have perfectly matched the intended target, leading to incorrect log-prob values, and (b) certain targets may have been tokenized inconsistently. If tokenization patterns difered systematically between conditions, this could have biased log-prob retrieval, introducing a confound.

In this experiment, a critical methodological challenge was posed by missing data—40% of the log-prob values—requiring the use of multiple imputation to reconstruct a complete dataset. MI is generally preferred over list-wise deletion, as it preserves statistical power by estimating missing values based on the observed distribution. However, when such a substantial portion of data is missing, MI may not fully recover the real distribution, Statistical Sensitivity and Priming Detection. That raising questions about representativeness. One conse- being said, divergent findings in MI and complete-case quence is the arousal of variance compression in log- results likely arise from two interrelated factors: (a) variprobs values, testified by a shrink in standard deviation. ance compression introduced by imputation, which may This phenomenon likely occurs predicting missing val- have diluted the contrast between related and unrelated ues based on observed ones, pulls extreme values toward conditions, and (b) tokenization and reconstruction inthe mean. While this can stabilize estimates in smaller consistencies, which could have added noise to log-prob datasets, it may have unintentionally smoothed meaning- retrieval, particularly in cases where targets were split ful variability in the log-probs, afecting true distribution. into multiple tokens. The takeaway is that priming sigIndeed, normality test showed a significant departure nal drawn from next-word probability retrieval in LLMs from normality after imputation was performed. Since may be relatively weak, making it overtly susceptible to semantic priming efects are often subtle, any reduction distortions introduced by data pre-processing. in variance could have diminished the contrast between related and unrelated conditions, thereby weakening the 4.2. LLMs and Cognitive Modelling observable efects. This is consistent with the Wilcoxon test result in the MI dataset, whereas the complete-case The methodological considerations discussed so far analysis did detect a significant efect. The divergence demonstrated how data pre-processing choices and tokbetween imputed and complete-case results raises an im- enization can influence statistical sensitivity in LLM cogportant methodological question: did MI impoverish the nitive experiments. However, these findings also raise priming efect, preventing statistical detection, rather than deeper conceptual questions: To what extent do LLMs recover lost information? If the missing data was missing exhibit semantic priming efects comparable to those obnot at random (MNAR)3 but instead systematic then MI served in human cognition? And if LLMs capture statisticould have incorrectly smoothed meaningful distinctions, cal relationship between words, does this also means that masking an efect that was present in the raw data. they can replicate the cognitive mechanisms underlying human semantic memory? To answer such questions, it Tokenization and Target Reconstruction Bias. A is possible to draw insights from the two dominant theosignificant challenge in the experiment was retrieving retical frameworks that have shaped our understanding log-probabilities for target words due to GPT’s sub-word on semantic processing: spreading activation theory, as tokenization. Like other transformer models, it does not already presented in the introductory section and in the always generate words as units, instead break less fre- predictive coding theory (Friston, 2005). These models ofquent or morphologically complex words into multiple fer diferent perspectives on how the brain organizes and sub-word tokens via BPE. This posed a serious obsta- retrieves meaning and comparing findings from present cle to probability extraction. Further complicating word work allows to assess the extent to which LLMs approximate cognitive mechanisms. The rest of this section reflects on these themes.

3Unfortunately, there is no surefire way to determine in which cat

egory data will fall. Random missingness is an assumption that need to be made based upon direct knowledge of the data and its collection mechanisms.

Spreading Activation, Semantic Memory and LLMs ing token selection within the fixed-parameters of the The spreading activation theory (Collins & Loftus, 1975) trained model. This means GPT does not actively minisuggests that semantic memory is structured as a network mize uncertainty over time. The experimental findings of interconnected concepts, where activation spreads support this distinction. In human coding models, primfrom one node (a word/concept) to related nodes based on ing efects are expected to persist across diferent noise semantic similarity and association strength. This model conditions because the brain continuously adjust its prohas been widely supported by human psycholinguistic cessing. In contrast, the fragility of GPT’s mechanisms studies. The priming efects detected in the complete-case suggests that the models lack a hierarchical learning proanalysis seems to align with spreading activation frame- cess that adapts to uncertainty over time. This highlights work. LLMs, much like human semantic memory, links a fundamental limitation of LLMs: while they approxiconcept by encoding statistical co-occurrence patterns be- mate prediction-driven behaviours, they do not engage in tween words—though they do it on a considerably larger error-driven learning during inference, a key component scale. However, while human priming efects are driven of human cognition. As a result, while priming in LLMs by neural activation spreading across conceptual net- may superficially resembles predictive coding, it does not works, GPT does not store explicit semantic structures, it capture the adaptive mechanisms that govern biological instead predicts word based on learned probability distri- semantic memory. The results of this study highlight an butions. This distinction is crucial: in human cognition, ongoing debate in cognitive modelling: to what extent spreading is dynamically modulated by context, prior ex- do LLMs exhibit cognitive-like processing? The presence perience, and attentional control, whereas LLMs’ priming of a priming efect suggests that. LLMs capture meanemerges from purely statistical dependencies in language ingful relationships between words, much like spreading data. Current results suggest that semantic priming ef- activation models, but the disappearance of this efect in fects in GPT do not necessarily indicate cognitive-like the imputed dataset suggests that LLMs’ priming is more concept retrieval. The observed priming efect is likely fragile than human priming. Together, these findings a by-product of training, rather than a direct parallel to give the impression that LLMs do not simulate human human conceptual activation. Additionally, the lack of cognition in a mechanistic sense. Instead, they exhibit a significant efect in MI dataset further challenges the statistical properties that resemble cognitive processes idea that LLM-based priming mirrors human spreading at the output level but are not necessarily driven by the activation dynamics. According to human experiments, same underlying computations. priming efects persist despite noise or missing data because activation propagates through associative memory networks. In contrast, the weakening of priming in the imputed dataset suggests a more fragile mechanism.

Final Thoughts and Future Directions. We firmly believe that while LLMs do not currently replicate human semantic cognition, they ofer valuable tools for modelling language-based associations. It is our opinion that the presented approach may be improved and extended: Predictive Coding and the Mechanisms Underlying Priming in LLMs. An alternative perspective for understanding semantic processing is predictive coding theory [9]. This model suggests that the brain functions as a hierarchical predictive system, continuously generating expectations about incoming sensory input and minimizing prediction errors by adjusting internal models. In this framework, priming occurs because a related prime reduces the uncertainty (prediction error) associated with recognizing the target, leading to faster processing. LLMs, particularly autoregressive models like GPT, operate in a manner structurally similar to predictive coding. They generate words one at a time, updating predictions based on past context. This aligns with the core principle of predictive coding. The log-probabilities extracted in this study measure the system’s internal prediction certainty, making them conceptually analogous to prediction error signals in the human brain. The critical diference is that in biological brains, prediction errors lead to adaptive training and belief updating, whereas in LLMs, prediction errors do not modify the model in real-time—they rather influence generation for a short time-window, impact1. Target predictability: controlling for how predictable a target word is in natural language using frequency norms, surprisal values and entropybased estimates. This would help disentangle semantic priming from simple word predictability in LLMs. 2. Word frequency efects: since high-frequency words are easily predicted and low-frequency words may be underrepresented in training data, future experiments should systematically control word frequency to determine its impact in priming strength. 3. Contextual influence: LLMs process meaning based on statistical co-occurrence within a fixed context window, which may amplify or suppress subtle priming efects. Future studies should manipulate prime-target distance to assess if context length and structural dependencies influence results. Additionally, future research should explore alternative token-matching strategies, ensuring log-probs reconstruction does not systematically fail with certain word structures. And finally, it should be also considered if modifying LLM architectures—for example, incorporating mechanisms for hierarchical belief updating similar to predictive coding models—would lead to more cognitively plausible representations of meaning.

Comparative studies relating neural language process

ing signals (e.g., N400 efects) to outputs of LLMs have been increasingly prominent. Heilbron et al. [ 10, 11 ] demonstrated that predictability estimates produced by deep neural language models (e.g., GPT-2) correlate with EEG/MEG components—including N400 and P600—during naturalistic comprehension, providing direct evidence that model-derived surprisal signals track human-like prediction dynamics. Subsequent work has further refined the cognitive plausibility of transformerbased models in this domain, showing that their contextual predictions are closely aligned with neural signatures of semantic facilitation and processing dificulty [5]. While Futrell et al.[12] approach the question from a complementary angle—treating neural language models as psycholinguistic subject to probe their internal syntactic representations—these strands jointly motivate our efort to align LLM-based priming metrics with known neural phenomena.

Acknowledgments References Code Availability Code and data for reproducing the results are publicly available on GitHub at https://github.com/fico/ semantic-priming-in-LLMs We acknowledge the support of the PNRR project FAIR Future AI Research (PE00000013), under the NRRP MUR program.

Declaration on Generative AI During the preparation of this work, the author(s) used ChatGPT (OpenAI), Grammarly, and DeepL Write / DeepL Translate in order to: Drafting content, Text translation, Paraphrase and reword, Improve writing style, and Peer review simulation. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.

18653 /v1/ 2024 .findings-acl. 877 . [7]

B.-D.

Oh ,

Schuler , Leading whitespaces of lan-

ceedings of the 2024 Conference on Empirical Meth-

2024 , pp. 3464 - 3472 . [8]

Hutchison ,

Balota ,

Neely , M. Cortese,

ing project , Behavior research methods 45 ( 2013 ).

doi:10 .3758/s13428-012-0304-z. [9]

Friston , A theory of cortical responses , Philo-

don. Series B, Biological sciences 360 ( 2005 ) 815 - 836 .

doi:10 .1098/rstb. 2005 . 1622 . [10]

Heilbron ,

Ehinger ,

Hagoort , F. de Lange,

deep neural language models , in: 2019 Confer-

CCN , Cognitive Computational Neuroscience , 2019 .

URL: http://dx.doi.org/10.32470/CCN. 2019 . 1096 - 0 .

doi:10 .32470/ccn. 2019 . 1096 - 0 . [1]

Meyer , R. Schvaneveldt, Facilitation in recog- [11]

Heilbron ,

Armeni , J.-M. Schofelen , P. Ha-

tal psychology 90 ( 1971 ) 227 - 234 . doi: 10 .1037/ Proceedings of the National Academy of Sciences

h0031564. 119 ( 2022 ). doi: 10 .1073/pnas.2201968119. [2]

Collins ,

Loftus , A spreading activation theory [12]

Futrell , E. Wilcox,

Morita ,

Qian , M. Balles-

of semantic processing, Psychological Review 82 teros , R. Levy, Neural language models as psycholin-

( 1975 ) 407 - 428 . doi: 10 .1037//0033- 295X . 82 . 6. guistic subjects: Representations of syntactic state,

407. in: Proceedings of the 2019 Conference of the North [3]

Neely , Semantic priming and retrieval from lexi- American Chapter of the Association for Compu-

cal memory: Roles of inhibitionless spreading ac- tational Linguistics , Minneapolis, Minnesota, 2019 ,

tivation and limited-capacity attention , Journal of pp. 32 - 42 .