<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Priming in GPT: Investigating LLMs Through a Cognitive Psychology Lens</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Filippo Colombi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlo Strapparava</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive, 18, 38123 TN, Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <addr-line>Via Calepina, 14, 38122 TN, Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Understanding whether large language models (LLMs) capture human-like semantic associations remains an open challenge. This study investigates semantic priming within GPT-4o Mini by analyzing probabilistic responses to psycholinguistically validated prime-target pairs. Prime-target stimuli were extracted from the Semantic Priming Project database, embedding target words within masked sentence contexts preceded by semantically related or unrelated primes. Model responses were quantified using log-probabilities associated with predicted tokens, allowing comparative evaluation of semantic priming efects. Results reveal that the model's predictive outputs reflect priming efects when analysis is restricted to fully reconstructed data, yet these efects diminish significantly under data imputation strategies addressing extensive missingness. This discrepancy highlights critical issues regarding data preprocessing, tokenization, and the management of missing values in computational semantic experiments. Implications for future research in cognitive modeling and the refinement of LLM architectures to better approximate human semantic processing are discussed.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;semantic priming</kwd>
        <kwd>large language models</kwd>
        <kwd>GPT-4o</kwd>
        <kwd>language modelling</kwd>
        <kwd>experimental psycholinguistics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>stronger priming efects. Furthermore, Neely [ 3]
differentiated between automatic and controlled semantic
Semantic priming, a fundamental phenomenon in psy- priming processes in 1977. Automatic priming occurs
cholinguistics and cognitive neuroscience, provides crit- rapidly and unconsciously at short stimulus onset
asynical insights into how the human brain organizes and chronies (SOAs), reflecting the passive spread of
actiretrieves semantic knowledge. It refers to the facilitation vation within the semantic network. In contrast,
conof a target word’s recognition or processing when it is trolled priming involves conscious, strategic processes
preceded by a semantically related prime. This efect that emerge at longer SOAs, where participants
anticwas first empirically demonstrated by Meyer and Schvan- ipate certain responses based on contextual cues. The
eveldt in 1971 [1] using the lexical decision task where neural correlate of semantic priming was clarified by
participants identified words more quickly when pro- with the discovery of the N400 event-related potential
ceeded by related primes (e.g., bread-butter) compared (ERP) component [4]. It is a negative deflection of the
to unrelated pairs (e.g., guitar-butter). This finding sug- brain electrical activity that peaks approximately 400
gested that related concepts in the mental lexicon are ms after the presentation of a semantically incongruent
interconnected, enabling more eficient retrieval. Build- stimulus. In their study, unexpected sentence endings
ing on this, Collins and Loftus [2] proposed the spreading elicited larger N400 responses compared to congruent
activation model of semantic memory in 1975. Accord- completions, providing neurophysiological evidence that
ing to this model, the mental lexicon is structured as a semantic priming modulates brain activity during
lannetwork of interconnected nodes representing concepts. guage comprehension. Recent work has started to
inWhen a prime word is processed, activation spreads vestigate priming phenomena in large language models,
to related nodes, reducing the activation threshold re- showing parallels with human language processing. For
quired to recognize semantically connected targets. This structural priming, Michaelov et al. [5] demonstrate that
framework accounts for the graded nature of semantic LLMs exhibit human-like inverse frequency efects and
priming, where more closely related concepts exhibit that prime-target dependencies influence prediction
preferences, revealing systematic parallels with production
tCicLsi,CS-eitpt2e0m25b:eErl2e4ve—nt2h6I,t2a0li2a5n, CCaognlfiearrein,cIteaolyn Computational Linguis- preferences in humans. Similarly, semantic activation
* Corresponding author. patterns—akin to classical semantic priming in
psycholin† These authors contributed equally. guistics—have been explored both in humans and LLMs,
$ filippo.colombi@studenti.unitn.it (F. Colombi); strappa@fbk.eu highlighting ways in which contextual cues modulate
(C. Strapparava) internal representations. These findings motivate
situat0009-0000-1307-7857 (F. Colombi); 0000-0002-9365-0242 ing our methodology within this emerging line of work
(C. Strapparava)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License and clarifying how our approach compares and contrasts
Attribution 4.0 International (CC BY 4.0).
with prior operationalizations.
toregressive nature, where each word is predicted based
on the preceding context. This mechanism inherently
mirrors aspects of the human predictive processing in
language comprehension, making it a suitable ground for
examining whether priming emerges from the model’s
output.</p>
      <p>Motivations. This foundational framework informs
the present study, which investigates whether similar
semantic priming efects manifest in large language
models (LLMs) like GPT-4o. By comparing the probabilistic
output of the model in related and unrelated prime-target
conditions, this research explores whether LLMs exhibit Research Question and Hypotheses. The present
cognitive-like patterns of semantic association, bridging work proposes to investigate whether LLMs, such as
GPTcomputational modelling with traditional psycholinguis- 4o1, exhibit semantic priming efects similar to those
tic paradigms. The motivation behind this study stems observed in human cognition, exploring if semantic
asfrom a broader interest in cognitive modelling using AI. sociations emerging from their probabilistic outputs
reThese systems ofer a convenient starting point for mod- lfect transferable cognitive mechanisms. This research is
elling and exploring human language processing due to situated within a growing field that compares AI to
hutheir architecture and training on vast amounts of linguis- man cognition, exploring parallels and divergences. The
tic data. A critical question is whether the behaviours aim is to assess whether the model not only reflects
simthey exhibit are unique to their training processes or if ple statistical learning but also develops semantic
structhey mirror transferable cognitive mechanisms inherent tures resembling human semantic networks. In other
to human language processing. Understanding this could words, the goal is to determine whether the
autoregrescontribute to the debate of whether LLMs merely reflect sive behaviour of the model generates priming efects
statistical learning or if they approximate the cognitive comparable to those observed in traditional psychological
structure that governs human semantic memory. Neural paradigms. Therefore, the research question we propose
networks like GPT are trained on massive datasets, cap- is the following: Does GPT-4o mini model exhibit a
signifturing statistical regularities, co-occurrence patterns and icant diference in the probability values of target words
semantic relationships present in human language. While when presented in related priming conditions compared
these models are not biological in nature, the structured to unrelated conditions?
statistical patterns they learn often mimic human-like
associations. This raises intriguing questions: do these Expected Outcomes. It is hypothesized that targets
models, through exposure to language data, develop se- will exhibit higher probabilities values in the related
conmantic networks akin to those observed in the human dition compared to those presented in unrelated
conbrain? And if so, can they serve as valid proxies for study- ditions. This structure allows for the investigation of
ing cognitive processes like semantic priming? Beyond whether the emergent cognitive traits of LLMs can be
theoretical interests, there are significant practical ap- considered analogous to the dynamics of human
semanplications to this line of inquiry. These systems could tic memory and whether traditional psycholinguistic
be employed to predict and model human behaviours paradigms can be employed to evaluate the validity of
in various linguistic tasks, providing a new tool for psy- these models as devices for cognitive research.
cholinguistic research. Moreover, understanding how
closely they align with human cognitive processes could
inform the refinement of AI architectures, enabling the 2. Methodology
development of models that better capture human-like
semantic organization. GPT-4o is a state-of-the-art (SOTA) In autoregressive systems as GPT-4o, text generation
model in numerous linguistic domains, including natural is fundamentally modelled as a conditional probability
language understanding, text generation, translation and problem. The model predicts the next word in a sequence
dialogue systems. Its ability to produce highly coherent, based on the preceding context, represented
mathematihuman-like linguistic artifacts makes it an ideal candidate cally as
for investigating semantic priming efects. Beyond the  (|1, 2...−1 ) (1)
mere scarcity of experiments on priming, there remains where  () is the probability of generating a word
a broader and more fundamental question: To what ex- given the previous ones. This probabilistic framework
tent do LLMs, particularly closed-source models, exhibit underpins how the model processes language and
genersemantic processing mechanisms that align with human ates outputs, making it a suitable foundation for
invespsycholinguistic assessments? While extensive research tigating semantic priming efects. In the context of this
has been conducted on model performance and
generative capabilities, little is known about whether their 1The experiment was run with GPT-4o mini. However, we will often
response to such assessments parallel those reported in refer to it as GPT-4o or GPT throughout the text. This is just to
human. This is particularly relevant given GPT-4o’s au- make reading as smooth as possible.
experiment, the target word is presented after a prime
that is either semantically related or unrelated. To assess
whether GPT-4o exhibits priming efects, the following
contrast was applied
 (|_)</p>
      <p>In this experiment, GPT-4o mini was presented with
prime-target pairs, where the prime word was either
semantically related or unrelated to the masked target
word embedded within a sentence. For each trial, the
model received a prompt consisting of the prime followed
by a sentence with the target word omitted and was
instructed to generate a single word to fill the blank.
  (|_) (2)</p>
      <sec id="sec-1-1">
        <title>Stimuli Presentation. The stimuli were presented to</title>
        <p>If semantic priming is present, the model should as- GPT through 500 structured API calls designed to
simusign a higher probability to the target word in the related late an experimental paradigm of cognitive psychology.
condition, reflecting an internal representation of seman- Each stimulus consisted of a prime word (semantically
tic association similar to those of humans. GPT models related or unrelated to the target) and a sentence
conoutput not only the predicted tokens but also the log- taining a masked target word. The API was configured
probabilities (log-probs) associated with each token to prompt the model with both the prime and the
incomplete sentence as input text: [Prime Word]. [Sentence
with the target masked as ". . . "].
() = [ (|1, 2, ..., −1 )]</p>
        <p>(3)</p>
        <p>A log-prob closer to 0 indicates a higher predicted
probability, while more negative values indicate lower
confidence in the prediction. In this experiment, we use
log-probs to quantify the model’s confidence in
predicting the target word. Thus, semantic priming is
operationalized as
() &gt; ()</p>
        <p>(4)</p>
        <sec id="sec-1-1-1">
          <title>2.1. The Experiment</title>
          <p>Our operationalization of priming diverges from the
maybe more familiar formulation of computing prim- For example, in a related condition, the prime
“being as the diference in the log probability of a fixed low” may precede the sentence “The Ferrari finished six
target given congruent versus incongruent primes [6, 7] places . . . the Mercedes”, where the target is ”above”. In
because we aim to isolate semantic activation in contexts the unrelated condition, the same sentence would be
where the is not trivially predicable and to control for preceded by an unrelated prime such as “postage”. This
context-dependent insertion efects. In particular, the structure allowed for direct comparison of the model’s
ifll-in-the-gap setup we use allows us to: (i) position predictions across priming conditions. To ensure
conthe target in a controlled environment so that its activa- trolled responses, the model was provided with a system
tion can be assessed relative to a specific semantic cue instruction to return a single-word completion for the
(the prime), and (ii) avoid conflating efects due to target masked portion of the sentence. The temperature was set
salience or surface-form predictability that a straight- to zero to minimize randomness and enforce
determinisforward target-diference formulation might implicitly tic outputs, and finally log-probs were requested for the
include. We evaluated the design quantitatively and en- predicted token, together with the top 15 alternatives.
sured that it produces a signal consistent with priming as
a contextual modulation of likelihood, without relying on Retrieval of Log-Probabilities. Log-probs provide
the assumption that the target sentence is equally well- an exhaustive measure of the model’s confidence in
preformed or equally predictable across conditions. Concep- dicting a given token because they reflect the
probatual comparisons suggest that our pipeline captures the bility distribution over multiple possible continuations,
same directional priming influence while ofering control rather than just the most likely one. They allow for a
over the insertion context and over cases where native nuanced comparison of how strongly the model favours
target continuity would otherwise introduce ambiguity. certain predictions, making them particularly useful for
A schematic of the pipeline and an illustrative example assessing semantic priming efects. However, retrieving
are provided below. log-probs for the intended target posed a computational
challenge due to the tokenization structure of GPT
outputs, requiring a sophisticated reconstruction algorithm.</p>
          <p>When GPT generates a response, it predicts the single
most likely token (i.e., the actual completion), but it can
also return log-prob values for multiple alternative
predictions—if explicitly requested in the API call. These
values are stored in a structure that contains the
predicted token along with a ranked set of alternatives, each
associated with its probability. An additional
complication arose because GPT often predict sub-word units,
meaning that a target word might be split into multiple
tokens2. Such level of complexity necessitated a
reconstruction system capable of piecing together each “brick”
to retrieve the log-probability of the intended word. The
retrieval system operated by matching the original
target word against the set of alternative completions of
the model. If the target appeared in its entirety among
the predictions, its associated log-prob was directly
extracted. Conversely, when the model provided sub-word
tokens, a beam search strategy was employed to
reconstruct the word step-by-step. At each stage, candidate
sequences were expanded by adding predicted tokens,
ensuring that only those maintaining a valid
morphological match with the target were retained. Once a valid
reconstruction was found, the sum of the probabilities of
constituent tokens was computed, and the least negative
candidate (i.e., the most probable one) was selected as
the best match. Where no reconstruction matched the
original target, no log-prob was assigned (NaN), leaving
its interpretation for later stages of analysis.</p>
          <p>Data Construction. The stimuli set was built
following previous research [8] and was designed to ensure
that semantic associations were robustly controlled. A
total of 250 triplets (target, related prime, unrelated
prime) were selected from the Semantic Priming Project
(SPP), a widely used database containing highly validated
prime-target association from human behavioural
studies. The rationale behind using SPP was its empirical
grounding—these prime-target pairs have been
extensively tested in psycholinguistic experiments, making
them an ideal starting point for evaluating whether LLMs,
like GPT, exhibit cognitive processes akin to those
observed in human behavioural tests. Given that GPT is
trained on massive linguistic corpora, it has probably
internalized complex semantic structures, making it a
suitable model for priming-based investigations. To
construct the experimental dataset, the following procedure
was applied:</p>
          <p>1. Selection of prime-target pairs:
2All GPT models leverage a Byte Pair Encoding (BPE) tokenizer,
which allows for flexible and semantically complete processing of
linguistic data
2. Pairing process:
3. Contextual sentence construction:
• A randomly chosen prime-target pair was</p>
          <p>selected from SPP in the related condition.
• The corresponding prime-target pair was
selected to contrast with the related
condition.
• Only first-associate (most common) target
was considered, ensuring strong semantic
links for the related condition.
• Each related and unrelated prime was
paired with the same target word,
creating a contrastive pair.
• A sentence was invented to serve as a
con</p>
          <p>textual frame for the target word.
• The target word was removed from the
sentence and replaced with a placeholder
("...") creating a fill-in-the-blank format for
the model.
• The entire dataset was stored in a
structured tabular format, with each stimulus
set organized as follow.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>4. Tabular data representation:</title>
        <sec id="sec-1-2-1">
          <title>2.2. Statistical Testing</title>
          <p>To determine whether GPT-4o exhibits semantic priming
efects, a statistical approach was designed to compare
the log-probabilities of target words across related vs.
unrelated priming conditions. Since log-probs are
continuous numerical values, they provide a measure of the
model’s confidence in predicting a given word, making
them suitable for inferential statistical analysis. The key
objective of this analysis was to assess whether log-probs
were significantly higher (closer to 0) in the related
condition compared to the unrelated condition, mirroring
the facilitatory mechanism observed in human priming
studies. Given the paired nature of the data—where each
target word appears in both conditions with the same
sentence context—the statistical analysis was designed
to compare log-probs at the within-item level. Statistical
tests often require that data distribution meets certain
Multiple Imputation Approach. The first strategy
involved multiple imputation, a statistical technique that
estimates missing log-probs based on the distribution
of observed data. Imputation is considered a reasonable
approach to retain a larger dataset while minimizing bias.</p>
          <p>Here, an assumption of near-random data missingness
had been adopted, although similar hypotheses are often
dificult to verify.
assumptions. Specifically, normality was a key consider- Multiple Imputation Results. Before conducting
hyation: if the distribution of log-probs followed a normal pothesis testing, missing values in log-probs were
adpattern, a paired t-test would be appropriate; if not, a dressed using multiple imputation (MI). Out of 500 total
Wilcoxon signed-rank test, a popular non-parametric al- observations, 201 (40%) were missing, requiring
imputernative, would be used instead. Following this strategy, tation to allow for a complete dataset. Five imputed
an initial assessment of normality was planned, ensur- datasets were generated using a multivariate imputer
ing that the choice of statistical test was applied ad-hoc, that estimates each value from all the others. Pooled
rather than arbitrary. This decision was crucial because estimates were finally derived. To assess how
impulog-probs are inherently skewed measures, often concen- tation afected the distribution of log-probs, summary
trated around certain thresholds, and the dataset was statistics were calculated before and after imputation.
expected to contain NaN values where the model failed The only relevant variation is over standard deviation
to predict (or the retrieval algorithm failed to recompose) (std). To determine whether a parametric test or a
nonthe target word. To maintain statistical rigor, missing parametric alternative was appropriate, normality of the
values would be handled through imputation, but this imputed log-probs was assessed using the Shapiro-Wilk
step also had the potential to afect normality, requiring test. This evidenced a significant departure from
nora flexible approach. mality ( = 0.891,  &lt; 0.05) indicating that a
nonparametric test was required for hypothesis testing. A
Wilcoxon signed-rank test showed that there is no strong
evidence that GPT-4o mini assigned significantly higher
log-probs to targets in the related condition vs. the
unrelated condition ( = 441.0,  = 0.088). This contrasts
with expectations, as human studies typically show a
clear priming efect in reaction times and lexical decision
tasks.</p>
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>Complete-Case Results. The complete-case analy</title>
        <p>Complete Case Analysis. Precisely because it is difi- sis was conducted using only full retrieved prime-target
cult to determine with certainty whether the data is miss- pairs, ensuring that all statistical comparisons were based
ing for largely random reasons, it is also useful to perform on directly observed data. Out of 500 total trials, 298
logthe test on the dataset without imputation. Therefore, prob values were successfully retrieved, but only 127
the second approach involved analysing the subset of contrastive pairs could be reconstructed for direct
comthe results where log-probabilities for each condition parison. This represents a substantial reduction in sample
were reconstructed. Both approaches were then tested size, which afects statistical power but ensures that no
following the statistical decision tree: if normality was assumptions were made about missing values.
Congrupreserved, a paired t-test would be applied; if not, the ently to what was done with imputed data, a normality
Wilcoxon signed-rank would be used instead. assessment was conducted to confirm a strong deviation
from normality ( = 0.789,  &lt; 0.05). Since
normality assumption was violated, a Wilcoxon signed-rank
3. Results test was conducted to compare the survived log-probs.
Unlike multiple imputation, the complete-case yielded a
significant result (  = 1793.0,  &lt; 0.05). This provides
evidence that GPT-4o mini exhibits a semantic priming
effect, with significantly higher log-probabilities for target
words in related conditions than in unrelated conditions.</p>
      </sec>
      <sec id="sec-1-4">
        <title>The aim of this results section is to determine whether</title>
        <p>GPT-4o mini exhibits semantic priming efects, measured
as diferences in log-probabilities of target words in
related vs. unrelated priming conditions. Given the
presence of missing—cases where the experiment failed to
generate the expected target word—two complementary
analytical approaches were adopted. Summarizing from 4. Discussion
the previous section: (a) Multiple Imputation, which
estimates missing values to maintain the statistical power, The findings of this study ofer an interesting
perspecand (b) Complete-Case Analysis, which restricts the tive on the challenges of using LLMs in cognitive
moddataset to instances where log-probs were successfully elling. While complete-case analysis detected a
signifiretrieved in both conditions, ensuring pairwise compar- cant priming efect, the multiple imputation approach did
isons. not, raising important methodological and conceptual
inquiries. The discussion is divided into two sections:
(a) methodological considerations, focusing on missing
data challenges, tokenization artifacts, statistical sensi- retrieval was the format of the model’s output, which
tivity, and potential imputation biases that may have returns a ranked list of predicted tokens along with their
influenced the results and (b) conceptual implications, log-probs. In cases where the model generated the target
addressing whether LLMs exhibit cognitive-like prim- as a single token extraction was straightforward.
Howing, how predictive mechanisms compare to biological ever, when the model split the target across multiple
tosemantic encoding and retrieval and what these findings kens, its overall log-prob had to be reconstructed from its
mean for cognitive modelling. individual components—a process that introduces
uncertainty. To tackle this challenge, a beam search algorithm
4.1. Methodological Considerations was implemented to iteratively reconstruct multi-token
targets from the list of predicted sub-word tokens. While
Handling Missing Data beam search improved reconstruction, it also introduced
potential artifacts: (a) some reconstructions may not have
perfectly matched the intended target, leading to
incorrect log-prob values, and (b) certain targets may have
been tokenized inconsistently. If tokenization patterns
difered systematically between conditions, this could
have biased log-prob retrieval, introducing a confound.</p>
        <p>In this experiment, a critical methodological challenge
was posed by missing data—40% of the log-prob
values—requiring the use of multiple imputation to
reconstruct a complete dataset. MI is generally preferred over
list-wise deletion, as it preserves statistical power by
estimating missing values based on the observed
distribution. However, when such a substantial portion of data
is missing, MI may not fully recover the real distribution, Statistical Sensitivity and Priming Detection. That
raising questions about representativeness. One conse- being said, divergent findings in MI and complete-case
quence is the arousal of variance compression in log- results likely arise from two interrelated factors: (a)
variprobs values, testified by a shrink in standard deviation. ance compression introduced by imputation, which may
This phenomenon likely occurs predicting missing val- have diluted the contrast between related and unrelated
ues based on observed ones, pulls extreme values toward conditions, and (b) tokenization and reconstruction
inthe mean. While this can stabilize estimates in smaller consistencies, which could have added noise to log-prob
datasets, it may have unintentionally smoothed meaning- retrieval, particularly in cases where targets were split
ful variability in the log-probs, afecting true distribution. into multiple tokens. The takeaway is that priming
sigIndeed, normality test showed a significant departure nal drawn from next-word probability retrieval in LLMs
from normality after imputation was performed. Since may be relatively weak, making it overtly susceptible to
semantic priming efects are often subtle, any reduction distortions introduced by data pre-processing.
in variance could have diminished the contrast between
related and unrelated conditions, thereby weakening the 4.2. LLMs and Cognitive Modelling
observable efects. This is consistent with the Wilcoxon
test result in the MI dataset, whereas the complete-case The methodological considerations discussed so far
analysis did detect a significant efect. The divergence demonstrated how data pre-processing choices and
tokbetween imputed and complete-case results raises an im- enization can influence statistical sensitivity in LLM
cogportant methodological question: did MI impoverish the nitive experiments. However, these findings also raise
priming efect, preventing statistical detection, rather than deeper conceptual questions: To what extent do LLMs
recover lost information? If the missing data was missing exhibit semantic priming efects comparable to those
obnot at random (MNAR)3 but instead systematic then MI served in human cognition? And if LLMs capture
statisticould have incorrectly smoothed meaningful distinctions, cal relationship between words, does this also means that
masking an efect that was present in the raw data. they can replicate the cognitive mechanisms underlying
human semantic memory? To answer such questions, it
Tokenization and Target Reconstruction Bias. A is possible to draw insights from the two dominant
theosignificant challenge in the experiment was retrieving retical frameworks that have shaped our understanding
log-probabilities for target words due to GPT’s sub-word on semantic processing: spreading activation theory, as
tokenization. Like other transformer models, it does not already presented in the introductory section and in the
always generate words as units, instead break less fre- predictive coding theory (Friston, 2005). These models
ofquent or morphologically complex words into multiple fer diferent perspectives on how the brain organizes and
sub-word tokens via BPE. This posed a serious obsta- retrieves meaning and comparing findings from present
cle to probability extraction. Further complicating word work allows to assess the extent to which LLMs
approximate cognitive mechanisms. The rest of this section
reflects on these themes.</p>
      </sec>
      <sec id="sec-1-5">
        <title>3Unfortunately, there is no surefire way to determine in which cat</title>
        <p>egory data will fall. Random missingness is an assumption that
need to be made based upon direct knowledge of the data and its
collection mechanisms.</p>
        <p>Spreading Activation, Semantic Memory and LLMs ing token selection within the fixed-parameters of the
The spreading activation theory (Collins &amp; Loftus, 1975) trained model. This means GPT does not actively
minisuggests that semantic memory is structured as a network mize uncertainty over time. The experimental findings
of interconnected concepts, where activation spreads support this distinction. In human coding models,
primfrom one node (a word/concept) to related nodes based on ing efects are expected to persist across diferent noise
semantic similarity and association strength. This model conditions because the brain continuously adjust its
prohas been widely supported by human psycholinguistic cessing. In contrast, the fragility of GPT’s mechanisms
studies. The priming efects detected in the complete-case suggests that the models lack a hierarchical learning
proanalysis seems to align with spreading activation frame- cess that adapts to uncertainty over time. This highlights
work. LLMs, much like human semantic memory, links a fundamental limitation of LLMs: while they
approxiconcept by encoding statistical co-occurrence patterns be- mate prediction-driven behaviours, they do not engage in
tween words—though they do it on a considerably larger error-driven learning during inference, a key component
scale. However, while human priming efects are driven of human cognition. As a result, while priming in LLMs
by neural activation spreading across conceptual net- may superficially resembles predictive coding, it does not
works, GPT does not store explicit semantic structures, it capture the adaptive mechanisms that govern biological
instead predicts word based on learned probability distri- semantic memory. The results of this study highlight an
butions. This distinction is crucial: in human cognition, ongoing debate in cognitive modelling: to what extent
spreading is dynamically modulated by context, prior ex- do LLMs exhibit cognitive-like processing? The presence
perience, and attentional control, whereas LLMs’ priming of a priming efect suggests that. LLMs capture
meanemerges from purely statistical dependencies in language ingful relationships between words, much like spreading
data. Current results suggest that semantic priming ef- activation models, but the disappearance of this efect in
fects in GPT do not necessarily indicate cognitive-like the imputed dataset suggests that LLMs’ priming is more
concept retrieval. The observed priming efect is likely fragile than human priming. Together, these findings
a by-product of training, rather than a direct parallel to give the impression that LLMs do not simulate human
human conceptual activation. Additionally, the lack of cognition in a mechanistic sense. Instead, they exhibit
a significant efect in MI dataset further challenges the statistical properties that resemble cognitive processes
idea that LLM-based priming mirrors human spreading at the output level but are not necessarily driven by the
activation dynamics. According to human experiments, same underlying computations.
priming efects persist despite noise or missing data
because activation propagates through associative memory
networks. In contrast, the weakening of priming in the
imputed dataset suggests a more fragile mechanism.</p>
        <p>Final Thoughts and Future Directions. We firmly
believe that while LLMs do not currently replicate
human semantic cognition, they ofer valuable tools for
modelling language-based associations. It is our
opinion that the presented approach may be improved and
extended:
Predictive Coding and the Mechanisms Underlying
Priming in LLMs. An alternative perspective for
understanding semantic processing is predictive coding
theory [9]. This model suggests that the brain functions as a
hierarchical predictive system, continuously generating
expectations about incoming sensory input and
minimizing prediction errors by adjusting internal models. In this
framework, priming occurs because a related prime
reduces the uncertainty (prediction error) associated with
recognizing the target, leading to faster processing. LLMs,
particularly autoregressive models like GPT, operate in a
manner structurally similar to predictive coding. They
generate words one at a time, updating predictions based
on past context. This aligns with the core principle of
predictive coding. The log-probabilities extracted in this
study measure the system’s internal prediction certainty,
making them conceptually analogous to prediction error
signals in the human brain. The critical diference is that
in biological brains, prediction errors lead to adaptive
training and belief updating, whereas in LLMs, prediction
errors do not modify the model in real-time—they rather
influence generation for a short time-window,
impact1. Target predictability: controlling for how
predictable a target word is in natural language using
frequency norms, surprisal values and
entropybased estimates. This would help disentangle
semantic priming from simple word predictability
in LLMs.
2. Word frequency efects: since high-frequency
words are easily predicted and low-frequency
words may be underrepresented in training data,
future experiments should systematically control
word frequency to determine its impact in
priming strength.
3. Contextual influence: LLMs process meaning
based on statistical co-occurrence within a fixed
context window, which may amplify or suppress
subtle priming efects. Future studies should
manipulate prime-target distance to assess if context
length and structural dependencies influence
results. Additionally, future research should explore
alternative token-matching strategies, ensuring
log-probs reconstruction does not systematically
fail with certain word structures. And finally,
it should be also considered if modifying LLM
architectures—for example, incorporating
mechanisms for hierarchical belief updating similar to
predictive coding models—would lead to more
cognitively plausible representations of meaning.</p>
      </sec>
      <sec id="sec-1-6">
        <title>Comparative studies relating neural language process</title>
        <p>
          ing signals (e.g., N400 efects) to outputs of LLMs have
been increasingly prominent. Heilbron et al. [
          <xref ref-type="bibr" rid="ref11 ref5 ref7">10, 11</xref>
          ]
demonstrated that predictability estimates produced
by deep neural language models (e.g., GPT-2)
correlate with EEG/MEG components—including N400 and
P600—during naturalistic comprehension, providing
direct evidence that model-derived surprisal signals track
human-like prediction dynamics. Subsequent work has
further refined the cognitive plausibility of
transformerbased models in this domain, showing that their
contextual predictions are closely aligned with neural
signatures of semantic facilitation and processing dificulty
[5]. While Futrell et al.[12] approach the question from a
complementary angle—treating neural language models
as psycholinguistic subject to probe their internal
syntactic representations—these strands jointly motivate our
efort to align LLM-based priming metrics with known
neural phenomena.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgments References</title>
      <sec id="sec-2-1">
        <title>Code Availability</title>
        <sec id="sec-2-1-1">
          <title>Code and data for reproducing the results are publicly available on GitHub at https://github.com/fico/ semantic-priming-in-LLMs</title>
        </sec>
        <sec id="sec-2-1-2">
          <title>We acknowledge the support of the PNRR project FAIR Future AI Research (PE00000013), under the NRRP MUR program.</title>
          <p>Declaration on Generative AI
During the preparation of this work, the author(s) used ChatGPT (OpenAI), Grammarly, and DeepL
Write / DeepL Translate in order to: Drafting content, Text translation, Paraphrase and reword,
Improve writing style, and Peer review simulation. After using these tool(s)/service(s), the author(s)
reviewed and edited the content as needed and take(s) full responsibility for the publication’s
content.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <volume>18653</volume>
          /v1/
          <year>2024</year>
          .findings-acl.
          <volume>877</volume>
          . [7]
          <string-name>
            <given-names>B.-D.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Schuler</surname>
          </string-name>
          , Leading whitespaces of lan-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>ceedings of the 2024 Conference on Empirical Meth-</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <year>2024</year>
          , pp.
          <fpage>3464</fpage>
          -
          <lpage>3472</lpage>
          . [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hutchison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Balota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Neely</surname>
          </string-name>
          , M. Cortese,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>ing project</article-title>
          ,
          <source>Behavior research methods 45</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>doi:10</source>
          .3758/s13428-012-0304-z. [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Friston</surname>
          </string-name>
          ,
          <article-title>A theory of cortical responses</article-title>
          , Philo-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          don. Series B,
          <source>Biological sciences 360</source>
          (
          <year>2005</year>
          )
          <fpage>815</fpage>
          -
          <lpage>836</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>doi:10</source>
          .1098/rstb.
          <year>2005</year>
          .
          <volume>1622</volume>
          . [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Heilbron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ehinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hagoort</surname>
          </string-name>
          , F. de Lange,
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>deep neural language models</article-title>
          , in: 2019 Confer-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>CCN</surname>
          </string-name>
          ,
          <source>Cognitive Computational Neuroscience</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          URL: http://dx.doi.org/10.32470/CCN.
          <year>2019</year>
          .
          <volume>1096</volume>
          -
          <fpage>0</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>doi:10</source>
          .32470/ccn.
          <year>2019</year>
          .
          <volume>1096</volume>
          -
          <fpage>0</fpage>
          . [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Meyer</surname>
          </string-name>
          , R. Schvaneveldt, Facilitation in recog- [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Heilbron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Armeni</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-M. Schofelen</surname>
          </string-name>
          , P. Ha-
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>tal psychology 90</source>
          (
          <year>1971</year>
          )
          <fpage>227</fpage>
          -
          <lpage>234</lpage>
          . doi:
          <volume>10</volume>
          .1037/ Proceedings of the National Academy of Sciences
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          h0031564.
          <volume>119</volume>
          (
          <year>2022</year>
          ). doi:
          <volume>10</volume>
          .1073/pnas.2201968119. [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Loftus</surname>
          </string-name>
          , A spreading activation theory [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Futrell</surname>
          </string-name>
          , E. Wilcox,
          <string-name>
            <given-names>T.</given-names>
            <surname>Morita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Qian</surname>
          </string-name>
          , M. Balles-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>of semantic processing, Psychological Review 82 teros</article-title>
          , R. Levy,
          <article-title>Neural language models as psycholin-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          (
          <year>1975</year>
          )
          <fpage>407</fpage>
          -
          <lpage>428</lpage>
          . doi:
          <volume>10</volume>
          .1037//0033-
          <fpage>295X</fpage>
          .
          <year>82</year>
          .
          <article-title>6. guistic subjects: Representations of syntactic state,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          407. in: Proceedings of the 2019 Conference of the North [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Neely</surname>
          </string-name>
          ,
          <article-title>Semantic priming and retrieval from lexi- American Chapter of the Association for Compu-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>cal memory: Roles of inhibitionless spreading ac-</article-title>
          tational
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          , Minneapolis, Minnesota,
          <year>2019</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>tivation and limited-capacity attention</article-title>
          ,
          <source>Journal</source>
          of pp.
          <fpage>32</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>