<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cognitive Modeling of Semantic Fluency Using Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Animesh Nighojkar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Khlyzova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Licato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Advancing Machine and Human Reasoning (AMHR) Lab, Department of Computer Science and Engineering, University of South Florida</institution>
          ,
          <addr-line>Tampa</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Can deep language models be explanatory models of human cognition? If so, what are their limits? To explore this question, we propose an approach called hyperparameter hypothesization that uses predictive hyperparameter tuning to find individuating descriptors of cognitive-behavioral profiles. We take the first step in this approach by predicting human performance in the semantic fluency task (SFT), a well-studied task in cognitive science that has never before been modeled using transformer-based language models (TLMs). In our task setup, we compare several approaches to predicting which word an individual performing SFT will utter next. We report preliminary evidence suggesting that, despite obvious implementational differences in how people and TLMs learn and use language, TLMs can be used to identify individual differences in human fluency task behaviors better than existing computational models, and may offer insights into human memory retrieval strategies – cognitive processes not typically considered to be the kinds of things TLMs can model. Finally, we discuss the implications of this work for cognitive modeling of knowledge representations.</p>
      </abstract>
      <kwd-group>
        <kwd>Transformer-based language models (TLMs)</kwd>
        <kwd>Semantic Fluency Task (SFT)</kwd>
        <kwd>human cognition</kwd>
        <kwd>semantic networks</kwd>
        <kwd>Word2Vec</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Two of the most important ideas underpinning contemporary cognitive science–and the closely
related AI subfield of computational cognitive modeling–are the suppositions that the human
mind uses cognitive structures and that progress in understanding the mind can come from
modeling those structures and the algorithms which operate on them. The semantic fluency task
(SFT), sometimes called the verbal fluency task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], is commonly employed in service of those
goals. In SFT, participants name as many items belonging to a particular semantic category
(animals, fruits, etc.) as they can in a fixed amount of time (typically 40-180 seconds). Despite
this task’s simplicity, the lists generated by participants (which we call semantic fluency lists or
SFLs) offer insights into the structure of human knowledge and the heuristics used for memory
retrieval. For example, words sharing semantic features tend to group in clusters, and there is
often a temporal delay before a participant switches from one cluster to another.
      </p>
      <p>
        Multiple approaches to computationally modeling behaviors in SFT have been proposed
[
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4, 5, 6</xref>
        ], most relying on graph-based representations in which words are represented as
nodes, and edges correspond to some meaningful semantic relationship between the nodes.
However, to date, no work has explored whether transformer-based language models (TLMs) can
be any better at modeling the generation of SFLs. And there are multiple reasons, at least from
an exploratory perspective, to suspect TLMs might do well in this regard, e.g.: (1) a large body
of literature demonstrates why semantic memory cannot be sufficiently represented purely
by fixed associative links between lexical nodes—at minimum, representations must allow for
dynamic role binding, hierarchical (or otherwise unidirectional) activations, and enough richness
to carry out structure-sensitive similarity assessments [7, 8]; (2) TLMs perform unexpectedly
well on human-oriented linguistic benchmarks [9], and they are typically pre-trained using a
lengthy process designed to embed deep semantic knowledge, resulting in a dense encoding
of semantic relationships [10]; (3) The pre-training process often proceeds by optimizing LMs
to perform well on the MLM (masked language modeling) task, which shares more than a
passing resemblance to the kind of word prediction that some researchers believe children are
performing [11]; and (4) TLMs tend to outperform other approaches in recent work modeling
human reading times, eye-tracking data, and others [12].
      </p>
      <p>Considered altogether, these reasons are sufficient to motivate an initial exploration into
TLM-based semantic fluency modeling. Our novel contributions include:
• We are the first, to our knowledge, to generate and model SFLs using TLMs; we use
RoBERTa-Large, DistilBERT, and miniBERTa-med-small in this paper to further the
state-of-the-art on modeling SFLs. Generally, our models significantly outperform more
traditional, semantic network-based approaches.
• We design two adaptive approaches that predict the next SFL item as they learn from
the previous items and turn out to be superior to other non-adaptive approaches. These
adaptive approaches, we believe, can serve as baselines against which to compare future
computational cognitive models.
• In a broader sense, we argue and demonstrate that TLMs, despite being pre-trained using
techniques and datasets very different from those that human beings use, can be powerful tools
for studying human cognition, knowledge representation, and memory retrieval. This is
a first step in a computational cognitive modeling strategy that we call hyperparameter
hypothesization (§1.1).</p>
      <p>Any performance on modeling SFLs discussed in this paper is for a pre-trained model with
no fine-tuning on the SFT. We do this because the objective of this work is not to learn how to
perform the SFT in the most precise way; it is to model human SFLs in an attempt to use the
best-performing hyperparameters to learn something about human cognitive traits.</p>
      <sec id="sec-1-1">
        <title>1.1. Deep Learning as Cognitive Model</title>
        <p>Cognitive Science has long benefited from computational cognitive models (CMs), which are
computational implementations of cognitive processes, at various levels of abstraction, created
typically in order to test theoretical claims [13]. Furthermore, because carrying out empirical
studies with people can involve difficult logistics, myriad confounding variables,
and prohibitive costs, the use of well-designed CMs can save psychologists an immense amount
of time and resources, e.g. by making it easier to test hypotheses about cognitive processes with
CMs prior to empirical work. However, there are fundamental hardware and implementational
differences between human brains and silicon-based electronics, raising the question: to what
extent can a CM support or refute a theory of human cognition? Although there is a longstanding
debate about the degree to which the algorithms used by a CM commit it to certain claims about
the cognitive phenomena it purports to model (e.g., see [14, 15, 16]), most agree that the level
of abstraction the CM represents does entail some ontological commitment [17, 18],1 much
like any other scientific theory can be said to model and explain something about the natural
world. In other words, if we want a CM to be able to teach us something about the human
mind, its design choices cannot be made arbitrarily because the way the model works must
have some correspondence to the cognitive process it purports to model. How then can massive
transformer-based language models, which are trained on large datasets using algorithms and
data structures that appear fundamentally different from those used by people, tell us something
about human minds?</p>
        <p>Our answer to this important question is brief: We propose a technique we call hyperparameter
hypothesization, the form of which goes as follows: If, for certain values of hyperparameters
ℋ: (1) a CM matches large amounts of human data significantly better than other models; (2)
the human data matched ranges across a variety of tasks given values of ℋ; and (3) all ℎ ∈ ℋ
have functional roles in the CM that reasonably align with functional roles known to exist in
human cognition; then we can reasonably use it to form a hypothesis about human cognition.
For example, suppose we have a TLM with a hyperparameter ℎ which restricts the amount of
information that the CM can consider simultaneously. We then find that certain values of ℎ
allow the CM to match human data on a cognitive task (e.g., SFT) much better than existing
models. We may then find that a similar range of values for ℎ allows the CM to match human
data on other cognitive tasks as well. This can allow us to reasonably hypothesize (but not
yet definitively conclude) that this range of values for ℎ corresponds to a similar range in
people—we might predict that it corresponds to the amount of working memory that people
typically have. This hypothesis can then be tested: we can observe how our CM performs on
values of ℎ lower than the optimal range and see whether its resulting behaviors align with
those of humans who are known to have lower working memory sizes.</p>
        <p>Although the above is only one example of how hyperparameter hypothesization may work,
its first step involves demonstrating that a certain type of CM can indeed match human data
significantly better than others. The remainder of this paper restricts its focus to that, specifically
on the semantic fluency task.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Prior attempts to model semantic fluency have largely been based on semantic networks: graph
representations where words are nodes and relationships between those nodes are
edges. At least since Collins and Quillian [19], semantic networks have been a common tool in
computational modeling [20, 6], typically using graph representations drawing from large-scale
databases such as WordNet [21], text corpora [22], and the USF free association norms [23].
The U-INVITE model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] reconstructs individuals’ semantic networks using a combination of
large-scale databases and semantic fluency data. (Footnote 1: A model that is ontologically committed tells us something about the real-world object or process that it is
meant to model, rather than simply matching the data that the object or process outputs.)
      </p>
      <p>
        Information obtained from analyzing semantic fluency lists (SFLs) can be used to construct
portions of semantic networks. But since the amount of data in SFLs is very small, it is more
common to instead obtain word association data from larger semantic datasets. For instance,
the USF Free Association Norms [23] is a free association dataset collected from more than 6,000
participants who were asked to write the first word w that came to their mind given a “cue
word” c. This dataset offers more than 72,000 word pairs (c, w) along with the percentage of
participants who wrote w given c. Zemla and Austerweil [5] used the USF norms to construct
a semantic network and simulate a variety of memory search processes, including the censored
random walk, whose simulations are compared to the results of the previously collected human
data in an SFT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Small World of Words (SWOW) [
        <xref ref-type="bibr" rid="ref5">24</xref>
        ] is a more recent word association
dataset offering more than 1.3M (c, w) pairs. SNAFU [
        <xref ref-type="bibr" rid="ref6">25</xref>
        ] is a tool for estimating semantic
networks and analyzing fluency data (including random walks); the authors provide a sample
SFT dataset called “SNAFU Sample”,2 gathered from 82 participants, that contains 796 lists
spanning 6 categories.3 In this paper, we try to model the SFLs from this dataset.
      </p>
      <p>
        Hills et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] compare the memory search process to the strategies animals use when
searching for food (optimal foraging). This includes a dynamic process of switching from local
search of a cluster of semantically similar items, to a global search when the difficulty of finding
an item nearby reaches a certain point. This process is called “patch switching”. To replicate the
dynamic process of switching between patches, the authors implemented a dynamic model that
used the previous item recalled and frequency to perform the switching. The model produced
a log-likelihood fit, which was then compared to the static models that ignored the patchy
structure of the network. The dynamic model showed better results, suggesting that humans
perform memory search using patch switching too.
      </p>
      <p>
        Kajic et al. [
        <xref ref-type="bibr" rid="ref7">26</xref>
        ] proposed a biologically-constrained spiking neural network model to produce
human-like SFLs. Three different sources of associative data, including the USF norms, were
used to construct association matrices for a neural network. To compare the results with
the human data, the authors recorded word responses as decoded vector representations and
inter-item response times between the adjacent retrieved words. The locality shown in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is
supported by the results of these experiments: the preceding word is the most similar to the
current word in a patch.
      </p>
      <p>
        A related task is Entity Set Expansion (ESE) [
        <xref ref-type="bibr" rid="ref8">27</xref>
        ], which takes a set of entities as input (rather than a category word, as in SFT) and tries to add new entities to that set after predicting a category to which all of those entities belong (an additional step that is absent in SFT). The fundamental difference between ESE and the work presented in this paper is that we are trying to model human SFLs instead of just generating SFLs. Some work has also been done to explore the information language models capture [
        <xref ref-type="bibr" rid="ref10 ref9">28, 29</xref>
        ], but we note that at present, the ability of TLMs to model semantic fluency has not been explored.
      </p>
      <sec id="sec-2-1">
        <p>(Footnote 2: https://github.com/AusterweilLab/snafu-py/blob/master/fluency_data/snafu_sample.csv)</p>
        <p>(Footnote 3: The number of items in each category is: fruits (60), vegetables (60), animals (296), supermarket items (81), tools (149), foods (150). The median list lengths are: fruits (18), vegetables (17.5), animals (34), supermarket items (35), tools (16), foods (36.5).)</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>To understand the extent to which TLMs can improve the modeling of SFLs, we set out to
establish baselines based on semantic networks, using word association data similar to the
approaches cited earlier, and comparing their performance to TLMs’. We use the SNAFU Sample
dataset, cleaning it to correct suspected data entry errors (like autocorrecting typos).</p>
      <sec id="sec-3-1">
        <title>3.1. Experimental Setup</title>
        <p>Assume participant p generates an SFL L = {l_1, ..., l_(|L|)} in response to category cue c
(animals, fruits, etc.). Given a function f based on an approach (described below) which takes a
context D_n (a list [c, l_1, ..., l_(n-1)] such that n ≤ |L|), applies some pre-processing to it, and
uses the underlying approach, can f predict l_n? We use two methods to describe and score f:
1. Coverage = |L ∩ V| / |L|, where V is the set of words considered by f while making
its predictions. We also define scaled log-likelihood within coverage as the log-likelihood
of each in-coverage item in L according to f. In other words, the scaled log-likelihood
reflects how likely the list L is to be generated by the function f. Since this is defined
only in coverage, it depends largely on coverage, and a better scaled log-likelihood does
not necessarily mean that a function is better.
2. Top-k accuracy is the percentage of times l_n is present in f’s top-k predictions. Top-k
accuracy is independent of coverage and thus, we can compare different functions based
just on their top-k scores.</p>
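        <p>As a concrete illustration, the two metrics can be computed as follows. This is a minimal Python sketch with an invented toy predictor and word list; the names are illustrative and not taken from our implementation.</p>
        <preformat>
```python
def coverage(sfl, vocab):
    """Fraction of the participant's list covered by the predictor's vocabulary."""
    hits = [w for w in sfl if w in vocab]
    return len(hits) / len(sfl)

def top_k_accuracy(sfl, predict, k=5):
    """Share of items l_n that appear in the predictor's top-k guesses
    given the context [l_1, ..., l_(n-1)]."""
    correct = 0
    for n, target in enumerate(sfl):
        guesses = predict(sfl[:n])[:k]
        correct += int(target in guesses)
    return correct / len(sfl)

# Toy predictor: always ranks the same fixed list, skipping already-said words.
ranked = ["dog", "cat", "horse", "cow", "lion", "tiger"]
predict = lambda context: [w for w in ranked if w not in context]

sfl = ["cat", "dog", "lion", "zebra"]
print(coverage(sfl, set(ranked)))        # 0.75 ('zebra' is out of vocabulary)
print(top_k_accuracy(sfl, predict, k=5)) # 0.75
```
        </preformat>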
        <p>
          For both metrics, function f_1 is said to model human performance better than f_2 if it has a
higher score. We create multiple functions based on each of the following approaches, differing
in hyperparameter values. We lemmatize the predictions and only keep nouns (due to the
category words given) for these approaches. For simplicity, we also assume that a word will
never occur twice in the same SFL.
3.1.1. Baseline Approaches (Non-TLM based)
We use five non-TLM based approaches as baselines:
1. Random Baseline: We use a dataset of the 1/3M most frequent unigrams (single words) on
the internet4 to find the frequency with which unigrams and bigrams occur. The most
likely predictions are chosen from this weighted distribution of unigrams and bigrams,
with the top-k predictions being the top-k most common words.
2. Random Walk on USF Norms: We approximate the censored random walk algorithm [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
on the USF Free Association Norms [23]. P(w|c) is the number of times c was the cue
word and w was the response, divided by the total number of times c was the cue word.
        </p>
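        <p>The transition probabilities and the censoring step can be sketched as follows; the cue–response pairs here are invented toy data standing in for the USF Norms, and the function names are ours, not from any released code.</p>
        <preformat>
```python
import random
from collections import defaultdict

# Toy (cue, response) pairs standing in for the USF Norms.
pairs = [("fruit", "apple"), ("fruit", "apple"), ("fruit", "banana"),
         ("apple", "banana"), ("apple", "pear"), ("banana", "apple"),
         ("pear", "banana")]

counts = defaultdict(lambda: defaultdict(int))
for cue, resp in pairs:
    counts[cue][resp] += 1

def p(resp, cue):
    """P(resp | cue): times resp followed cue, over all responses to cue."""
    total = sum(counts[cue].values())
    return counts[cue][resp] / total if total else 0.0

def censored_walk(start, steps, seed=0):
    """Random walk that emits each node only on its first visit (censoring repeats)."""
    rng = random.Random(seed)
    node, out = start, []
    for _ in range(steps):
        options = list(counts[node])
        nxt = rng.choices(options, [p(r, node) for r in options])[0]
        if nxt not in out:
            out.append(nxt)
        node = nxt
    return out

print(p("apple", "fruit"))  # 2/3
```
        </preformat>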
        <p>Coverage is determined by words that were responses to  in the USF Norms.
4https://www.kaggle.com/datasets/rtatman/english-word-frequency
3.1.2. TLM-based approaches
We discussed several reasons behind the intuition to use TLMs to model human SFLs in §1. We
perform the MLM task (§1) on pre-trained TLMs using empirically generated prompts for a
category  and a context size :
1. The − 1− , ..., the − 1, and the [MASK] are examples of Cs.
2. Examples of Cs are the − 1− , ..., the − 1, and the [MASK].
3. The − 1− , ..., the − 1, and the [MASK] are the first Cs that come to my mind.
4. The first Cs that come to my mind are the − 1− , ..., the − 1, and the [MASK].
Most of these prompts have the word ‘the’ preceding all the SFL items because, without it, TLMs
tended to predict stopwords much more often in our preliminary experiments. Context sizes
ct = 0, 1, 3, 5 are tested, as with Word2Vec and GloVe. Each TLM-based function difers in ct
prompt pair, giving us 56 functions for each TLM.</p>
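        <p>The prompt construction can be sketched as follows; this is a simplified illustration, and the exact surface strings used in our experiments may differ slightly.</p>
        <preformat>
```python
def build_prompts(category, context, mask="[MASK]"):
    """Fill the four prompt templates with the context items l_(n-ct) ... l_(n-1).

    Each item is preceded by 'the', which we found reduces stopword predictions.
    """
    parts = ["the " + w for w in context] + ["the " + mask]
    if len(parts) == 1:
        listed = parts[0]
    else:
        listed = ", ".join(parts[:-1]) + " and " + parts[-1]
    cap = listed[0].upper() + listed[1:]
    cs = category + "s"
    return [
        cap + " are examples of " + cs + ".",
        "Examples of " + cs + " are " + listed + ".",
        cap + " are the first " + cs + " that come to my mind.",
        "The first " + cs + " that come to my mind are " + listed + ".",
    ]

print(build_prompts("fruit", ["strawberry"])[1])
# Examples of fruits are the strawberry and the [MASK].
```
        </preformat>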
        <p>Each of these TLMs splits the input prompts into tokens such that more than one token is
required to encode some words (for example, ‘blueberry’ is encoded by RoBERTa using two
tokens). We use a greedy strategy to allow our functions to predict such words. Since one
mask is insufficient to predict some words, we also use two consecutive masks for the TLM
to allow subwords for each of those masks. A prompt with ct = 1 would look like “Examples
of fruits are the strawberry and the [MASK][MASK].” We take the top 100 predictions the TLM
made for the first mask and pass a new prompt with each of those predictions replacing the
first mask to get 15 predictions for the second mask. Each TLM outputs a softmax distribution
over all its tokens corresponding to the mask token. After choosing the top 15 predictions,
we scale their probabilities to add up to 1. These probabilities are multiplied by the previous
prediction’s probabilities to get a valid probability distribution for the two-mask sequence. Since
our function does not know the word we are trying to predict, we generate 3000 one-mask, 1500
two-mask, 400 three-mask, and 100 four-mask predictions for each function (these values were
chosen to balance search space size and computation time based on preliminary tests; their
effect on performance was minimal because we report top-1 and top-5 scores and these values
are well over that range). Since the cumulative probabilities of these four sets of predictions add
up to 4, we scale them based on how frequently these words occur in the dataset of the 1/3M most
frequent words on the internet (note that this is an estimate used to weight the predictions).</p>
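        <p>The renormalize-and-multiply step for two-mask predictions can be sketched as follows; the softmax outputs here are invented toy values, not real TLM outputs, and the helper names are illustrative.</p>
        <preformat>
```python
def top_scaled(dist, k):
    """Keep the k most probable continuations and rescale them to sum to 1."""
    top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in top)
    return {tok: prob / total for tok, prob in top}

def two_mask_probs(first_dist, second_dist_for, k=15):
    """Joint distribution over two-token words: rescaled first-mask
    probabilities times rescaled second-mask probabilities."""
    joint = {}
    for tok1, p1 in top_scaled(first_dist, k).items():
        for tok2, p2 in top_scaled(second_dist_for(tok1), k).items():
            joint[(tok1, tok2)] = p1 * p2
    return joint

# Toy softmax outputs (assumed values, not from a real model).
first = {"blue": 0.5, "black": 0.3, "rasp": 0.2}
second = lambda tok1: {"berry": 0.9, "bird": 0.1}
joint = two_mask_probs(first, second, k=2)
print(round(joint[("blue", "berry")], 4))  # 0.5625
```
        </preformat>
        <p>Because each per-mask distribution is rescaled before multiplying, the joint probabilities over the retained two-token sequences again sum to 1.</p>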
        <p>
          The TLMs we use in this paper are DistilBERT-base-uncased [
          <xref ref-type="bibr" rid="ref13">32</xref>
          ], RoBERTa-Large [
          <xref ref-type="bibr" rid="ref14">33</xref>
          ], and
RoBERTa-Med-Small-1M-2 (commonly known as the smallest miniBERTa) [
          <xref ref-type="bibr" rid="ref15">34</xref>
          ]. The models
differ in architecture, size, and, perhaps most importantly, amount of pre-training data. The
smallest miniBERTa is pre-trained on just 1M words, DistilBERT-base-uncased is a smaller
version of BERT [
          <xref ref-type="bibr" rid="ref16">35</xref>
          ] pre-trained on about 3.4B words, and RoBERTa-Large is pre-trained on
approximately 34B words. Since pre-training is the only training these models get before using
them in our functions, we can hypothesize that less pre-training data (miniBERTa) might lead
to poorer performance.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experiment 1: Which approach is the best at modeling SFLs?</title>
        <sec id="sec-3-2-1">
          <title>How well do our TLM-based approaches model human SFLs?</title>
          <p>How well do our TLM-based approaches model human SFLs, compared to non-TLM-based approaches? And do the different hyperparameters make the approaches better? Furthermore, if certain approaches and hyperparameter values effectively model human SFLs, do they tell us anything about human cognitive traits? Our experimental setup is designed to be a comparative study: we record the performance (coverage, scaled log-likelihoods, and top-k accuracies) of each of our functions (from all approaches) for each SFL in SNAFU Sample (§2). Let L be the set of SFLs in SNAFU Sample, and let s(f, l) denote the score (on any metric) of function f on SFL l. Let F be the set of all functions for an approach. Since we tested a wide range of hyperparameter value settings (functions) for each approach, we define and report the approach average (Avg), best overall (BO), and best individual (BI) scores as defined by Equations 1, 2, and 3:
Avg = (1 / (|L| |F|)) ∑_(l ∈ L) ∑_(f ∈ F) s(f, l)   (1)
BO = (1 / |L|) max_(f ∈ F) ∑_(l ∈ L) s(f, l)   (2)
BI = (1 / |L|) ∑_(l ∈ L) max_(f ∈ F) s(f, l)   (3)
3.2.1. Results
The top part of Table 1 shows the performance comparison of all approaches. The random baseline has high coverage because its search space is the 1/3M most common words on the internet. RoBERTa-Large5 proves to be the best-performing approach when generalized across all users, closely followed by DistilBERT. miniBERTa is outstandingly poor, performing worse than all baselines, supporting our hypothesis that more pre-training data leads to better performance.</p>
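          <p>Equations 1, 2, and 3 can be made concrete with a small score table; the values below are invented purely for illustration.</p>
          <preformat>
```python
# s[f][l]: score of function f on SFL l (toy values).
s = {
    "f1": {"listA": 0.2, "listB": 0.6},
    "f2": {"listA": 0.5, "listB": 0.1},
}
lists = ["listA", "listB"]
funcs = ["f1", "f2"]

# Avg: mean over all function-list pairs.
avg = sum(s[f][l] for f in funcs for l in lists) / (len(funcs) * len(lists))
# BO: the single function with the best mean score across all lists.
bo = max(sum(s[f][l] for l in lists) for f in funcs) / len(lists)
# BI: for each list, take the best function's score, then average.
bi = sum(max(s[f][l] for f in funcs) for l in lists) / len(lists)

print(round(avg, 2), round(bo, 2), round(bi, 2))  # 0.35 0.4 0.55
```
          </preformat>
          <p>Note that BI is always at least as large as BO, since BI may pick a different function for every list.</p>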
          <p>On an aggregate level, TLM-based approaches outperform non-TLM-based approaches. Best
Overall (BO) reports the average scores of the function (approach - hyperparameters
combination) that performed the best across all lists. Best Individual (BI ) picks the best function
for each SFL and reports the average scores of this group of functions. BO and BI show the
performance of an approach if we were able to choose the best hyperparameter setting for</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>5To avoid confusion, TLM names are in regular font and approach names are in italics</title>
          <p>Approach
Random</p>
          <p>USF
SWOW
Word2Vec</p>
          <p>GloVe
miniBERTa
DistilBERT
RoBERTa
Best Overall (BO) coverage and top-5 accuracy (%) on SFL categories. The scores for the best performing
approach on each category are in bold.
those approaches. Avg, BO, and BI are the same for approaches which do not have diferent
hyperparameter values (multiple functions). Note that best individual (BI ) is just a theoretical
upper limit, as it assumes that we choose the function which is the best for a particular SFL
without any means of knowing which one that might be. Thus, practical comparisons must be
made with BO instead of BI.</p>
          <p>
            In order to get a deeper understanding of the strengths (and weaknesses) of TLM-based
approaches, we look at how they perform on each individual category (Table 2). We report
coverage and top-5 accuracy, which is a common metric for cases where the number of classes
is either large or not strictly defined (each SFL category can have many items belonging to
it) [
            <xref ref-type="bibr" rid="ref17 ref18">36, 37</xref>
            ] and is more lenient than top-1 accuracy. We can see from Table 2 that SWOW performs
much better on ‘fruits’ and ‘tools’, possibly because the smaller, higher-quality search space
SWOW has as a semantic network suits categories that have fewer items belonging to them in
general (‘fruits’ and ‘tools’ have median list lengths of 18 and 16, respectively). However, a
smaller search space is a double-edged sword and hurts performance on uncommon and wider
categories like ‘supermarket items’ and ‘foods’ (median list lengths 35 and 36.5, respectively).
The most significant performance gain by TLM-based approaches is on these uncommon
categories. Possible reasons for this are a larger vocabulary (coverage), better context-awareness
(all these TLMs have attention heads), and greater processing capacity. Note that we never
fine-tuned these TLMs, and we are using just the pre-trained versions, so any performance
reflected here is due to learning during pre-training.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experiment 2: Can TLMs identify individual differences?</title>
        <p>Table 1 clearly shows that large TLMs outperform the other language model types we considered
(except miniBERTa, which is significantly smaller than the other TLMs listed). However, such
aggregate evaluations can sometimes hide interesting information. For example, is it the case
that certain functions better model individual SFLs? If so, this might suggest further ways of
generating exploratory hypotheses, in line with the idea of hyperparameter hypothesization
(§1.1)—e.g., if one prompt type seems to model lists generated by one person best, whereas a
different prompt type best models lists generated by another, then this may suggest that there is
a qualitative difference in how those two individuals “query” their memories when performing
this task which is captured by the diferent prompt styles.</p>
        <p>Experiment 2, therefore, aims to take first steps in exploring the plausibility of these ideas.
We start by examining the extent to which specific functions (each of which, recall, consists of a
model plus hyperparameter values) can adapt to individuals. Our task is conceptualized as follows:
Given an individual who is performing the SFT, and a machine that is observing the list being
generated one item at a time, how quickly can the machine adapt to that individual and predict
what items the individual will list next?</p>
        <p>We refer to the functions compared in Experiment 1 as static, as they each have fixed
hyperparameter values. They are compared to two adaptive functions:
• Adapt-Then-Change (ATC): Chooses the static function performing the best on the first
k items of the SFL, and applies that chosen function to the rest of the SFL. Performance is
averaged across all the items in the latter part of the list.
• Continuous Adaptation (CA): Uses a sliding window of size k and chooses the best
performing function for all the items in this window. It then applies that chosen function
to the next item. After this, the window slides one item to the right. Whereas ATC only
makes one decision about which function to use per list, CA makes that decision for each
list item (starting from the kth item).</p>
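        <p>The two adaptive strategies can be sketched as follows, operating on a toy table of per-item top-k hits (1 if the static function predicted the item, 0 otherwise); the function and data names are illustrative, not from our implementation.</p>
        <preformat>
```python
def atc(hits_by_func, k):
    """Adapt-Then-Change: pick the function with the most hits on the
    first k items, then score it on the remaining items."""
    best = max(hits_by_func, key=lambda f: sum(hits_by_func[f][:k]))
    rest = hits_by_func[best][k:]
    return sum(rest) / len(rest)

def ca(hits_by_func, k):
    """Continuous Adaptation: before each item (from the k-th on), pick the
    function that did best on the previous k items and use it for that item."""
    length = len(next(iter(hits_by_func.values())))
    scores = []
    for n in range(k, length):
        best = max(hits_by_func, key=lambda f: sum(hits_by_func[f][n - k:n]))
        scores.append(hits_by_func[best][n])
    return sum(scores) / len(scores)

# Two toy static functions scored on a 6-item SFL.
hits = {"func_a": [1, 1, 0, 0, 0, 1],
        "func_b": [0, 0, 1, 1, 1, 0]}
print(atc(hits, 2))  # 0.25 (picks func_a, which hits 1 of the 4 remaining items)
print(ca(hits, 2))   # 0.25
```
        </preformat>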
        <p>We do not calculate coverage and scaled log-likelihoods in coverage for adaptive approaches
because they are not new approaches, just strategies to choose one hyperparameter setting for
one of the existing approaches.
3.3.1. Results
From Table 1, adaptive approaches clearly outperform the static approaches, indicating that a
smart way to choose hyperparameters can drastically improve prediction accuracy. Essentially,
this means that based on just l_1, ..., l_(n-1) and without knowing l_n, we are able to find a
set of hyperparameters that can be used to predict the next word with about 27% accuracy.
Considering that this is across all categories and across all individuals, and that the search space
for this task is not strictly bounded, these results exceed expectations. The adaptive approaches
do not have hyperparameters of their own, so BO and BI in Table 1 are calculated based on
different values of k.</p>
        <p>The comparison of top-5 accuracies between the adaptive approaches and the best static
approach (RoBERTa-Large) is shown in Figure 1. ATC outperforms BO static for certain values
of k, and CA outperforms BO static almost always. This is expected, because CA adapts to the
SFL items continuously while ATC adapts just once. As discussed in §3.2.1, BI is more of a
theoretical maximum, due to the fact that it is selected from all functions for each SFL after we
already know which does best on that SFL. Nonetheless, CA still impressively outperforms BI
static for certain values of k.</p>
        <p>[Figure 1: Top-5 accuracy comparison of adaptive and static approaches.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion, Limitations, and Future Work</title>
      <p>
        To our knowledge, we are the first to use transformer-based LMs to model semantic fluency
lists. We do so in the hopes of advancing the computational modeling of human semantic
knowledge and memory retrieval processes. To that end, we used RoBERTa-Large,
DistilBERT-base, and miniBERTa-med-small to model semantic fluency lists and found that RoBERTa-Large
and DistilBERT-base consistently outperformed the non-transformer-based approaches. We
hypothesize that miniBERTa-med-small’s poor performance is due to its smaller pre-training
corpus. DistilBERT-base’s performance suggests that the associative semantic information
needed to model human SFLs starts getting imparted as the size of the training corpus increases
from 1M words to about 3.4B. Future work can attempt to estimate the size of training corpus
in this range where TLMs start outperforming baseline approaches and whether larger training
corpora would hurt performance. We hope that upcoming TLMs, used with a wider variety of
prompts inspired by this paper, perhaps with a technique closer to prompt fine-tuning [
        <xref ref-type="bibr" rid="ref19">38</xref>
        ],
will give better results.
      </p>
      <p>Furthermore, we took a first step in exploring the ability of TLMs to determine individual
differences in retrieval behaviors in the SFT. Our results suggest that an adaptive approach
works best, in some cases even outperforming an oracular baseline. However, we do not yet
know if the function and hyperparameter choices that our adaptive approaches make reflect
stable individual cognitive traits or behaviors. Answering this question will be the focus of
future work, for which the present work has laid an important foundation.</p>
      <p>It is possible that fine-tuning on some subset of the SNAFU Sample dataset may yield better
likelihoods or top-k scores. Likewise, it may be trivial to fine-tune RoBERTa to output all
instances of a given category, or to train an LM to simply enumerate all known hyponyms of a
category word. But the goal of this work, as described in §1.1, is primarily to match and then
produce hypotheses for cognitive processes. As such, it was not our goal to simply create an
exhaustive item list generator; rather, we want to emulate how people generate SFLs.</p>
      <p>In using transformer-based LMs to model human response patterns, it should be noted that
we are not taking into account the many other psychological and cognitive constructs that factor
into the complex retrieval processes involved in the SFT. Although the method we describe here is
designed to compare individual linguistic retrieval strategies, it is unclear what exactly it tells us
about how performance on semantic fluency tasks relates to individuals’ executive functioning
and self-regulation skills, which fluency tasks are often employed to study [39, 40]. Rather, the
work here is the first step in our hyperparameter hypothesization strategy (§1.1), which we propose
here for the first time and believe contributes to the present symposium’s goals.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Acknowledgements</title>
<p>Part of this research was sponsored by the DEVCOM Analysis Center and was accomplished
under Cooperative Agreement Number W911NF-22-2-0001. The views and conclusions contained
in this document are those of the authors and should not be interpreted as representing
the official policies, either expressed or implied, of the Army Research Office or the U.S.
Government. The U.S. Government is authorized to reproduce and distribute reprints for
Government purposes notwithstanding any copyright notation herein.</p>
      <p>[5] J. C. Zemla, J. L. Austerweil, Modeling semantic fluency data as search on a semantic
network, in: CogSci: Annual Conference of the Cognitive Science Society, volume 2017, 2017,
pp. 3646–3651.</p>
      <p>[6] J. Avery, M. N. Jones, Comparing models of semantic fluency: Do humans forage optimally,
or walk randomly?, in: CogSci, 2018.</p>
      <p>[7] K. J. Holyoak, J. E. Hummel, The Proper Treatment of Symbols in a Connectionist
Architecture, in: E. Dietrich, A. Markman (Eds.), Cognitive Dynamics: Conceptual Change in
Humans and Machines, MIT Press, Cambridge, MA, 2000.</p>
      <p>[8] R. Sun, Duality of the Mind: A Bottom Up Approach Toward Cognition, Lawrence Erlbaum
Associates, Mahwah, NJ, 2002.</p>
      <p>[9] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman,
SuperGLUE: A stickier benchmark for general-purpose language understanding systems,
in: Proceedings of NeurIPS, 2019.</p>
      <p>[10] L. Cui, S. Cheng, Y. Wu, Y. Zhang, Does BERT solve commonsense task via commonsense
knowledge?, 2020. arXiv:2008.03945.</p>
      <p>[11] C. Gambi, P. Jindal, S. Sharpe, M. J. Pickering, H. Rabagliati, The relation between
preschoolers’ vocabulary development and their ability to predict and recognize words,
Child Development (2020). URL: https://srcd.onlinelibrary.wiley.com/doi/abs/10.1111/
cdev.13465. doi:10.1111/cdev.13465.</p>
      <p>[12] M. Schrimpf, I. Blank, G. Tuckute, C. Kauf, E. A. Hosseini, N. Kanwisher, J. Tenenbaum,
E. Fedorenko, The neural architecture of language: Integrative reverse-engineering
converges on a model for predictive processing, bioRxiv (2020). URL: https://www.biorxiv.org/
content/early/2020/10/09/2020.06.26.174482. doi:10.1101/2020.06.26.174482.</p>
      <p>[13] R. Sun, Introduction to computational cognitive modeling, in: The Cambridge Handbook
of Computational Psychology, Cambridge University Press, 2008.</p>
      <p>[14] M. Jones, B. C. Love, Bayesian Fundamentalism or Enlightenment? On the Explanatory
Status and Theoretical Contributions of Bayesian Models of Cognition, Behavioral and
Brain Sciences 34 (2011) 169–231.</p>
      <p>[15] G. Marcus, E. Davis, How Robust Are Probabilistic Models of Higher-Level Cognition?,
Psychological Science 24 (2012) 2351–2360.</p>
      <p>[16] G. Marcus, E. Davis, Still Searching for Principles: A Response to Goodman et al. (2015),
Psychological Science 26 (2015) 542–544.</p>
      <p>[17] P. Bricker, Ontological Commitment, in: E. N. Zalta (Ed.), The Stanford Encyclopedia of
Philosophy, winter 2016 ed., Metaphysics Research Lab, Stanford University, 2016.</p>
      <p>[18] L. Floridi, The Philosophy of Information, Oxford University Press, 2011.</p>
      <p>[19] A. M. Collins, M. R. Quillian, Retrieval time from semantic memory, Journal of Verbal
Learning and Verbal Behavior 8 (1969) 240–247.</p>
      <p>[20] T. T. Hills, P. M. Todd, M. N. Jones, Foraging in semantic fields: How we search through
memory, Topics in Cognitive Science 7 (2015) 513–534.</p>
      <p>[21] G. A. Miller, WordNet: An Electronic Lexical Database, MIT Press, 1998.</p>
      <p>[22] B. MacWhinney, The CHILDES Project: The Database, volume 2, Psychology Press, 2000.</p>
      <p>[23] D. L. Nelson, C. L. McEvoy, T. Schreiber, The University of South Florida free
association, rhyme, and word fragment norms, Behavior Research Methods, Instruments, and
Computers 36 (2004) 402–407.</p>
      <p>[39] D. M. Whiteside, T. Kealey, M. Semla, H. Luu, L. Rice, M. R. Basso, B. Roper,
Verbal fluency: Language or executive function measure?, Applied Neuropsychology:
Adult 23 (2016) 29–34. URL: https://doi.org/10.1080/23279095.2015.1004574. doi:10.1080/
23279095.2015.1004574, PMID: 26111011.</p>
      <p>[40] S. L. Aita, J. D. Beach, S. E. Taylor, N. C. Borgogna, M. N. Harrell, B. D. Hill,
Executive, language, or both? An examination of the construct validity of verbal fluency
measures, Applied Neuropsychology: Adult 26 (2019) 441–451. URL: https://doi.org/10.1080/
23279095.2018.1439830. doi:10.1080/23279095.2018.1439830, PMID: 29513079.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Welsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. F.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Groisser</surname>
          </string-name>
          ,
          <article-title>A normative-developmental study of executive function: A window on prefrontal function in children</article-title>
          ,
          <source>Developmental neuropsychology 7</source>
          (
          <year>1991</year>
          )
          <fpage>131</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T. T.</given-names>
            <surname>Hills</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Todd</surname>
          </string-name>
          ,
          <article-title>Optimal foraging in semantic memory</article-title>
          .,
          <source>Psychological review 119</source>
          (
          <year>2012</year>
          )
          <fpage>431</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Abbott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Austerweil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
<surname>Griffiths</surname>
          </string-name>
          ,
          <article-title>Random walks on semantic networks can resemble optimal foraging</article-title>
          .,
          <source>in: Neural Information Processing Systems</source>
          Conference;
<article-title>A preliminary version of this work was presented at the aforementioned conference</article-title>
          ., volume
          <volume>122</volume>
          , American Psychological Association,
          <year>2015</year>
          , p.
          <fpage>558</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Zemla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Kenett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-S.</given-names>
            <surname>Jun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Austerweil</surname>
          </string-name>
          , U-invite:
          <article-title>Estimating individual semantic networks from fluency data</article-title>
          .,
          <source>in: Proceedings of the 38th Annual Meeting of the Cognitive Science Society, Cognitive Science Society</source>
          , Austin, TX,
          <year>2016</year>
          , pp.
          <fpage>1907</fpage>
          -
          <lpage>1912</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>De Deyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Navarro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perfors</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brysbaert</surname>
          </string-name>
          , G. Storms,
          <article-title>The small world of words: English word association norms for over 12,000 cue words</article-title>
          ,
          <year>2019</year>
. URL: psyarxiv.com/mb93p. doi:10.3758/s13428-018-1115-7.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Zemla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Austerweil</surname>
          </string-name>
          ,
<article-title>Snafu: The semantic network and fluency utility</article-title>
          ,
          <source>Behavior research methods 52</source>
          (
          <year>2020</year>
          )
          <fpage>1681</fpage>
          -
          <lpage>1699</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kajic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gosmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Komer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Orr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Eliasmith</surname>
          </string-name>
          ,
          <article-title>A biologically constrained model of semantic memory search</article-title>
          ., in: CogSci,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>Entity set expansion via knowledge graphs</article-title>
          ,
          <source>in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '17,
Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , p.
          <fpage>1101</fpage>
          -
          <lpage>1104</lpage>
. URL: https://doi.org/10.1145/3077136.3080732. doi:10.1145/3077136.3080732.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ettinger</surname>
          </string-name>
          ,
          <article-title>What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>34</fpage>
          -
          <lpage>48</lpage>
. URL: https://aclanthology.org/2020.tacl-1.3. doi:10.1162/tacl_a_00298.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [29]
          <string-name>
<given-names>A.</given-names>
<surname>Laverghetta Jr.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nighojkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mirzakhalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Licato</surname>
          </string-name>
          ,
          <article-title>Can Transformer Language Models Predict Psychometric Properties?</article-title>
          ,
          <source>in: Proceedings of the 10th Joint Conference on Lexical and Computational Semantics (*SEM</source>
          <year>2021</year>
          ),
Association for Computational Linguistics
          , Bangkok, Thailand,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , W.-t. Yih, G. Zweig,
<article-title>Linguistic regularities in continuous space word representations</article-title>
,
<source>in: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Association for Computational Linguistics,
          <year>2013</year>
          , pp.
          <fpage>746</fpage>
          -
          <lpage>751</lpage>
          . URL: http://aclweb.org/anthology/N13-1090.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>R.</given-names>
            <surname>Parker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Graf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Maeda</surname>
          </string-name>
          ,
          <article-title>English gigaword fifth edition</article-title>
          ,
          <source>Linguistic Data Consortium</source>
          (
          <year>2011</year>
). doi:10.35111/wk4f-qt80.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          , T. Wolf,
<article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
,
<year>2019</year>
. URL: https://arxiv.org/abs/1910.01108. doi:10.48550/arXiv.1910.01108.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
<article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
, arXiv preprint arXiv:1907.11692 (
<year>2019</year>
).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
,
<string-name>
  <given-names>H.-S.</given-names>
  <surname>Li</surname>
</string-name>
,
<string-name>
  <given-names>H.</given-names>
  <surname>Liu</surname>
</string-name>
,
<string-name>
  <given-names>S. R.</given-names>
  <surname>Bowman</surname>
</string-name>
          ,
<article-title>Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually)</article-title>
,
          <year>2020</year>
. arXiv:2010.05358.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
<string-name>
  <given-names>I.</given-names>
  <surname>Polosukhin</surname>
</string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS</source>
          <year>2017</year>
          ), Long Beach, CA, USA,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [36]
<string-name>
  <given-names>J.-H.</given-names>
  <surname>Luo</surname>
</string-name>
,
<string-name>
  <given-names>J.</given-names>
  <surname>Wu</surname>
</string-name>
,
<string-name>
  <given-names>W.</given-names>
  <surname>Lin</surname>
</string-name>
          ,
          <article-title>Thinet: A filter level pruning method for deep neural network compression</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          (ICCV),
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ravuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <article-title>Classification accuracy score for conditional generative models</article-title>
          ,
          <year>2019</year>
. arXiv:1905.10887.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>L.</given-names>
            <surname>Fichtel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Kalo</surname>
          </string-name>
          , W.-T. Balke,
          <article-title>Prompt tuning or fine-tuning - investigating relational knowledge in pre-trained language models</article-title>
          ,
<source>in: 3rd Conference on Automated Knowledge Base Construction</source>
, 2021. URL: https://openreview.net/forum?id=o7sMlpr9yBW.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>