Alexander Porshnev1,*, Kevin Dirk Kiy1, Diarmuid O'Donoghue1, Manokamna Singh1, Cai Wingfield2 and Dermot Lynott1,*

1 Maynooth University, Maynooth, Co Kildare, Ireland
2 The University of Birmingham, Edgbaston, Birmingham, B15 2TT, United Kingdom

AICS'24: 32nd Irish Conference on Artificial Intelligence and Cognitive Science, December 9–10, 2024, Dublin, Ireland
* Corresponding author.
alexander.porshnev@mu.ie (A. Porshnev); dermot.lynott@mu.ie (D. Lynott)
0000-0002-0075-1061 (A. Porshnev); 0009-0009-1771-533X (K.D. Kiy); 0000-0002-3680-4217 (D. O'Donoghue); 0000-0002-0187-3597 (M. Singh); 0000-0002-0254-199X (C. Wingfield); 0000-0001-7338-0567 (D. Lynott)

Abstract
Prejudicial attitudes, such as those relating to age, race, or gender, exert a powerful influence on individuals, and are pervasive throughout society. Recent research suggests that the statistical patterns of how words are used in language may capture such biases, with language models providing approximations for people's linguistic experience. However, many questions on the links between language models and people's biased attitudes remain unanswered. In the current study we focus on gender–career bias (where men are routinely favoured over women in the workplace) to examine the extent to which language models can be used to model behavioural responses in the Gender–Career Implicit Association Test (IAT). We provide a systematic evaluation of a range of language models, including n-gram, count vector, predict, and Large Language Models (LLMs), to determine how well they capture people's behaviour in the IAT. We examined data from over 800,000 participants, tested against over 600 language model variants. While we find that LLMs perform well in modelling IAT responses, they are not significantly better than simpler count vector and predict models, with these other models actually providing better fits to the behavioural data using Bayesian estimates. Our findings suggest that societal biases may be encoded in language, but that resource-greedy large language models are not necessary for their detection.

Keywords
Linguistic distributional models, Computational modelling, Large language models, Implicit association test, Bias

1. Introduction

Janet is less likely to be called for an interview than John. Jamal is less likely to get a job than James, and obese applicant Julie is less likely to be shortlisted than her perceived healthy colleague Joan. Implicit biases (ones that we are not consciously aware of) are pervasive in our society [1] and can lead directly to prejudicial decision-making [e.g., 2–4]. Biases linked to gender, race, perceived health status, and many other characteristics are seen in employment, education, criminal justice, politics, and healthcare [5]. For example, in employment contexts, prospective female employees are rated as less competent and hireable than (identical) male applicants, and are less likely to be offered a job [3]. Similarly, while women engineers publish in journals with higher Impact Factors than their male peers, they receive fewer citations from the scientific community [6]. In these ways gender bias in career progression is readily visible, in both empirical studies and in the workplace. Such systemic patterns make the issue of bias one of global importance, and one with significant economic and societal costs. For example, employees who perceive bias are more than three times as likely to quit their jobs, with an estimated cost of up to $550 billion in the US alone [7]. Yet despite the acknowledged prevalence of such biases, we still do not fully understand where these biases come from or how they are transmitted [8]. However, we know that language is a primary form of cultural transmission [9,10], making it a plausible candidate for the communication and entrenchment of society's implicit biases.
Furthermore, recent work in computational modelling suggests that implicit biases may be captured by the latent statistical patterns in language [11–14]. In other words, the way words appear together in language may influence our unconscious, and potentially prejudicial, attitudes towards others. However, we still do not have a good handle on whether these patterns can be adequately captured by existing language models.

Recent work has demonstrated that statistical distributional properties of words reflect human biases and prejudicial judgements [11–13]. The way positive and negative terms are distributed in language closely reflects the positive or negative biases people exhibit towards various concepts, as measured by implicit association tests (IATs), for example. The IAT is a computerised task, with strong internal consistency and test–retest reliability [8], and is the most commonly used measure of people's automatic associations between concepts, used in thousands of studies [e.g., 5]. In an IAT, participants classify stimuli into categories as quickly as possible, where faster responses indicate stronger associations between concepts [15]. Displaying a greater degree of bias results in a higher D score for a participant. For example, in a Gender–Career IAT which contrasts male and female names, people consistently respond more quickly when male names are paired with career-related concepts (e.g., "John" and "Management"), compared to male names paired with non-career-related concepts (e.g., "John" and "family"), and vice versa for pairings with female names and concepts. This pattern indicates stronger negative associations for women and career concepts. Thus, participants tend to respond more quickly to congruent stimulus pairings (i.e., male names and career concepts, or female names and family concepts) compared to incongruent stimulus pairings (i.e., male names and family concepts, or female names and career concepts). This is not to say that every individual participant follows this pattern, but over a large sample of participants this is the pattern that emerges [e.g., 8].

Investigating how language models capture human biases measured by Implicit Association Tests, Lynott et al. [12] used n-gram co-occurrence counts from a corpus of over 1 trillion words to show that stereotypical Black names (e.g., Jamal) co-occur with more negative attributes than White names (e.g., Brad), correlating strongly with implicit biases (r = 0.79) found in IATs. In the work of Caliskan et al. [11], word embeddings such as GloVe were used to capture biases such as preferences for flowers over insects and gender imbalances in professions.
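To make the IAT scoring logic described above concrete, the short sketch below, written in R (the environment used for the analyses reported later), computes mean RTs for congruent and incongruent trials and a simplified D-like score from made-up response times. It is a minimal illustration only: the scoring algorithm actually used by Project Implicit includes additional steps such as error penalties and trial trimming.

# Minimal sketch with made-up response times (ms); the full Project Implicit
# D algorithm adds error penalties and trial trimming, omitted here.
trials <- data.frame(
  rt        = c(650, 700, 620, 900, 870, 940),
  condition = c("congruent", "congruent", "congruent",
                "incongruent", "incongruent", "incongruent")
)

mean_congruent   <- mean(trials$rt[trials$condition == "congruent"])
mean_incongruent <- mean(trials$rt[trials$condition == "incongruent"])
pooled_sd        <- sd(trials$rt)  # SD computed over both blocks

d_score <- (mean_incongruent - mean_congruent) / pooled_sd
d_score  # larger positive values indicate a stronger congruency (bias) effect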
In a recent paper, Bhatia and Walasek [16] go further, using large samples of human participants, which provide the much-needed statistical power for these kinds of analyses, and directly linking human behaviour in IAT tasks with cosine similarity between stimulus words in language models. While these studies highlight the potential of language models in bias detection, limitations include generally small sample sizes and reliance on a small number of language models. Zhang et al. [17] have raised concerns about the robustness of word embedding-based bias detection, noting that results may depend heavily on model parameters. Thus, it remains unclear whether language models generally can reliably capture biases, or if only a select few perform well under specific conditions.

To address this issue, in the current study we conduct an analysis of a range of language models, parameters and distance measures to determine a) what types of language models best capture behavioural responses to gender–career concepts as measured by the implicit association test, and b) what distance measures show the best relationship between language models and human behaviour. While much work in AI often assumes that larger and more computationally intensive models will perform better in most tasks, findings in the cognitive modelling literature suggest that this is not always the case. For example, as tasks become more complex, simpler models often do as well as, if not better than, more complex models [18,19]. Thus, a priori we might expect LLMs to perform very well, but developers' deliberate efforts to remove biases may make LLMs less suitable for this task, with greater computational complexity not buying them as much of an advantage as one might expect. As well as modelling performance, there are of course additional reasons for exploring simpler approaches than LLMs, including the high energy and financial costs, the environmental impact of LLM training and use [20], possible exposure of sensitive information [21], connectivity issues, potential reliance on ungoverned corporations, and the questionable traceability, explainability and reproducibility of results. Thus, we hope that the current study will provide some insights into how well LLMs and other language models can capture human behaviour linked to implicit bias.

2. Method

2.1. Linguistic Distributional Models

We examined four families of Linguistic Distributional Models (LDMs) that ranged considerably in their complexity: n-gram, count vector, predict, and large language models. The first three families have been used extensively in previous research, particularly in cognitive and psycholinguistic work [e.g., 18,22–24], which has found that larger models do not necessarily lead to better performance in modelling cognitive tasks [e.g., 18]. We also included three well-known large language models from Meta and MistralAI [25,26], which were primarily developed with a focus on text generation. A more detailed description of many of these models, including calculations for each measure, can be found in Wingfield and Connell [18], Pennington et al. [24], Touvron et al. [25] and Jiang et al. [26].

Table 1
Summary of all 663 models, including variants by corpus, window radius, distance measures, and embedding size. Custom models are those where various parameters have been manipulated by the researchers, while "open" models are those that, while transparent, have fixed sets of parameters, including training corpus, embedding size, and so on.
Model family | Model | Window radius | Embedding size | Number of LDMs

Custom models
count a,b | Conditional probability | 1, 3, 5, 10 | – | 36
count a,b | Log co-occurrence frequency | 1, 3, 5, 10 | – | 36
count a,b | PMI | 1, 3, 5, 10 | – | 36
count a,b | PPMI | 1, 3, 5, 10 | – | 36
count a,b | Probability ratio | 1, 3, 5, 10 | – | 36
n-gram a | Conditional probability | 1, 3, 5, 10 | – | 12
n-gram a | Log n-gram frequency | 1, 3, 5, 10 | – | 12
n-gram a | PMI n-gram | 1, 3, 5, 10 | – | 12
n-gram a | PPMI n-gram | 1, 3, 5, 10 | – | 12
n-gram a | Probability ratio n-gram | 1, 3, 5, 10 | – | 12
predict a,b | Skip-gram | 1, 3, 5, 10 | 50, 100, 200, 300, 500 | 180
predict a,b | CBOW | 1, 3, 5, 10 | 50, 100, 200, 300, 500 | 180

Open models
count b | GloVe | Global | 300 | 3
llm b | llama-2-7b-chat.Q4_K_M | 4096 | 4096 | 18
llm b | cor_llama-2-13b-chat.Q5_K_S | 4096 | 5120 | 18
llm b | mistral-7b-v0.1.Q4_K_M | 4096 | 4096 | 18

a Each of the "custom" models has three variants, one for each training corpus (BNC, Subtitles, UKWAC).
b Three distances (Euclidean, cosine, correlation) were calculated for each vector model.

In this study we examine two classes of models: "custom" and "open". The "custom" trained models are the n-gram, count vector (excluding GloVe), and predict models, while the "open" models are the LLMs and GloVe. For the custom models, we had full control over the training corpora (using three different corpora: UK Web As Corpus – UKWAC [27], British National Corpus – BNC [28], and the BBC Subtitle Corpus; details provided below), context window sizes (1, 3, 5, 10), and embedding sizes (for prediction models: 50, 100, 200, 300, 500). For the "open" models we had no control over training procedures. To measure word distances we used conditional probability, log n-gram frequency, PMI, PPMI, and probability ratio for the n-gram models, and vector metrics (Euclidean, cosine, and correlation distance) for the count, predict, and LLM models (for a summary of the "custom" and "open" models used in the analysis, see Table 1).

2.2. Behavioral data and evaluating model performance

We collated human behavioral data from the Project Implicit Gender & Career IAT study, obtained from an Open Science Framework repository [29]. This dataset includes raw data per trial, including response times (RTs) for stimulus categorization in blocks with "congruent" and "incongruent" stimuli. We included only participants who chose the UK or USA as their current residence and country of origin, reflecting the English-language corpora used to generate the LDMs. We selected participants who were 18 or older and who completed the common version of the IAT (without additional stimuli), who did not have prior experience with IAT tasks [30,31], for whom raw trial data are available, and whose calculated bias effect size D was equal to the effect size in the preprocessed dataset (to avoid discrepancies between raw and preprocessed data). A summary of the data preprocessing is presented in Supplementary Materials, Table A. After data cleaning, the final dataset comprised 802,070 participants from the USA and UK, spanning the period from 2005 to 2021. Descriptive statistics are presented in Supplementary Materials, Tables B–E. Next, we randomly split the whole sample into two subsamples (Sample A: 401,025; Sample B: 401,045) balanced by country and year. For each sample we calculated the mean response time (m_RT) for each pair of stimuli in each condition (congruent vs incongruent), separated by country (USA, UK).
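These m_RT values are later modelled as a function of the distances produced by each LDM. As a concrete illustration of the three vector distance measures listed in Table 1, the sketch below (in R, using two short made-up vectors rather than embeddings from any of the models above) computes the Euclidean, cosine, and correlation distances between a pair of word vectors.

# Toy word vectors for illustration only; the real analyses used vectors
# produced by the count, predict, and LLM models described in Section 2.1.
v1 <- c(0.2, -0.5, 0.1, 0.7)
v2 <- c(0.1, -0.4, 0.3, 0.5)

euclidean_dist   <- sqrt(sum((v1 - v2)^2))
cosine_dist      <- 1 - sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
correlation_dist <- 1 - cor(v1, v2)  # 1 minus the Pearson correlation

c(euclidean = euclidean_dist, cosine = cosine_dist, correlation = correlation_dist)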
2.2.1. Preregistration and analysis

We preregistered our approach to data handling and our planned analyses (https://aspredicted.org/ZWZ_9RV). Our primary analysis involves conducting multiple regression analyses to model the mean response time (RT) of participants in the IAT. First, using Subsample A, we established a baseline regression model containing the factors of country, log word frequency, number of letters and number of syllables as predictors, to account for important predictors of reading and processing time [32] that are not of theoretical importance in this case. Subsequently, we add a single LDM (e.g., word distances for n-gram models and vector distances for other families, including LLMs) to determine whether this model improves fit over the baseline model, using p < .05 as a threshold for significance. For each LDM, we calculate the Pearson correlation coefficient, r, and the change in Bayesian Information Criterion (BIC) to provide complementary measures of how well the model fits the RT data. Because the sign of the correlation indicating a good fit between model values and behavioural data differed between models [18], we report absolute Pearson's correlation values for ease of cross-comparison. BIC is a useful additional measure as it allows us to quantify the strength of evidence for and against each model being considered [33]: the change in BIC can be converted into an approximate Bayes factor, BF10 ≈ exp((BIC_baseline - BIC_LDM) / 2), comparing the model containing the LDM feature to the baseline. For example, values of BF10 > 3 are seen as providing positive support for the alternative hypothesis (models containing the LDM feature), values between 0.33 and 3 are in the anecdotal range, while values < 0.33 are seen as providing better support for the null hypothesis (i.e., in favour of the baseline model – see Jeffreys, 1961). We completed this analysis for all models. We ran all regression models in R [34] and flexmix [35], along with multiple helper packages [36–41].

Thus, the regression analysis allows us to identify language models for which the distances between the stimulus words (or word associations, for n-gram models) provide information about participants' RTs for the same stimulus words beyond the baseline model. This allows us to investigate the link between participants' behaviour and semantic distances in language model spaces, and to contribute to the discussion of how statistical distributional properties of words reflect human biases and prejudicial judgements. To test the robustness of these findings, we ran the same models on the second half of the data (Subsample B), and then compared performance across Subsamples A and B to determine whether there was consistency across the samples using a number of different measures: the correlation direction between model estimates in the A and B samples, inclusion of the B sample correlation estimate within the 95% confidence interval of the A sample correlation, whether the direction of the regression coefficient is the same for the A and B samples, whether significant models in A are also significant in B, and whether the change in BIC is similar (i.e., > 3) for both samples. We first report the results of our preregistered analyses, followed by the results of the robustness analysis outlined above.
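To illustrate the model-comparison step described above, the sketch below fits the baseline and baseline-plus-LDM regressions in R on simulated data, using hypothetical column names (m_RT, country, log_freq, n_letters, n_syllables, ldm_distance); the actual analysis scripts and data are available on the project's OSF page.

# Simulated data standing in for the real per-stimulus m_RT dataset;
# column names are hypothetical placeholders.
set.seed(1)
dat <- data.frame(
  m_RT         = rnorm(200, mean = 800, sd = 60),            # mean RT per stimulus pair (ms)
  country      = factor(sample(c("USA", "UK"), 200, replace = TRUE)),
  log_freq     = rnorm(200),                                  # log word frequency
  n_letters    = sample(3:10, 200, replace = TRUE),
  n_syllables  = sample(1:4, 200, replace = TRUE),
  ldm_distance = rnorm(200)                                   # distance from a single LDM
)

baseline <- lm(m_RT ~ country + log_freq + n_letters + n_syllables, data = dat)
with_ldm <- update(baseline, . ~ . + ldm_distance)            # add the LDM predictor

delta_bic   <- BIC(baseline) - BIC(with_ldm)        # positive values favour the LDM model
bf10_approx <- exp(delta_bic / 2)                   # BIC approximation to BF10
r_abs       <- abs(cor(dat$m_RT, dat$ldm_distance)) # absolute Pearson correlation

c(delta_BIC = delta_bic, BF10 = bf10_approx, abs_r = r_abs)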
3. Results

Overall, there was considerable variability in model performance, but the best-performing models in each family did very well in their ability to reflect human performance in the Gender–Career IAT. Using correlation strength, 42.5% of all language models led to significant improvements over the baseline model. Figure 1a summarises the model correlations: mean performance was best for LLMs (mean r = .495, 81.5% significant models, best-model r = .619), followed by count vector (mean r = .219, 42.5% significant models, best-model r = .535), predict (mean r = .209, 42.8% significant models, best-model r = .601), and then n-gram models (mean r = .179, 33.3% significant models, best-model r = .492), with 9 of the top 10 performing models being LLMs (see Figure 1b and Supplementary Materials, Table F). When we examined model performance using BIC as a measure of model fit, we saw a slightly different picture (Figure 1c). Using BIC change from the baseline model, only 2 of the top 10 models were LLMs, with the other 8 coming from the predict family of models, including a mix of CBOW and Skip-gram variants (Figure 1d and Supplementary Materials, Table G). Within each model family, the best model fit was found for predict models (best-model BIC change = 28.92), followed by LLMs (best-model BIC change = 18.8), count vector (best-model BIC change = 18.6) and lastly n-gram (best-model BIC change = 9.92). Figures 1b and 1d in particular highlight how considering correlation strength or change in BIC reveals different patterns, favouring LLMs in the former and predict models in the latter.

In terms of comparing the Euclidean, cosine, correlation and association (from n-gram models) measures, we find that all measures perform reasonably well, but that mean performance for Euclidean (M = 0.263, SD = 0.166) and correlation (M = 0.264, SD = 0.165) distances is greater than that for cosine distance (M = 0.226, SD = 0.165), which in turn is greater than the association strength of the n-gram models (M = 0.179, SD = 0.114). If we consider the top-performing models (Supplementary Materials, Tables F and G), we can see that they are dominated by models using Euclidean distance. This dominance of Euclidean distance was not expected, as this metric is sometimes not even considered in contemporary research (e.g., [16]). Thus, the question of the best distance measure remains open and needs further investigation.

For the three different corpora that were used for the customisable models, we found that models using the BBC Subtitle corpus performed best on average (M = 0.314, SD = 0.146), followed by those using the BNC (M = 0.209, SD = 0.163), and then those using UKWAC (M = 0.175, SD = 0.146). It is perhaps surprising to see that the relatively small, but high-quality, subtitle corpus outperforms the UKWAC corpus, which is an order of magnitude larger.

For embedding sizes, there is a notable increase in average performance for larger embeddings, with the strongest performance for LLMs with embedding sizes of 4096 (M = 0.509, SD = 0.103) and 5120 (M = 0.467, SD = 0.048). However, these mean values mask the fact that there are individual models with extremely good performance at almost all embedding sizes. For example, the best-performing predict models with embedding sizes of only 50 and 100 have model performance of 0.593 and 0.601 respectively. Similarly, for context window radius, while larger radii perform better on average (e.g., radius 32768, M = 0.516, SD = 0.081; radius 4096, M = 0.485, SD = 0.094), the best-performing predict models with a radius of 5 and 10 had model performances of 0.578 and 0.601 respectively. Even the best-performing predict model with a radius of only 1 had a correlation of 0.535 with the behavioural data.

We compared the patterns observed in Sample A and Sample B and found very similar overall patterns.
We found that the correlation between model performance in the two samples was r > .99, with 100% of Sample B correlations falling within the 95% confidence intervals for the correlations of Sample A, demonstrating extremely high consistency across both samples.

Figure 1
Distribution of correlations (a, b) and BIC change (c, d) between each LDM and mean RT in the four model families, for all models (a, c) and the top ten models (b, d). Each dark circle represents an individual model instantiation, while the violin outline areas represent the density of correlations in the overall distribution.

With more conservative checks, some differences emerge across model families. Looking at the consistency of the sign of the beta coefficients across samples, almost all models do well (>99% consistent), with the exception of the count probability ratio model (67%) and several n-gram models that are less than 95% consistent. Looking at whether significant regression models are consistent across both samples, we find that certain models within each model family perform well, and overall LLMs and predict models are more consistent. Considering the change in BIC in both samples, LLMs show reasonable robustness, with 43% of models showing consistency across both samples, followed by predict (29%), count vector (26%), and lastly n-gram models trailing with 13%. All results for the models can be found in Supplementary Materials, Table H.

4. Discussion

Language models and linguistic distributional information more generally have previously been used to demonstrate associations between the statistical regularities in language and people's implicit biases. In this paper, we report the findings of a systematic analysis of a large range of language models in modelling human behaviour in a Gender–Career implicit association test. While large language models perform well in some cases, by other measures higher BIC changes are observed for less resource-intensive predict models, such as CBOW and Skip-gram models (in the top 10 models by decrease in BIC on Sample A, 3 are CBOW and 5 are Skip-gram models; see Supplementary Materials, Table H). It is worth noting that LLM values may be slightly elevated generally, given that these models cannot be customised to the same extent as the other models, meaning we cannot create LLM variants with much smaller embedding sizes or context windows, as we can with the other model families. Results also reveal that high-quality corpora can outperform larger, but noisier, corpora. For example, the relatively small Subtitle corpus (200 million words) resulted in better model performance than UKWAC (2 billion words), and even larger corpora like those used in GloVe (6 billion words) and the LLMs (trillions of words). For distance measures, we found that there were good-performing models with all measures (Euclidean, cosine and correlation), but that the best-performing models tended to use Euclidean distance. While there was a general trend for larger embedding and context window radius sizes to do better, we found that there are models with very strong performance at even the smallest embedding and radius sizes. Our robustness analyses found that there was generally very good correspondence in model performance across the A and B samples of the dataset, with very high correlations between observed effects, although more conservative measures highlight some differences across model families. What do these patterns suggest?
The patterns certainly run counter to the "bigger is better" intuition people often have about language models, and the seemingly constant drive to produce larger and larger models with higher and higher resource requirements. The findings suggest that appropriately tailored non-LLMs can perform as well as, if not better than, LLMs in certain cases. This pattern reflects recent findings indicating that larger LLMs may actually be less reliable than smaller language models [42]. Given that even the simplest of language models can give rise to good model performance in capturing behaviour in the Gender–Career implicit association test, it also suggests that biases are encoded very generally in linguistic information, and therefore do not specifically require LLMs for them to be uncovered. Thus, although LLMs do well in the current study, despite their massively greater complexity and resource requirements [e.g., 20], they do not do significantly better than the leaner, more efficient, less resource-intensive predict and even some count vector models. If one is additionally concerned with the cognitive plausibility of the models being examined, then LLMs are also left wanting in terms of plausible learning mechanisms, training data that is orders of magnitude greater than what people can experience during a lifetime, and a general lack of grounding in broader sensorimotor experience, which is also critical to people's acquisition of semantic knowledge [22]. The finding that language models can predict human behaviour in IATs demonstrates that cultural biases are reflected in the statistical properties of language; however, it is also the case that biases in language – the medium through which much information about the world is obtained – reinforce psychological biases in humans [13]. With LLMs playing an ever-larger role in mediating information, and in the production of new linguistic material, biases in statistical language models risk entering a vicious, self-reinforcing cycle with potential real-world consequences [43].

4.1.1. Limitations, challenges, and deviations from preregistration

Despite the general trend for good model performance across a range of model families and parameter settings, there are of course important limitations to highlight in the current work. Furthermore, we would like to note some minor deviations from the original preregistration of this study. First, although we were able to use a large sample of participant data from Project Implicit (>800K participants, following preprocessing), our focus was only on one specific IAT topic, and therefore only on one set of stimuli. While the current findings are suggestive of the capacity of language models generally to reflect human behavioural biases, our future work will need to consider extending the stimuli and range of topic areas addressed in our modelling work. Given findings elsewhere [e.g., 11,13], we are hopeful that these findings will extend well to other areas of implicit bias. During the process of testing the large range of models, we also observed some surprising patterns in the relationship between model output and the behavioural responses in the IAT. Most models produced results in an intuitive and expected way, with higher model values showing a positive relationship with the behavioural responses. However, for the large language and GloVe models, we consistently observed negative correlations.
Thus, while the relationships between model and behaviour were often strong, they were in the opposite direction to those observed for most other models. It is unclear exactly why this is the case, but there are a number of possible explanations. First, the majority of LLMs, including the Llama 2 and Mistral models used here, include not only a language-model component, but also a further reinforcement learning from human feedback (RLHF) component that fine-tunes the original language model. This is similar to the approach used with the ChatGPT family of models. However, it remains unclear what the specific impact of this reinforcement learning stage is on the linguistic representations within the models. One of the aims of the reinforcement learning stage of LLMs is to counteract known biases, such as those associated with race and gender, thereby potentially altering the internal representations and leading to unexpected negative correlations.

Related to this issue, in the current study we focus only on the semantic similarity of individual words within language models. However, Zhang and colleagues [17] suggest that focussing on the word level can be problematic and give rise to anomalous results. Considering metrics of gender bias specifically, Zhang et al. suggest that, in addition to semantic similarity, two other factors can be responsible for smaller distances in embedding space: sociolinguistic factors and the mathematical properties of vectors [17]. Regarding the mathematical properties of vectors, Zhang et al. argue that very high similarity scores between vectors can make it very difficult to properly evaluate bias. For example, words and their plurals can be assigned opposite bias directions due to vector multiplication, even though they should be considered conceptually in the same way. In our case, we can speculate that where there is very high cosine similarity between base words (e.g., in the GloVe 6B model the cosine similarity between "female" and "male" is 0.894), smaller distances can be observed by chance, rather than due to gender bias per se. Additionally, we could also expect some sociolinguistic factors to impact the representation of bias within models. For example, we might expect the distance between "family" and "business" to be lower not because of any Gender–Career bias, but because of the relatively high frequency of co-occurring phrases like "family business". Indeed, in the GloVe model used here, the cosine distance between "business" and "family" is lower (.632) than that between "business" and "career" (.673). Zhang and colleagues suggest that focusing on the conceptual level (e.g., examining clusters of concepts related to the core concepts of "family" and "career") may give rise to more robust results when considering bias, although the feasibility of this approach may depend on the modelling context [17].

Many of the language models we tested here are fully transparent and fully customisable. However, LLMs rarely offer the same levels of transparency, which is problematic for researchers. We overcome this issue somewhat by using publicly available models like Llama 2 and Mistral, which, for LLMs, offer some of the greatest visibility into their behaviour and underlying representations. However, even with these models we do not have complete information on the composition and size of the training data used.
In an ideal world, researchers would have complete access to all aspects of these models to fully and fairly assess their performance. In addition, the field of LLMs continues to develop rapidly, and the LLMs included in this study are now classified by researchers as S(mall)LLMs. Furthermore, in this study we used heavily quantized (Q4) models, potentially reducing precision. Thus, the results of this study should be regarded with care, and better generalisation of the findings will require more research with an extended set of models.

In terms of preregistration deviations, our primary analyses and treatment of the data follow our original plan very closely. However, our original plan included a smaller number of models and model families, primarily because this work was originally proposed more than 2 years ago, meaning that we originally included only n-gram, count vector and predict models. Given the pace of change in the world of AI and language models, we felt it was important to include newer large language models, and so added the Llama 2, Mistral, and GloVe models to provide a more complete picture of language models in this domain.

Finally, it is worth mentioning that "gender" is descriptive of a broad phenomenon which extends beyond simply "male" and "female". Existing gender-based IATs employ a strict male/female binary, which we have therefore followed in the present analysis. Gender-based stereotypes and biases relating to transgender and non-binary identities, and particularly their reflection in language, remain an under-studied topic [44,45].

4.1.2. Conclusion

Overall, we find that a range of language models can capture human behavioural performance in relation to Gender–Career implicit biases. While LLMs perform well, their additional resource requirements may not be warranted, as they do not reliably outperform much simpler and more cost-effective models.

Acknowledgements

This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant number 21/FFP-P/10118. All images and data used in this article are licensed under a Creative Commons Attribution 4.0 International License (CC-BY), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. Any images or other third-party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. The authors declare that they have no conflict of interest. Our data, analysis scripts, supplementary materials and figures have been deposited on the project's page on the Open Science Framework: https://osf.io/n36gr/?view_only=f385b438c2ac430b8f5e2ddc5fc87c0d

References

[1] C. Staats, Understanding Implicit Bias: What Educators Should Know, American Educator 39 (2016) 29. [2] R. Chang, Preliminary report on race and Washington's criminal justice system, (2011). [3] C.A. Moss-Racusin, J.F. Dovidio, V.L. Brescoll, M.J. Graham, J.
Handelsman, Science faculty’s subtle gender biases favor male students, Proceedings of the National Academy of Sciences 109 (2012) 16474–16479. [4] K.S. O’Brien, J.D. Latner, D. Ebneter, J.A. Hunter, Obesity discrimination: The role of physical appearance, personal ideology, and anti-fat prejudice, International Journal of Obesity 37 (2013) 455. [5] A.G. Greenwald, L.H. Krieger, Implicit bias: Scientific foundations, California Law Review 94 (2006) 945–967. [6] G. Ghiasi, V. Larivière, C.R. Sugimoto, On the compliance of women engineers with a gendered scientific system, PLOS ONE 10 (2015) e0145931. https://doi.org/10.1371/journal.pone.0145931. [7] E. O’Boyle, J. Harter, State of the American workplace: Employee engagement insights for U.S. business leaders, Gallup, 2013. [8] B.A. Nosek, F.L. Smyth, J.J. Hansen, T. Devos, N.M. Lindner, K.A. Ranganath, M.R. Banaji, Pervasiveness and correlates of implicit attitudes and stereotypes, European Review of Social Psychology 18 (2007) 36–88. [9] M.A.K. Halliday, Explorations in the functions of language, Arnold, 1973. [10] S. Kirby, H. Cornish, K. Smith, Cumulative cultural evolution in the laboratory: An experimental approach to the origins of structure in human language, Proceedings of the National Academy of Sciences 105 (2008) 10681–10686. [11] A. Caliskan, J.J. Bryson, A. Narayanan, Semantics derived automatically from language corpora contain human-like biases, Science 356 (2017) 183–186. https://doi.org/10.1126/science.aal4230. [12] D. Lynott, H. Kansal, L. Connell, K. O’Brien, Modelling the IAT: Implicit Association Test reflects shallow linguistic environment and not deep personal attitudes, in: Proceedings of the Annual Meeting of the Cognitive Science Society, 2012. [13] D. Lynott, M. Walsh, T. McEnery, L. Connell, L. Cross, K. O’Brien, Are you what you read? Predicting implicit attitudes to immigration based on linguistic distributional cues from newspaper readership; A pre-registered study, Frontiers in Psychology 10 (2019) 842. https://doi.org/10.3389/fpsyg.2019.00842. [14] L. Onnis, A. Lim, Distributed semantic representations of inanimate nouns are gender biased in gendered languages, in: Proceedings of the Annual Meeting of the Cognitive Science Society, 2024. https://escholarship.org/uc/item/50m8883c. [16] S. Bhatia, L. Walasek, Predicting implicit attitudes with natural language data, Proceedings of the National Academy of Sciences 120 (2023) e2220726120. https://doi.org/10.1073/pnas.2220726120. [17] H. Zhang, A. Sneyd, M. Stevenson, Robustness and Reliability of Gender Bias Assessment in Word Embeddings: The Role of Base Pairs, in: K.-F. Wong, K. Knight, H. Wu (Eds.), Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, Suzhou, China, 2020: pp. 759–769. https://aclanthology.org/2020.aacl-main.76 (accessed October 2, 2024). [18] C. Wingfield, L. Connell, Understanding the role of linguistic distributional knowledge in cognition, Language, Cognition and Neuroscience 37 (2022) 1220–1270. https://doi.org/10.1080/23273798.2022.2069278. [19] L. Grinsztajn, E. Oyallon, G. Varoquaux, Why do tree-based models still outperform deep learning on typical tabular data?, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA, 2024: pp. 507– 520. [20] S. Luccioni, B. Gamazaychikov, S. 
Hooker, R. Pierrard, E. Strubell, Y. Jernite, C.J. Wu, Light bulbs have energy ratings—so why can’t AI chat-bots?, Nature 632 (2024) 736–738. [21] E. Jaff, Y. Wu, N. Zhang, U. Iqbal, Data exposure from LLM apps: An in-depth investigation of OpenAI’s GPTs, (2024). [22] L. Connell, D. Lynott, What Can Language Models Tell Us About Human Cognition?, Current Directions in Psychological Science 33 (2024) 181–189. https://doi.org/10.1177/09637214241242746. [23] T.K. Landauer, S.T. Dumais, A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge, Psychological Review 104 (1997) 211–240. https://doi.org/10.1037/0033-295X.104.2.211. [24] J. Pennington, R. Socher, C. Manning, Glove: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: pp. 1532–1543. [25] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, arXiv Preprint (2023). [26] A.Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. Singh Chaplot, D. de las Casas, F. Bressand, Mistral 7B, (2023). [27] A. Ferraresi, E. Zanchetta, M. Baroni, S. Bernardini, Introducing and evaluating UKWaC, a very large web-derived corpus of English, in: Proceedings of the 4th Web as Corpus Workshop (WAC-4) Can We Beat Google, 2008: pp. 47–54. [28] BNC Consortium, British National Corpus, (2007). http://hdl.handle.net/20.500.12024/2554. [29] F.K. Xu, N. Lofaro, B.A. Nosek, A.G. Greenwald, J. Axt, L. Simon, N. Frost, Gender-Career IAT 2005-2023, (2024). https://osf.io/abxq7/. [30] A. Cochrane, W.T.L. Cox, C.S. Green, Robust within-session modulations of IAT scores may reveal novel dynamics of rapid change, Scientific Reports 13 (2023) 16247. https://doi.org/10.1038/s41598-023-43370-w. [31] J. Röhner, C.K. Lai, A diffusion model approach for understanding the impact of 17 interventions on the race implicit association test, Personality and Social Psychology Bulletin 47 (2021) 1374– 1389. https://doi.org/10.1177/0146167220974489. [32] A. Dymarska, L. Connell, B. Banks, Weaker than you might imagine: Determining imageability effects on word recognition, Journal of Memory and Language 129 (2023) 104398. https://doi.org/10.1016/j.jml.2022.104398. [33] Z. Dienes, N. Mclatchie, Four reasons to prefer Bayesian analyses over significance testing, Psychonomic Bulletin & Review 25 (2018) 207–218. [34] R Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, 2023. https://www.R-project.org/. [35] B. Grün, F. Leisch, flexmix: Flexible mixture modeling (Version 2.3-19) [R package], (2023). https://CRAN.R-project.org/package=flexmix. [36] H. Wickham, R. François, L. Henry, K. Müller, D. Vaughan, dplyr: A grammar of data manipulation (Version 1.1.4), GitHub, 2023. https://github.com/tidyverse/dplyr; https://dplyr.tidyverse.org. [37] T. Barrett, M. Dowle, A. Srinivasan, J. Gorecki, M. Chirico, T. Hocking, B. Schwendinger, data.table: Extension of data.frame [R package], (2006). https://doi.org/10.32614/CRAN.package.data.table. [38] D. Lüdecke, M. Ben-Shachar, I. Patil, P. Waggoner, D. Makowski, Performance: An R package for assessment, comparison, and testing of statistical models, Journal of Open Source Software 6 (2021) 3139. https://doi.org/10.21105/joss.03139. [39] D. Lüdecke, sjPlot: Data visualization for statistics in social science (Version 2.8.16) [R package], (2024). 
https://CRAN.R-project.org/package=sjPlot. [40] E.F. Haghish, md.log: Produces markdown log file with a built-in function call [R package], (2017). https://doi.org/10.32614/CRAN.package.md.log. [41] W. Revelle, psych: Procedures for psychological, psychometric, and personality research (Version 2.4.6), Northwestern University, 2024. https://CRAN.R-project.org/package=psych. [42] Y. Zhou, P. Xu, X. Liu, B. An, W. Ai, F. Huang, Explore Spurious Correlations at the Concept Level in Language Models for Text Classification, (2024). http://arxiv.org/abs/2311.08648 (accessed September 19, 2024). [43] I.O. Gallegos, R.A. Rossi, J. Barrow, M.M. Tanjim, S. Kim, F. Dernoncourt, N.K. Ahmed, Bias and fairness in large language models: A survey, Computational Linguistics (2024) 1–79. [44] K. Hansen, K. Żółtak, Social perception of non-binary individuals, Archives of Sexual Behavior 51 (2022) 2027–2035. https://doi.org/10.1007/s10508-021-02234-y. [45] M.K. McCarty, A.H. Burt, Understanding perceptions of gender non-binary people: Consensual and unique stereotypes and prejudice, Sex Roles 90 (2024) 392–416. https://doi.org/10.1007/s11199-024-01449-2.