=Paper=
{{Paper
|id=Vol-3762/518
|storemode=property
|title=LLM embeddings on test items predict post hoc loadings in personality tests
|pdfUrl=https://ceur-ws.org/Vol-3762/518.pdf
|volume=Vol-3762
|authors=Monica Casella,Maria Luongo,Davide Marocco,Nicola Milano,Michela Ponticorvo
|dblpUrl=https://dblp.org/rec/conf/ital-ia/CasellaLMMP24
}}
==LLM embeddings on test items predict post hoc loadings in personality tests==
Monica Casella1, Maria Luongo2, Davide Marocco3, Nicola Milano4*, Michela Ponticorvo5
1 University of Naples Federico II, Department of Humanistic Studies, Natural and Artificial Cognition Laboratory "Orazio Miglino", via Porta di Massa 1, Naples, 80125, Italy
Abstract
In this article we examine the application of Large Language Models (LLMs) in predicting factor
loadings in personality tests through the semantic analysis of test items. By leveraging text
embeddings generated from LLMs, we assess the semantic similarity of test items and their
alignment with hypothesized factorial structures, without relying on human response data. Our
methodology uses embeddings from the Big Five personality test to explore correlations
between item semantics and their grouping in factorial analyses. Preliminary results suggest
that LLM-derived embeddings can effectively capture semantic similarities among test items,
potentially serving as a valid measure for initial survey design and refinement. This approach
offers insights into the robustness of embedding techniques in psychological evaluations,
indicating a significant correlation with traditional test structures and providing a novel
perspective on test item analysis.
Keywords
Embeddings, language models, personality traits, semantic similarity
1. Introduction

In psychological testing and, more generally, in any context that involves evaluation, it is very important to assess item quality. This process, known as item analysis, involves the evaluation of several item characteristics, including item difficulty, item discrimination, and item and test reliability.

The two main approaches to item analysis, classical test theory (CTT) and item response theory (IRT) (see [1]), share the same starting point: a person-by-item matrix, in which examinees are rows and items are columns. This raises some problems; for example, the matrix can be sparse if many data points are missing.

Whereas item formulation relies on procedures that come before test (and item) administration, for example focus groups with experts [2], item analysis is typically based on post hoc analyses that use data coming from the administration of the test to a sample of respondents. In this phase, together with item characteristics, the test itself is evaluated, especially in the CTT framework, including the study of reliability and of test structure in terms of latent variables by means of factor analysis [3]. This is a consolidated approach.

Factor analysis is used to reduce a large number of variables to a smaller number of factors that have a psychological meaning. This technique extracts the maximum common variance from all variables and expresses it in a common score that is used for further analysis. It also allows the calculation of factor loadings, which are essentially the correlation coefficients between single variables and factors. Factor analysis has been widely used in psychological research, both in the cognitive domain (consider the factorial theories of intelligence [4-5]) and in the personality domain.
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author: nicola.milano@unina.it
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
In the latter case, among the theoretical proposals up to the 1990s ([6-7]), the Big Five Theory of personality is a notable example of how personality can be conceived and described as a constellation of different dimensions (factors, in the terms used above). Evidence for this theory has grown over the years, and the five broad personality traits it describes are identified as extraversion, agreeableness, openness, conscientiousness, and neuroticism. This approach was first developed in 1949 by [8] and later expanded by other researchers, up to the work by [9].

For the development of this approach, a key role has been played by trait measurement. Traditionally, a Big Five personality test is taken as a questionnaire with responses on a Likert scale [10]. The Big Five Questionnaire [11] and its newer version, the BFQ-2 [12], are used in different contexts and represent a gold standard for measuring personality according to the Big Five theory. It consists of 132 items. Some questions ask how much a person agrees or disagrees that he or she is someone who exemplifies a specific statement, such as "Is open to trying new experiences" (for openness, or open-mindedness) or "Is anxious about the future all the time" (for neuroticism, or negative emotionality). The responses, from "Strongly agree" to "Strongly disagree" (with alternatives in between), determine to what degree the respondent shows that specific trait.

In this contribution, we propose a method to assess the strength of the connection between items and factors based only on the items considered as linguistic material, before test administration. To do this, we used large language models (LLMs) [13], artificial intelligence models that are able to process and generate natural language. In these models, embeddings take a piece of text (a word, sentence, paragraph, or even a whole article) and convert it into an array of floating-point numbers, the embedding vector. This way, the artificial neural network can encode the linguistic information. Embeddings are a numerical representation of the semantic meaning of the content in a high-dimensional space. We used this representation to analyze items and check whether the proximity of items in this multi-dimensional space corresponds to post hoc loadings in personality test factor analysis.

2. Materials and Methods: Utilizing Item Embeddings for Semantic Analysis

Advanced large language models (LLMs) have set new benchmarks in processing and generating text that is understandable by humans, seeing widespread application across countless tasks by millions of users globally. Within the realm of psychology, the ability of LLMs to interpret, contextualize, and extrapolate from human language without prior exposure has sparked considerable interest, due to their potential for exploring unseen texts through a zero-shot approach [14-16]. This section outlines the methodology for leveraging LLMs to assess semantic similarities among test items, determining whether these similarities align with a hypothesized factorial structure subsequently identified in participant responses.

LLMs internally convert input text into vector representations, known as embeddings, learned during the training phase. Each word or sentence is transformed into a fixed-size vector of floating-point numbers. Research indicates that the space of these embeddings possesses metric characteristics, allowing similar concepts, such as colors or synonyms, to be mapped closer together than disparate ones [18-19]. By exploiting this feature, embeddings serve as a tool for gauging semantic similarity across various domains.

We propose employing embeddings from established psychological tests to examine item similarities and verify whether the test exhibits the anticipated factorial structure by focusing solely on the items, excluding participant responses. To this end, we employ the Bidirectional Encoder Representations from Transformers (BERT) model [20], a pre-trained, open-source language model developed by Google. BERT, which has been trained on over five billion sentences from Wikipedia and the Google Books Corpus, is trained to predict missing words in sentences. Since its introduction in 2018, numerous enhancements to the BERT model have been proposed. Our methodology utilizes RoBERTa, a variant that has achieved top results on standard LLM benchmarks [21]. As an initial step, we apply this model to the Big Five personality test [22], mapping each of its 50 items into the embedding space to obtain a 1024-dimensional vector representation for each item.
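The item-to-vector mapping can be sketched as follows. This is an illustrative data-flow sketch only: the deterministic toy `embed` function is a stand-in we introduce so the snippet runs without model downloads, and the sentence-transformers call shown in the comment (with an assumed RoBERTa-based checkpoint name) is one plausible way to obtain real 1024-dimensional embeddings, not necessarily the exact setup used here.

```python
import hashlib
import numpy as np

# Stand-in for the real encoder. In practice one might use a RoBERTa-based
# sentence-transformer (model name is an assumption, not the paper's exact choice):
#   from sentence_transformers import SentenceTransformer
#   E = SentenceTransformer("all-roberta-large-v1").encode(items)  # (n_items, 1024)
def embed(text: str, dim: int = 1024) -> np.ndarray:
    """Deterministic toy embedding: seeds a RNG from a hash of the text."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(dim)

items = [
    "Is open to trying new experiences",         # openness item
    "Is anxious about the future all the time",  # neuroticism item
]
E = np.stack([embed(t) for t in items])  # item-by-dimension matrix
print(E.shape)  # (2, 1024)
```

The rest of the pipeline only needs the item-by-dimension matrix `E`, so the encoder can be swapped freely.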
To determine the proximity of embeddings for
different items within this space, we use cosine
similarity, which calculates the cosine of the angle
between two vectors. This measure, dependent solely
on the angle and not the magnitude of vectors, is
obtained by dividing the dot product of two vectors by
the product of their magnitudes.
cosine similarity(A, B) = (A · B) / (‖A‖ ‖B‖)    (1)
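Applied to every pair of item embeddings, Equation (1) yields a full similarity matrix. A minimal numpy sketch (illustrative, not the paper's own code):

```python
import numpy as np

def cosine_similarity_matrix(E: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for the rows of E (items x dimensions).

    Implements Equation (1): S[i, j] = (E[i] . E[j]) / (||E[i]|| * ||E[j]||).
    """
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    U = E / norms   # unit-normalize each item vector
    return U @ U.T  # dot products of unit vectors are cosines

# Tiny 2-dimensional example with three "items"
E = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
S = cosine_similarity_matrix(E)
# Diagonal is 1; items 0 and 2 are parallel (cosine 1) and both orthogonal to item 1
print(np.round(S, 2))
```

Because the measure ignores vector magnitude, item 2 is maximally similar to item 0 even though it is three times longer.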
The cosine similarity equation, where A and B represent the embeddings of two different items, produces a similarity matrix. This matrix has various applications, such as clustering, to observe whether the embeddings align with the hypothesized factorial structure, or directly applying principal component analysis. The cosine similarity matrix, particularly when the vectors are centered to have zero mean, equates to the Pearson correlation matrix. Thus, by analyzing items through the lens of LLMs, we can deduce whether a test's structure conforms to the expected factorial arrangement. Moreover, this approach allows for the modification or elimination of items that either correlate poorly with others or duplicate the same construct, streamlining the testing process.

Figure 1. t-SNE projection of the embeddings in a two-dimensional space. Yellow points are Openness (O) items, green points are Conscientiousness (C) items, blue points are Neuroticism (N) items, black points are Extraversion (E) items, and violet points are Agreeableness (A) items.
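The PCA step on the correlation-like matrix, and the rank-based comparison of two sets of loadings used later, can be sketched as follows. This is our reconstruction under stated assumptions, not the paper's exact pipeline: we take loadings as eigenvectors of the matrix scaled by the square root of their eigenvalues, and the tie-free Spearman helper is a simplification.

```python
import numpy as np

def pca_loadings(R: np.ndarray, n_factors: int) -> np.ndarray:
    """Loadings from a correlation-like matrix: eigenvectors scaled by sqrt(eigenvalues)."""
    eigvals, eigvecs = np.linalg.eigh(R)           # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_factors]  # keep the largest components
    return eigvecs[:, order] * np.sqrt(eigvals[order])

def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman rank correlation (no tie handling): Pearson correlation of the ranks."""
    rx, ry = np.argsort(np.argsort(x)), np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Toy correlation matrix with two clear item blocks (items 0-1 vs items 2-3),
# mimicking two constructs whose items correlate strongly within a block
R = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])
L = pca_loadings(R, n_factors=2)
# The second component separates the two blocks: items in the same block
# load with the same sign, items in different blocks with opposite signs.
```

In the comparison with human data, `spearman` would be applied column-wise to the embedding-derived and response-derived loading matrices.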
3. Results: Alignment with Human Responses

Our analysis aims to determine whether embeddings derived from large language models can successfully identify semantic similarities among items and predict human responses to them. To this end, we first examine whether construct-related items are close in the embedding space. Figure 1 shows a t-SNE [23] projection of the 1024-dimensional embeddings of the Big Five items into a two-dimensional space. The colors and letters indicate the factors underlying the items; for example, yellow and 'O' represent openness, with the numbers specifying the particular item. Different categories of items are mapped close together and occupy substantially different zones of the space. There is some overlap, due both to single items that are more difficult to classify and to specific constructs whose meanings overlap even for humans, such as Agreeableness and Extraversion.

We then applied Principal Component Analysis (PCA) to the cosine correlation matrix of these embeddings, which revealed a latent space that accurately groups items belonging to the same construct under a single principal component. This process not only facilitates the interpretation of outcomes but also confirms the theoretically anticipated structure. The focus now shifts to comparing the latent structure uncovered by the semantic similarity analysis with that derived from actual human responses. Specifically, we aim to investigate whether the item loadings generated from the embedded item representations mirror those obtained from analyzing human responses. By examining the correlation between the two sets of loadings, we gain insight into the extent to which item semantics predict human response patterns. Ideally, a high correlation between construct loadings would indicate not only that related items are grouped accurately but also that both cross-loadings and other factor loadings exhibit similar trends.

For this purpose, we sourced human response data for the Big Five personality test from the Open Psychometrics website. We then replicated the previously outlined analysis on this response data, starting with the calculation of Pearson correlations for each pair of items, followed by PCA to determine item loadings based on the theoretical number of latent factors. For reverse-phrased items, we adjusted the scales to align with the correct direction, a necessary step because the embedding model does not differentiate between reversed and non-reversed items. This adjustment ensures that the comparison of loadings between the models remains valid.

The results, depicted in Figure 2, show the Spearman correlation coefficients for the Big Five test examined. We observed a high correlation between the embedding loadings and the human response loadings within the same constructs (R > 0.4, p-value << 0.001). Additionally, a significant correlation (R > 0.4, p-value << 0.001) was noted between the constructs of Agreeableness and Extraversion. These findings suggest that the semantic similarities among items effectively reflect the relationships among factors found in human subjects.

Figure 2. Spearman correlation between item embedding loadings (x-axis) and human response loadings (y-axis). A Spearman r greater than 0.40 indicates a significant correlation between the loadings (p-value << 0.001).

4. Conclusions

From our analysis, we found that the t-SNE projection of the embeddings maps items related to similar constructs close together in the embedding space. Despite some overlap due to ambiguous items or constructs with overlapping meanings, such as Agreeableness and Extraversion, the overall pattern suggests that embeddings derived from large language models capture semantic similarities among items.

Moreover, the application of Principal Component Analysis (PCA) to the cosine correlation matrix of the embeddings reveals a latent space in which items belonging to the same construct are grouped under a single principal component. This alignment with the theoretical structure provides further validation of the embedding analysis. Turning to the comparison with human response data, the correlation between the item loadings derived from the embedded item representations and those obtained from analyzing human responses indicates a significant relationship. The high correlation observed, particularly within the same constructs, suggests that the semantic similarities captured by the embeddings effectively mirror the relationships observed in human subjects. Notably, a significant correlation is observed between the constructs of Agreeableness and Extraversion, indicating a meaningful association between these factors in both the embedding analysis and the human responses. Overall, these findings support the notion that embeddings derived from large language models can successfully identify semantic similarities among items and thus serve as a valid preliminary measure in the context of survey design.

References

[1] Kline, T. J. (2005). Psychological testing: A practical approach to design and evaluation. Sage Publications.
[2] Mallinckrodt, B., Miles, J. R., & Recabarren, D. A. (2016). Using focus groups and Rasch item response theory to improve instrument development. The Counseling Psychologist, 44(2), 146-194.
[3] Cole, D. A. (1987). Utility of confirmatory factor analysis in test validation research. Journal of Consulting and Clinical Psychology, 55(4), 584.
[4] Sternberg, R. J. (1980). Factor theories of intelligence are all right almost. Educational Researcher, 9(8), 6-18.
[5] Carroll, J. B. (2013). A three-stratum theory of intelligence: Spearman's contribution. In Human abilities (pp. 1-17). Psychology Press.
[6] Zuckerman, M., Kuhlman, D. M., Thornquist, M., & Kiers, H. (1993). Five (or three) robust questionnaire scale factors of personality without culture. Personality and Individual Differences, 14(4), 569-578.
[7] Eysenck, H. J. (1953). The structure of human personality (Psychology Revivals). Routledge.
[8] Fiske, D. W. (1949). Consistency of the factorial structures of personality ratings from different sources. The Journal of Abnormal and Social Psychology, 44(3), 329.
[9] Costa Jr, P. T., & McCrae, R. R. (1992). Four ways five factors are basic. Personality and Individual Differences, 13(6), 653-665.
[10] Jebb, A. T., Ng, V., & Tay, L. (2021). A review of key Likert scale development advances: 1995-2019. Frontiers in Psychology, 12, 637547.
[11] Caprara, G. V., Barbaranelli, C., Borgogni, L., & Perugini, M. (1993). The "Big Five Questionnaire": A new questionnaire to assess the five factor model. Personality and Individual Differences, 15(3), 281-288.
[12] Caprara, G. V., Barbaranelli, C., Borgogni, L., & Vecchione, M. (2008). BFQ-2. Big Five Questionnaire, 2.
[13] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
[14] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.
[15] Binz, M., & Schulz, E. (2023). Turning large language models into cognitive models. arXiv preprint arXiv:2306.03917.
[16] Buschoff, L. M. S., Akata, E., Bethge, M., & Schulz, E. (2023). Have we built machines that think like people? arXiv preprint arXiv:2311.16093.
[17] Chuang, Y. S., Goyal, A., Harlalka, N., Suresh, S., Hawkins, R., Yang, S., ... & Rogers, T. T. (2023). Simulating opinion dynamics with networks of LLM-based agents. arXiv preprint arXiv:2311.09618.
[18] Yan, F., Fan, Q., & Lu, M. (2018). Improving semantic similarity retrieval with word embeddings. Concurrency and Computation: Practice and Experience, 30(23), e4489.
[19] Colla, D., Mensa, E., & Radicioni, D. P. (2020). Novel metrics for computing semantic similarity with sense embeddings. Knowledge-Based Systems, 206, 106346.
[20] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[21] https://www.sbert.net/docs/pretrained_models.html
[22] McCrae, R. R., & Costa, P. T. (1985). Updating Norman's "adequacy taxonomy": Intelligence and personality dimensions in natural language and in questionnaires. Journal of Personality and Social Psychology, 49(3), 710.
[23] Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).