LLM embeddings on test items predict post hoc loadings in personality tests

Monica Casella1, Maria Luongo2, Davide Marocco3, Nicola Milano4*, Michela Ponticorvo5

1 University of Naples Federico II, Department of Humanistic Studies, Natural and Artificial Cognition Laboratory "Orazio Miglino", via Porta di Massa 1, Naples, 80125, Italy




                                                    Abstract

                                                    In this article we examine the application of Large Language Models (LLMs) in predicting factor
                                                    loadings in personality tests through the semantic analysis of test items. By leveraging text
                                                    embeddings generated from LLMs, we assess the semantic similarity of test items and their
                                                    alignment with hypothesized factorial structures, without relying on human response data. Our
                                                    methodology uses embeddings from the Big Five personality test to explore correlations
                                                    between item semantics and their grouping in factorial analyses. Preliminary results suggest
                                                    that LLM-derived embeddings can effectively capture semantic similarities among test items,
                                                    potentially serving as a valid measure for initial survey design and refinement. This approach
                                                    offers insights into the robustness of embedding techniques in psychological evaluations,
                                                    indicating a significant correlation with traditional test structures and providing a novel
                                                    perspective on test item analysis.

                                                    Keywords
Embeddings, language models, personality traits, semantic similarity



1. Introduction

In psychological testing and, more generally, in the many contexts that involve evaluation, it is very important to assess item quality. This process, known as item analysis, entails the evaluation of several item characteristics, including item difficulty, item discrimination, and item and test reliability.
The two main approaches to item analysis, classical test theory (CTT) and item response theory (IRT) (see [1]), share the same starting point: a person-by-item matrix, in which examinees are rows and items are columns. This raises some problems; for example, the matrix can be sparse when many missing data are present.
Whereas item formulation relies on procedures that come before test (and item) administration, for example focus groups with experts [2], item analysis is typically a post hoc analysis that uses data coming from the administration of the test to a sample of respondents. In this phase, together with item characteristics, the test as a whole is evaluated, especially in the CTT framework, including the study of reliability and of test structure in terms of latent variables by means of factor analysis [3]. This is a consolidated approach.
Factor analysis is used to reduce a large number of variables to a smaller number of factors that have a psychological meaning. The technique extracts the maximum common variance from all variables and expresses it in a common score that is used for further analysis. It also allows the calculation of factor loadings, which are essentially the correlation coefficients between the single variables and the factors. Factor analysis has been widely used in psychological research, both in the cognitive domain (consider the factorial theories of intelligence [4-5]) and in the personality domain. In the latter case, among the



Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author.
nicola.milano@unina.it
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
theoretical proposals up to the 1990s ([6-7]), the Big Five theory of personality is a notable example of how personality can be conceived and described as a constellation of different dimensions, that is, of factors in the terms used above. Evidence for this theory has grown over the years, and the five broad personality traits it describes are identified as extraversion, agreeableness, openness, conscientiousness, and neuroticism. This approach originated in 1949 with [8] and was later expanded by other researchers, up to the work of [9].
For the development of this approach, a key role has been played by trait measurement. Traditionally, a Big Five personality test is administered as a questionnaire with responses on a Likert scale [10]. The Big Five Questionnaire [11], and its newer version, the BFQ-2 [12], is used in many contexts and represents a gold standard for measuring personality according to the Big Five theory. It consists of 132 items. Some questions ask how much a person agrees or disagrees that he or she is someone who exemplifies a specific statement, such as "Is open to trying new experiences" (for openness, or open-mindedness) or "Is anxious about the future all the time" (for neuroticism, or negative emotionality). The responses, from "Strongly agree" to "Strongly disagree" (with alternatives in between), determine to what degree the respondent shows that specific trait.
In this contribution, we propose a method to gauge the strength of the connection between items and factors based only on the items themselves, considered as linguistic material, before test administration. To do this, we used large language models (LLMs) [13], artificial intelligence models that are able to process and generate natural language. In these models, embeddings take a piece of text (a word, sentence, paragraph, or even a whole article) and convert it into an array of floating-point numbers, the embedding vector. In this way, the artificial neural network can encode the linguistic information. Embeddings are a numerical representation of the semantic meaning of the content in a high-dimensional space. We used this representation to analyze the items and check whether the proximity of items in this multi-dimensional space corresponds to post hoc loadings in personality test factor analysis.

2. Materials and Methods: Utilizing Item Embeddings for Semantic Analysis

Advanced large language models (LLMs) have set new benchmarks in processing and generating human-readable text and see widespread application across countless tasks by millions of users globally. Within psychology, the ability of LLMs to interpret, contextualize, and extrapolate from human language without prior exposure has sparked considerable interest, owing to their potential for exploring unseen texts through a zero-shot approach [14-16]. This section outlines our methodology for leveraging LLMs to assess semantic similarities among test items and for determining whether these similarities align with the factorial structure subsequently identified in participant responses.
During training, LLMs internally learn to convert input text into vector representations known as embeddings: each word or sentence is transformed into a fixed-size vector of floating-point numbers. Research indicates that the space of these embeddings possesses metric characteristics [18], so that similar concepts, such as colors or synonyms, are mapped closer together than disparate ones [18-19]. By exploiting this property, embeddings serve as a tool for gauging semantic similarity across various domains.
We propose employing embeddings of items from established psychological tests to examine item similarities and to verify whether a test exhibits the anticipated factorial structure by focusing solely on the items, excluding participant responses. To this end, we employ the Bidirectional Encoder Representations from Transformers (BERT) model [20], a pre-trained, open-source language model developed by Google. BERT, which has been trained on over five billion sentences from Wikipedia and the Google Books Corpus, is trained to predict missing words in sentences. Since its introduction in 2018, numerous enhancements of the BERT model have been proposed; our methodology utilizes roBERTa, a variant that has achieved top results on standard LLM benchmarks [21]. As an initial step, we apply the method to the Big5 personality test [22], mapping each of its 50 items into the embedding space to obtain a 1024-dimensional vector representation of each item.
To determine the proximity of the embeddings of different items within this space, we use cosine similarity, which calculates the cosine of the angle between two vectors. This measure, which depends solely on the angle and not on the magnitude of the vectors, is obtained by dividing the dot product of the two vectors by the product of their magnitudes:

    cosine similarity(A, B) = (A · B) / (‖A‖ ‖B‖)        (1)
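Equation (1), applied to every pair of item embeddings, can be sketched in a few lines of numpy. This is an illustrative stand-in only: the random 5 x 8 matrix below plays the role of the paper's 50 items embedded as 1024-dimensional roBERTa vectors.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Pairwise cosine similarity (Eq. 1) between the rows of E."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    U = E / norms              # rescale every embedding to unit length
    return U @ U.T             # dot products of unit vectors = cosines

# Illustrative stand-in embeddings: 5 "items" x 8 dimensions.
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))
S = cosine_similarity_matrix(E)
```

By construction, S is symmetric, has ones on the diagonal, and all entries lie in [-1, 1], which is what makes it usable as a correlation-like input for the clustering and PCA steps described below.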


The cosine similarity of Equation (1), where A and B represent the embeddings of two different items, produces a similarity matrix when computed for every pair of items. This matrix has various applications, such as clustering, to observe whether the embeddings align with the hypothesized factorial structure, or directly applying principal component analysis. The cosine similarity matrix, in particular when the vectors are centered to have zero mean, equates to the Pearson correlation matrix. Thus, by analyzing items through the lens of LLMs, we can assess whether a test's structure conforms to the expected factorial arrangement. Moreover, this approach allows for the modification or elimination of items that either correlate poorly with the others or duplicate the same construct, streamlining the test development process.

Figure 1. t-SNE projection of the embeddings in a two-dimensional space. Yellow points are Openness (O) items, green points are Conscientiousness (C) items, blue points are Neuroticism (N) items, black points are Extraversion (E) items, and violet points are Agreeableness (A) items.
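The equivalence between centered cosine similarity and the Pearson correlation can be checked numerically. A minimal sketch, with random vectors standing in for item embeddings (not the paper's actual data):

```python
import numpy as np

# Four illustrative "item embedding" vectors of 100 dimensions each.
rng = np.random.default_rng(1)
E = rng.normal(size=(4, 100))

# Cosine similarity of the mean-centered vectors...
C = E - E.mean(axis=1, keepdims=True)
U = C / np.linalg.norm(C, axis=1, keepdims=True)
cos_centered = U @ U.T

# ...equals the Pearson correlation matrix of the original vectors,
# since r(x, y) is the cosine of the angle between x - mean(x) and
# y - mean(y).
pearson = np.corrcoef(E)
```

Here `cos_centered` and `pearson` agree to floating-point precision, which is why a PCA of the centered cosine matrix can be read as an ordinary correlation-matrix PCA.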
3. Results: Alignment with Human Responses

Our analysis aims to determine whether embeddings derived from large language models can successfully identify semantic similarities among items and predict human responses to them. To this end, we first examine whether construct-related items are similar in the embedding space. Figure 1 shows a t-SNE [23] projection of the 1024-dimensional embeddings of the Big Five items into a two-dimensional space. The colors and letters indicate the factors underlying the items; for example, yellow and 'O' represent openness, with the numbers specifying the particular item. Items of the same category are mapped close together and occupy substantially different zones of the space. There is some overlap, due both to single items that are more difficult to classify and to specific constructs whose meanings overlap even for humans, such as Agreeableness and Extraversion.
We then applied Principal Component Analysis (PCA) to the cosine correlation matrix of these embeddings, which revealed a latent space that accurately groups items belonging to the same construct under a single principal component. This process not only facilitates the interpretation of outcomes but also confirms the theoretically anticipated structure. The focus now shifts to comparing the latent structure uncovered by the semantic similarity analysis with that derived from actual human responses. Specifically, we aim to investigate whether the item loadings generated from the embedded item representations mirror those obtained from analyzing human responses. By examining the correlation between the two sets of loadings, we gain insight into the extent to which item semantics predict human response patterns. Ideally, a high correlation between construct loadings would indicate not only that related items are grouped accurately but also that cross-loadings and other factor loadings exhibit similar trends.
For this purpose, we sourced human response data for the Big 5 personality test from the Open Psychometrics website. We then replicated the previously outlined analysis on this response data, starting with the calculation of Pearson correlations for each pair of items, followed by PCA to determine item loadings based on the theoretical number of latent factors. For items phrased in reverse, we
adjusted their scales to align with the correct direction, a necessary step because the embedding model does not differentiate between reversed and non-reversed items. This adjustment ensures that the comparison of loadings between the models remains valid.
The results, depicted in Figure 2, show the Spearman correlation coefficients for the Big 5 test examined. We observed a high correlation between the embedding loadings and the human response loadings within the same constructs (R > 0.4, p-value << 0.001). Additionally, a significant correlation (R > 0.4, p-value << 0.001) was noted between the constructs of Agreeableness and Extraversion. These findings suggest that the semantic similarities among items effectively reflect the relationships among factors as found in human subjects.

Figure 2. Spearman correlation between item embedding loadings (x-axis) and human response loadings (y-axis). A Spearman r greater than 0.40 indicates a significant correlation between the loadings (p-value << 0.001).

4. Conclusions

From our analysis we found that the t-SNE projection of the embeddings maps items related to similar constructs close together in the embedding space. Despite some overlap due to ambiguous items or constructs with overlapping meanings, such as Agreeableness and Extraversion, the overall pattern suggests that embeddings derived from large language models capture semantic similarities among items.
Moreover, the application of Principal Component Analysis (PCA) to the cosine correlation matrix of the embeddings reveals a latent space in which items belonging to the same construct are grouped under a single principal component. This alignment with the theoretical structure provides further validation of the embedding analysis. Regarding the comparison with human response data, the correlation between the item loadings derived from the embedded item representations and those obtained from analyzing human responses indicates a significant relationship. The high correlation observed, particularly within the same constructs, suggests that the semantic similarities captured by the embeddings effectively mirror the relationships observed in human subjects. Notably, a significant correlation is observed between the constructs of Agreeableness and Extraversion, indicating a meaningful association between these factors in both the embedding analysis and the human responses. Overall, these findings support the notion that embeddings derived from large language models can successfully identify semantic similarities among items and can therefore serve as a valid preliminary measure in the context of survey design.

References

[1] Kline, T. J. (2005). Psychological testing: A practical approach to design and evaluation. Sage Publications.
[2] Mallinckrodt, B., Miles, J. R., & Recabarren, D. A. (2016). Using focus groups and Rasch item response theory to improve instrument development. The Counseling Psychologist, 44(2), 146-194.
[3] Cole, D. A. (1987). Utility of confirmatory factor analysis in test validation research. Journal of Consulting and Clinical Psychology, 55(4), 584.
[4] Sternberg, R. J. (1980). Factor theories of intelligence are all right almost. Educational Researcher, 9(8), 6-18.
[5] Carroll, J. B. (2013). A three-stratum theory of intelligence: Spearman's contribution. In Human abilities (pp. 1-17). Psychology Press.
[6] Zuckerman, M., Kuhlman, D. M., Thornquist, M., & Kiers, H. (1993). Five (or three) robust questionnaire scale factors of personality without culture. Personality and Individual Differences, 14(4), 569-578.
[7] Eysenck, H. J. (1953). The structure of human personality (Psychology Revivals). Routledge.
[8] Fiske, D. W. (1949). Consistency of the factorial structures of personality ratings from different sources. The Journal of Abnormal and Social Psychology, 44(3), 329.
[9] Costa Jr, P. T., & McCrae, R. R. (1992). Four ways five factors are basic. Personality and Individual Differences, 13(6), 653-665.
[10] Jebb, A. T., Ng, V., & Tay, L. (2021). A review of key Likert scale development advances: 1995-2019. Frontiers in Psychology, 12, 637547.
[11] Caprara, G. V., Barbaranelli, C., Borgogni, L., & Perugini, M. (1993). The "Big Five Questionnaire": A new questionnaire to assess the five factor model. Personality and Individual Differences, 15(3), 281-288.
[12] Caprara, G. V., Barbaranelli, C., Borgogni, L., & Vecchione, M. (2008). BFQ-2. Big Five Questionnaire, 2.
[13] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
[14] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.
[15] Binz, M., & Schulz, E. (2023). Turning large language models into cognitive models. arXiv preprint arXiv:2306.03917.
[16] Buschoff, L. M. S., Akata, E., Bethge, M., & Schulz, E. (2023). Have we built machines that think like people? arXiv preprint arXiv:2311.16093.
[17] Chuang, Y. S., Goyal, A., Harlalka, N., Suresh, S., Hawkins, R., Yang, S., ... & Rogers, T. T. (2023). Simulating opinion dynamics with networks of LLM-based agents. arXiv preprint arXiv:2311.09618.
[18] Yan, F., Fan, Q., & Lu, M. (2018). Improving semantic similarity retrieval with word embeddings. Concurrency and Computation: Practice and Experience, 30(23), e4489.
[19] Colla, D., Mensa, E., & Radicioni, D. P. (2020). Novel metrics for computing semantic similarity with sense embeddings. Knowledge-Based Systems, 206, 106346.
[20] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[21] https://www.sbert.net/docs/pretrained_models.html
[22] McCrae, R. R., & Costa, P. T. (1985). Updating Norman's "adequacy taxonomy": Intelligence and personality dimensions in natural language and in questionnaires. Journal of Personality and Social Psychology, 49(3), 710.
[23] Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).