<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LLM embeddings on test items predict post hoc loadings in personality tests.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Luongo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Marocco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Milano</string-name>
          <email>nicola.milano@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michela</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ital-IA 2024: 4th National Conference on Artificial Intelligence</institution>
          ,
          <addr-line>organized by CINI</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Naples Federico II, Department of humanistic studies, Natural and artificial cognition laboratory “Orazio Miglino”</institution>
          ,
          <addr-line>via Porta di Massa 1, Naples, 80125</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this article we examine the application of Large Language Models (LLMs) in predicting factor loadings in personality tests through the semantic analysis of test items. By leveraging text embeddings generated from LLMs, we assess the semantic similarity of test items and their alignment with hypothesized factorial structures, without relying on human response data. Our methodology uses embeddings from the Big Five personality test to explore correlations between item semantics and their grouping in factorial analyses. Preliminary results suggest that LLM-derived embeddings can effectively capture semantic similarities among test items, potentially serving as a valid measure for initial survey design and refinement. This approach offers insights into the robustness of embedding techniques in psychological evaluations, indicating a significant correlation with traditional test structures and providing a novel perspective on test item analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>Embeddings</kwd>
        <kwd>language models</kwd>
        <kwd>personality traits</kwd>
        <kwd>semantic similarity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In psychological testing and, more generally, in any
context that involves evaluation, it is very
important to assess item quality. This process, known
as item analysis, involves the evaluation of different
item characteristics, including item difficulty, item
discrimination, and item and test reliability.</p>
      <p>
        The two main approaches to item analysis,
classical test theory (CTT) and item response theory
(IRT) (see [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), share the starting point of a person-by-item
matrix: a matrix where examinees are rows
and items are columns. This raises some problems; for
example, the matrix can be sparse if many data are
missing.
      </p>
      <p>
        Whereas item formulation relies on procedures
that come before the administration of the test (and its items),
for example focus groups with experts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], item
analysis is typically based on post hoc analysis that
uses data coming from the administration to a sample
of respondents. In this phase, together with item
characteristics, the test itself is evaluated,
especially in the framework of CTT, including the
study of reliability and of test structure in terms of latent
variables by means of factor analysis [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This is a
consolidated approach.
      </p>
      <p>
        Factor analysis is used to reduce a large number of
variables into a smaller number of factors, which have a
psychological meaning. This technique extracts the
maximum common variance from all variables and
expresses it as a common score that is used for
further analysis. The analysis also yields
factor loadings, which are essentially the correlation
coefficients between each variable and a factor.
Factor analysis has been widely used in psychological
research both in the cognitive domain (consider the
factorial theories of intelligence [
        <xref ref-type="bibr" rid="ref4 ref5">4-5</xref>
        ]) and in the
personality domain. In the latter case, among the
theoretical proposals up to the 1990s ([
        <xref ref-type="bibr" rid="ref6 ref7">6-7</xref>
        ]), the Big
Five theory of personality is a notable example of how
personality can be conceived and described as a
constellation of different dimensions, or factors in the
terms used above. Evidence for this theory
has grown over the years, and the five broad personality
traits it describes are
extraversion, agreeableness, openness,
conscientiousness, and neuroticism. This approach
was developed in 1949 by [8] and later expanded by
other researchers up to the work by [9].
      </p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).
      </p>
      <p>Trait measurement has played a key role in the
development of this approach. Traditionally, a
Big Five personality test is administered as a questionnaire
with responses on a Likert scale [10]. The Big Five
Questionnaire [11] and its newer version, the BFQ-2 [12],
are used in different contexts and represent a gold
standard for measuring personality according to the
Big Five theory. It consists of 132 items. Items
ask how much respondents agree or disagree
that a specific statement describes them,
such as: “Is open to trying new
experiences” (for openness, or open-mindedness) or
“Is anxious about the future all the time” (for
neuroticism, or negative emotionality). The
responses, from “Strongly agree” to “Strongly disagree”
(with alternatives in between), determine to what
degree the respondent shows that specific trait.</p>
      <p>In this contribution, we propose a method to
estimate the strength of the connection between
items and factors based only on the items, considered
as linguistic material, before the test is administered.
To do this, we used large language
models (LLMs) [13], artificial intelligence models that are able
to process and generate natural language. In these
models, an embedding maps a piece of text (a word,
sentence, paragraph or even a whole article)
to an array of floating-point numbers,
the embedding vector. In this way, the artificial neural
network can encode the linguistic information.
Embeddings are a numerical representation of the
semantic meaning of the content in a
high-dimensional space.</p>
      <p>We used this representation to analyze items and
check whether the proximity of items in this
multidimensional space corresponds to post hoc loadings
in personality test factor analysis.</p>
    </sec>
    <sec id="sec-4">
      <title>2. Materials and Methods: Utilizing Item Embeddings for Semantic Analysis</title>
      <p>Advanced large language models (LLMs) have set
new benchmarks in processing and generating text
that is understandable by humans, and are used
across countless tasks by millions of people
globally. Within psychology, the ability of
LLMs to interpret, contextualize, and extrapolate from
human language without prior exposure has sparked
considerable interest because of their potential for
exploring unseen texts through a zero-shot approach
[14-16]. This section outlines the methodology for
using LLMs to assess semantic similarities
among test items and to determine whether these similarities
align with a hypothesized factorial structure
subsequently identified in participant responses.
LLMs internally convert input text into vector
representations, known as embeddings, which are learned during the
training phase. Each word or sentence is transformed
into a fixed-size vector of floating-point numbers.
Research indicates that the space of these embeddings
possesses metric characteristics [18], so that
similar concepts, such as colors or
synonyms, are mapped closer together than disparate ones
[18, 19]. Thanks to this property, embeddings serve as a
tool for gauging semantic similarity across various
domains.</p>
      <p>We propose employing embeddings of the items of established
psychological tests to examine item similarities and
verify whether the test exhibits the anticipated
factorial structure by focusing solely on the items,
excluding participant responses. To this end, we
employ the Bidirectional Encoder Representations
from Transformers (BERT) model [20], a pre-trained,
open-source language model developed by Google.
BERT, which was trained on about 3.3 billion
words from English Wikipedia and the BooksCorpus,
learns to predict missing words in sentences.
Since its introduction in 2018, numerous
enhancements to the BERT model have been
suggested. Our methodology uses RoBERTa, a
variant that has achieved top results on standard LLM
benchmarks [21]. As an initial step, we apply it to the Big5
personality test [22], mapping each of its 50 items into
the embedding space with BERT to obtain a
1024-dimensional vector representation for each item.
To determine the proximity of embeddings for
different items within this space, we use cosine
similarity, which calculates the cosine of the angle
between two vectors. This measure, which depends solely
on the angle and not on the magnitude of the vectors, is
obtained by dividing the dot product of two vectors by
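the product of their magnitudes.
<disp-formula id="eq1">
<label>(1)</label>
<tex-math><![CDATA[\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}]]></tex-math>
</disp-formula>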
Here A and B represent the embeddings of two different items;
computing this measure for every pair of items
produces a similarity matrix. This matrix has various
applications, such as clustering to observe whether the
embeddings align with the hypothesized factorial
structure, or directly applying principal component
analysis. When the vectors are centered
to have zero mean, cosine similarity
coincides with the Pearson correlation coefficient. Thus,
by analyzing items through the lens of LLMs, we can
assess whether the structure of our test conforms to the
expected factorial arrangement. Moreover, this
approach allows for the modification or elimination of
items that either correlate poorly with others or
duplicate the same construct, streamlining the test
process.</p>
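      <p>The cosine similarity computation described above can be sketched in a few lines of Python with NumPy. This is a minimal sketch under stated assumptions, not the code used in the study: the function name cosine_similarity_matrix is hypothetical, and the random toy_embeddings array only stands in for the 50-by-1024 matrix of item embeddings. The last lines illustrate that centering each vector makes cosine similarity coincide with the Pearson correlation.</p>
      <preformat preformat-type="code" xml:space="preserve">
```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Return the pairwise cosine similarity between the rows of `embeddings`."""
    X = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms  # every row now has unit length
    return unit @ unit.T  # dot products of unit vectors are cosines


# Assumption: random vectors stand in for real item embeddings
# (6 toy "items", 8 dimensions instead of 50 items and 1024 dimensions).
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 8))

similarity = cosine_similarity_matrix(toy_embeddings)

# Subtracting each vector's own mean before computing the cosine
# gives the Pearson correlation between the two vectors.
centered = toy_embeddings - toy_embeddings.mean(axis=1, keepdims=True)
pearson_like = cosine_similarity_matrix(centered)
```
      </preformat>
      <p>With real item embeddings in place of the toy array, the result would be a 50-by-50 symmetric matrix with ones on the diagonal, which can then be passed to clustering or to principal component analysis.</p>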
    </sec>
    <sec id="sec-6">
      <title>3. Results: Alignment with Human Responses</title>
      <p>Our analysis aims to determine whether embeddings
derived from large language models can successfully
identify semantic similarities among items and
predict human responses to them. To this end, we first
examine whether construct-related items are similar
in the embedding space.
Figure 1 shows a t-SNE [23] projection of the
1024-dimensional embeddings of the Big Five items into a
two-dimensional space. The colors and letters
indicate the factors underlying the items; for example,
yellow and 'O' represent openness, with the numbers
specifying the particular item. Items from the same category
are mapped close together, and different categories occupy
substantially different zones of the space. There is
some overlap, due both to single items that are more
difficult to classify and to specific constructs that
share overlapping meanings, even for humans, such as
Agreeableness and Extraversion.</p>
      <p>We then applied Principal Component Analysis (PCA)
to the cosine similarity matrix of these embeddings,
which revealed a latent space that accurately groups
items belonging to the same construct under a single
principal component. This result facilitates
the interpretation of outcomes and is consistent with the
theoretically anticipated structure. We next
compared the latent structure obtained
from the semantic similarity analysis with that
derived from actual human responses. Specifically, we
investigated whether the item loadings
generated from the embedded item representations
mirror those obtained from analyzing human
responses. The correlation between the
two sets of loadings indicates the extent to
which item semantics predict human response
patterns. Ideally, a high correlation between construct
loadings would indicate not only that related items
are grouped accurately but also that both
cross-loadings and other factor loadings exhibit similar
trends.</p>
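      <p>One way to obtain item loadings from a similarity matrix is an eigendecomposition, sketched below with NumPy. This is an illustrative sketch rather than the procedure used in the study: the function name pca_loadings and the 4-item toy_similarity matrix are hypothetical, the loadings are left unrotated, and scaling each eigenvector by the square root of its eigenvalue is one common convention for PCA loadings.</p>
      <preformat preformat-type="code" xml:space="preserve">
```python
import numpy as np


def pca_loadings(similarity, n_components):
    """Unrotated PCA loadings from a symmetric similarity matrix."""
    S = np.asarray(similarity, dtype=float)
    eigenvalues, eigenvectors = np.linalg.eigh(S)  # eigenvalues in ascending order
    order = np.argsort(eigenvalues)[::-1][:n_components]  # largest first
    top_values = np.clip(eigenvalues[order], 0.0, None)  # guard against tiny negatives
    # Loading = eigenvector scaled by the square root of its eigenvalue.
    return eigenvectors[:, order] * np.sqrt(top_values)


# Assumption: a toy matrix with two blocks of related items
# (items 0-1 and items 2-3); the values are invented for illustration.
toy_similarity = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
])

loadings = pca_loadings(toy_similarity, n_components=2)

# Component on which each item loads most strongly; the sign of an
# eigenvector is arbitrary, so absolute values are compared.
dominant = np.abs(loadings).argmax(axis=1)
```
      </preformat>
      <p>In this toy case the first two items load mainly on one component and the last two on the other. The same function can be applied to a Pearson correlation matrix of human responses, which makes the two sets of loadings directly comparable.</p>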
      <p>For this purpose, we sourced human response data
from the Open Psychometrics website for the Big 5
personality test. We then replicated the previously
outlined embedding analysis on this response data,
starting with the calculation of Pearson correlations
for each pair of items, followed by PCA to determine
item loadings based on the theoretical number of
latent factors. For items phrased in reverse, we
adjusted their scales to align with the correct
direction, a necessary step because the embedding
model does not differentiate between reversed and
non-reversed items. This adjustment ensures that the
comparison of loadings between the models remains
valid.</p>
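      <p>The reverse-scoring step can be illustrated with a short Python sketch. The function names, the 1-to-5 response scale, and the toy responses are assumptions made for illustration; the actual scale and the set of reverse-keyed items depend on the questionnaire.</p>
      <preformat preformat-type="code" xml:space="preserve">
```python
def reverse_score(response, scale_min=1, scale_max=5):
    """Flip a Likert response so that high values point in the keyed direction."""
    return scale_min + scale_max - response


def align_responses(rows, reversed_items, scale_min=1, scale_max=5):
    """Return a copy of `rows` with the reverse-keyed columns flipped."""
    aligned = []
    for row in rows:
        aligned.append([
            reverse_score(value, scale_min, scale_max) if index in reversed_items else value
            for index, value in enumerate(row)
        ])
    return aligned


# Assumption: three respondents, three items, item 1 is reverse-keyed.
raw = [[5, 1, 4], [2, 4, 2], [4, 2, 5]]
aligned = align_responses(raw, reversed_items={1})
```
      </preformat>
      <p>After this step, a high score on every item points in the same direction as its construct, so the loadings computed from responses can be compared with those computed from embeddings.</p>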
      <p>The results, depicted in Figure 2, show the Spearman
correlation coefficient for the Big 5 test examined. We
observed a high correlation between the embedding
loadings and the human response loadings within the
same constructs (R &gt; 0.4, p-value &lt;&lt; 0.001).
Additionally, a significant correlation (R &gt; 0.4, p-value
&lt;&lt; 0.001) was noted between the constructs of
Agreeableness and Extraversion. These findings
suggest that the semantic similarities among items
effectively reflect the relationships among factors as
found in human subjects.</p>
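      <p>The comparison between the two sets of loadings relies on the Spearman correlation, that is, the Pearson correlation computed on ranks. The sketch below uses only the Python standard library; the function names and the two short lists of loadings are hypothetical and are not results from the study.</p>
      <preformat preformat-type="code" xml:space="preserve">
```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Assumption: invented loadings of five items on one construct.
embedding_loadings = [0.71, 0.64, 0.12, 0.05, 0.58]
response_loadings = [0.80, 0.55, 0.20, 0.10, 0.62]
rho = spearman(embedding_loadings, response_loadings)
```
      </preformat>
      <p>Because it works on ranks, this coefficient is insensitive to differences in scale between loadings obtained from embeddings and loadings obtained from responses.</p>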
    </sec>
    <sec id="sec-7">
      <title>4. Conclusions</title>
      <p>From our analysis we found that the t-SNE projection
of the embeddings maps items related to similar
constructs close together in the embedding space.
Despite some overlap due to ambiguous items or
constructs with overlapping meanings, such as
Agreeableness and Extraversion, the overall pattern
suggests that embeddings derived from large
language models capture semantic similarities among
items.</p>
      <p>Moreover, the application of Principal Component
Analysis (PCA) to the cosine similarity matrix of the
embeddings reveals a latent space where items
belonging to the same construct are grouped under a
single principal component. This alignment with the
theoretical structure provides further validation of
the embedding analysis. As for the comparison
with human response data, the correlation between
the item loadings derived from the embedded item
representations and those obtained from analyzing
human responses indicates a significant relationship.
The high correlation observed, particularly within the
same constructs, suggests that the semantic
similarities captured by the embeddings effectively
mirror the relationships observed in human subjects.
Notably, a significant correlation is observed between
the constructs of Agreeableness and Extraversion,
indicating a meaningful association between these
factors in both the embedding analysis and human
responses. Overall, these findings support the notion
that embeddings derived from large language models
can successfully identify semantic similarities among
items and can therefore serve as a valid preliminary measure in
the context of survey design.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Kline</surname>
            ,
            <given-names>T. J.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Psychological testing: A practical approach to design and evaluation</article-title>
          .
          <source>Sage publications.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Mallinckrodt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miles</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Recabarren</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Using focus groups and Rasch item response theory to improve instrument development</article-title>
          .
          <source>The Counseling Psychologist</source>
          ,
          <volume>44</volume>
          (
          <issue>2</issue>
          ),
          <fpage>146</fpage>
          -
          <lpage>194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Cole</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          (
          <year>1987</year>
          ).
          <article-title>Utility of confirmatory factor analysis in test validation research</article-title>
          .
          <source>Journal of consulting and clinical psychology</source>
          ,
          <volume>55</volume>
          (
          <issue>4</issue>
          ),
          <fpage>584</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Sternberg</surname>
            ,
            <given-names>R. J.</given-names>
          </string-name>
          (
          <year>1980</year>
          ).
          <article-title>Factor theories of intelligence are all right almost</article-title>
          .
          <source>Educational Researcher</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <fpage>6</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Carroll</surname>
            ,
            <given-names>J. B.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>A three-stratum theory of intelligence: Spearman's contribution</article-title>
          . In Human abilities (pp.
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          ). Psychology Press.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Zuckerman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuhlman</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thornquist</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kiers</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>1993</year>
          ).
          <article-title>Five (or three) robust questionnaire scale factors of personality without culture</article-title>
          .
          <source>Personality and individual Differences</source>
          ,
          <volume>14</volume>
          (
          <issue>4</issue>
          ),
          <fpage>569</fpage>
          -
          <lpage>578</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Eysenck</surname>
            ,
            <given-names>H. J.</given-names>
          </string-name>
          (
          <year>1953</year>
          ).
          <source>The structure of human personality (Psychology Revivals)</source>
          . Routledge.
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          <uri>https://www.sbert.net/docs/pretrained_models.html</uri>
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          <string-name>
            <surname>McCrae</surname>
            ,
            <given-names>R. R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Costa</surname>
            ,
            <given-names>P. T.</given-names>
          </string-name>
          (
          <year>1985</year>
          ).
          <article-title>Updating Norman's "adequacy taxonomy": Intelligence and personality dimensions in natural language and in questionnaires</article-title>
          .
          <source>Journal of personality and social psychology</source>
          ,
          <volume>49</volume>
          (
          <issue>3</issue>
          ),
          <fpage>710</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Van der Maaten</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Visualizing data using t-SNE</article-title>
          .
          <source>Journal of machine learning research</source>
          ,
          <volume>9</volume>
          (
          <issue>11</issue>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>