<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Variational Inference for the Partial Credit Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Frazzetto</string-name>
          <email>paolo.frazzetto@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolò Navarin</string-name>
          <email>nicolo.navarin@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Sperduti</string-name>
          <email>alessandro.sperduti@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Augmented Intelligence Center, Bruno Kessler Foundation</institution>
          ,
          <addr-line>38123 Povo (TN)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics “Tullio Levi-Civita”, University of Padova</institution>
          ,
          <addr-line>35121 Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Item Response Theory (IRT) models, particularly the Partial Credit Model (PCM), are indispensable in psychometrics for estimating unobserved latent abilities from questionnaires with ordered categorical responses. However, traditional estimation methods struggle with the scalability required for modern large-scale datasets. While Variational Autoencoders (VAEs) offer a promising path for scalable inference, their application to polytomous IRT models and the principled integration of respondent covariates remain underexplored. In this paper, we introduce a novel Variational Autoencoder framework for the Partial Credit Model (VA-PCM) that addresses these challenges. Our primary theoretical contribution is a psychometrically-informed generative model where the crucial ordering of item difficulty thresholds is guaranteed by construction. This is achieved by defining a Dirichlet prior over the proportions of the latent ability scale, which are then deterministically transformed into ordered thresholds via a stick-breaking process. Furthermore, we integrate respondent covariates as auxiliary variables whose distributions are explicitly conditioned on the latent abilities within a coherent probabilistic graphical model, allowing the model to leverage this information to enhance estimation. We present a complete proof-of-concept implementation using the Pyro probabilistic programming language, detailing the amortized inference architecture and deriving the Evidence Lower Bound (ELBO) for optimization. Preliminary experiments on synthetic data validate the framework's core mechanisms, demonstrating strong parameter recovery for the ordered item thresholds. The results also underscore the inherent difficulty of individual ability estimation from sparse response data, motivating clear directions for future work.
By providing a complete theoretical and implementation blueprint, this work lays the groundwork for scalable, nuanced, and data-rich psychometric analysis of ordered response data.</p>
      </abstract>
      <kwd-group>
        <kwd>Psychometrics</kwd>
        <kwd>Partial Credit Model</kwd>
        <kwd>Variational Autoencoder</kwd>
        <kwd>Ordinal Inference</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The quantitative measurement of unobserved human characteristics, such as cognitive abilities,
personality traits, or attitudes, is a cornerstone of modern psychology, education, and the social sciences.
Within this domain, Item Response Theory (IRT) has emerged as the dominant and most sophisticated
paradigm for modeling the relationship between an individual’s latent traits and their responses to a set
of items on a questionnaire or test [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. One of the most important models within the IRT family is the
Partial Credit Model (PCM), which provides a framework for analyzing items with ordered, polytomous
response categories, such as Likert scales or multi-step problems where partial credit is awarded [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>
        However, the advent of large-scale digital assessments, online learning platforms, and massive open
online courses (MOOCs) has generated datasets of unprecedented size and complexity [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This explosion
of data presents a significant computational challenge to traditional IRT estimation methods. Classical
approaches like Marginal Maximum Likelihood (MML) estimation via the Expectation-Maximization
(EM) algorithm become computationally intensive and may struggle with high-dimensional latent
spaces, while Bayesian methods using Markov Chain Monte Carlo (MCMC) sampling, though robust,
are often prohibitively slow for large datasets, hindering rapid model development and iteration [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ].
      </p>
      <p>
        In response to these scalability challenges, recent research has turned to Variational Inference (VI),
a technique from modern machine learning that reframes Bayesian inference as a fast and efficient
optimization problem [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In particular, Variational Autoencoders (VAEs) have been successfully applied
to dichotomous IRT models, demonstrating the ability to perform fast, scalable, and amortized inference,
allowing for the instantaneous estimation of abilities for new individuals without retraining the model
[
        <xref ref-type="bibr" rid="ref1 ref4 ref7">1, 4, 7</xref>
        ]. These works have established VAEs as a viable and powerful tool for large-scale psychometric
analysis.
      </p>
      <p>Despite these advances, the application of VAEs to the more complex Partial Credit Model has been
less explored. Furthermore, most existing VAE-IRT frameworks do not provide a principled way to
incorporate auxiliary respondent covariates (e.g., demographic or educational background), which
can offer valuable information for improving the precision of ability estimates. This paper aims to
fill these gaps by proposing a novel Variational Autoencoder framework specifically designed for the
Partial Credit Model (VA-PCM). Our contributions are threefold: (1) we develop a generative model that
faithfully represents the PCM for ordered categorical responses; (2) we integrate respondent covariates
as an auxiliary source of information to enhance latent ability estimation within a coherent probabilistic
graphical model; and (3) we formalize the corresponding amortized variational inference scheme and
derive the Evidence Lower Bound (ELBO) for optimization.</p>
      <p>This work lays the theoretical and methodological groundwork for applying deep generative modeling
to polytomous response data at scale. Our preliminary experiments on synthetic data demonstrate the
viability of the framework, showing its potential to recover item and person parameters, and setting the
stage for future large-scale empirical validation. Lastly, the VA-PCM represents a step towards building
more scalable, nuanced, and data-rich psychometric models for the modern era.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. Item Response Theory</title>
        <p>
          Item Response Theory (IRT) encompasses a family of mathematical models that describe the probabilistic
relationship between an individual’s unobserved latent trait(s) and their observed responses to items
on a test or questionnaire [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Unlike classical test theory, which focuses on aggregate test scores,
IRT models the interaction at the item level, providing a more granular and theoretically grounded
understanding of measurement.
        </p>
        <p>The core idea of IRT is to characterize both respondents and items with a set of parameters on
a common latent scale. A respondent is typically characterized by an ability parameter, denoted θ,
which represents their level on the latent trait being measured. An item is characterized by one or
more parameters, most commonly including a difficulty parameter (b), which indicates the ability level
required for a 50% chance of a correct response, and a discrimination parameter (a), which reflects how
well the item differentiates between individuals with different ability levels.</p>
        <p>In its most common forms, such as the one-, two-, and three-parameter logistic models (1PL, 2PL,
3PL), IRT defines the probability of a correct response to a dichotomous (correct/incorrect) item using a
logistic function. For example, the 2PL model is given by:</p>
        <p>P(correct | θ, a, b) = 1 / (1 + exp(−a(θ − b)))</p>
        <p>This function produces a characteristic “S”-shaped curve where the probability of a correct response
increases with the individual’s ability θ. The primary task in IRT is inference: the estimation of these
latent ability and item parameters from a matrix of observed responses.</p>
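As a concrete illustration (ours, not part of the paper's implementation), the 2PL response probability can be computed directly from the formula above using only the Python standard library:

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2PL model:
    P(correct | theta, a, b) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When ability equals difficulty, the probability is exactly 0.5,
# which is precisely how the difficulty parameter b is defined.
print(p_correct_2pl(theta=0.0, a=1.5, b=0.0))  # 0.5
```

The discrimination parameter a controls the steepness of the "S" curve around θ = b.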
        <p>While the foundational IRT models were developed for dichotomous data, many assessment contexts
involve responses that are not simply right or wrong but reflect varying degrees of proficiency or
endorsement. To handle such data, IRT was extended to polytomous models, which are designed for
items with multiple, ordered response categories. It is within this class of models that the Partial Credit
Model resides, providing a powerful tool for extracting nuanced information from ordered response
data.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. The Partial Credit Model</title>
        <p>
          The Partial Credit Model (PCM), first introduced by [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and formalized by [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], is a fundamental
psychometric model within IRT designed for analyzing responses to items with ordered categorical scores.
Unlike dichotomous models (e.g., the Rasch model) that classify responses as simply correct or
incorrect, the PCM accounts for situations where responses can reflect varying degrees of correctness or
proficiency, providing more nuanced insights into an individual’s performance or attitude. This makes
it particularly suitable for questionnaire items where responses are graded (e.g., essay scores, Likert
scales, multi-step problem-solving tasks) or when multiple-choice items allow for partial credit (i.e.,
fractional scoring). See Figure 1 for an example of such a scale.
        </p>
        <p>The model is defined by a set of item parameters, which are the item’s inherent characteristics. These
characteristics are typically represented by a set of “step difficulty” or “threshold” parameters for each
item. Generally speaking, the PCM is an ordered categorical model, where the logistic function of the
difference between the respondent’s latent ability and the item’s difficulty gives the probability of a
response.</p>
        <p>
          Let N denote the number of respondents and I the number of items in a questionnaire or test. For
each respondent n and item i:
• θ_n represents the unobserved latent ability (trait) of respondent n. This ability is assumed to lie
on a continuous scale.
• X_ni is the observed response of respondent n to item i on—without loss of generality—a scale of
integer values {0, …, m_i} ∋ X_ni, where m_i is the maximum possible score for item i. These categories
are assumed to be ordered, reflecting increasing levels of proficiency. In the PCM, it is presumed
that individuals with greater abilities tend to achieve higher scores on a given item. Nonetheless,
the PCM does not make any assumption that there is an underlying sequential step process to
achieve a score. Thus, it does not necessitate that a respondent must successfully complete all
tasks associated with lower score categories to attain success in higher-scoring tasks [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
• Ψ_i denotes the set of item parameters for item i. Specifically, Ψ_ij is the j-th step difficulty (or
threshold) parameter for item i, associated with moving from score category j − 1 to j ≤ m_i. Ideally,
these Ψ_ij values should be monotonically increasing for a given item (i.e., Ψ_i1 &lt; Ψ_i2 &lt; … &lt; Ψ_im_i),
indicating that it becomes progressively more difficult to achieve higher scores. This assumption
is only required for a psychometric interpretation, as unordered thresholds Ψ_i do not per se
violate the model’s mathematical formulation [9]. For identifiability of a PCM instance, the propensity
for the base category (score 0) is normalized to 1, which is equivalent to setting the first step
difficulty Ψ_i0 = 0 for all items. For simplicity, we assume that all items have the same number of
categories, so m_i = m ∀i.
        </p>
        <p>
          The PCM is formally derived by applying the dichotomous Rasch model [10] to adjacent pairs of
score categories for a polytomous item [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. For an item with m + 1 categories, the probability of scoring j
versus j − 1, given that the score is either j − 1 or j, is modeled as a dichotomous Rasch-like function:
        </p>
        <p>P(X_ni = j | X_ni = j − 1 or j, θ_n, Ψ_ij) = exp(θ_n − Ψ_ij) / (1 + exp(θ_n − Ψ_ij))</p>
        <p>This formulation highlights that each Ψ_ij can be interpreted as the ability level at which a respondent
has a 50% chance of scoring j rather than j − 1, when considering only these two adjacent categories.</p>
          <sec id="sec-2-2-2">
            <p>[Figure 2: PCM category response curves for a 5-point scale (category labels include Neutral, Agree,
and Strongly Agree), plotted as probability against latent ability θ. The example
uses step parameters Ψ = {0, −2, −1, 1, 2}, where Ψ0 = 0 by definition and the remaining values are increasing.
As θ increases, the likelihood of selecting higher response categories increases. Intersection points
between adjacent curves indicate category thresholds, making this plot a useful tool in IRT for evaluating item
functioning and category discrimination.]</p>
            <p>The overall PCM probabilities for each score category are then derived from these conditional probabilities;
hence, the probability of respondent n achieving a score k on item i according to the PCM is given by:</p>
            <p>P_PCM(X_ni = k | θ_n, Ψ_i) = exp(∑_{j=0}^{k} (θ_n − Ψ_ij)) / ∑_{h=0}^{m} exp(∑_{j=0}^{h} (θ_n − Ψ_ij))
(1)</p>
            <p>where k ∈ {0, …, m}. The numerator represents the cumulative “propensity” for achieving score k,
while the denominator serves as a normalization constant, summing over all possible score
categories for item i. Figure 2 depicts this function for a 5-point scale. From a computational perspective,
this formulation essentially applies a softmax-like function to the cumulative log-odds of attaining each
score category, with Ψ_ij serving as the ordered thresholds (or cutpoints) along the latent ability scale.</p>
          <p>This aligns with approaches used in probabilistic programming for ordinal regression, where observed
categories are drawn from a categorical distribution whose logits are derived from these structured
cumulative probabilities [11].</p>
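The softmax-style computation of Eq. 1 can be sketched in a few lines of NumPy (our own illustration; the threshold values reuse the 5-point example above):

```python
import numpy as np

def pcm_probs(theta: float, psi: np.ndarray) -> np.ndarray:
    """Category probabilities under the PCM (Eq. 1).

    psi holds the step difficulties (Psi_0, ..., Psi_m) with psi[0] = 0
    by convention; entry k of the result is P(X = k | theta, psi).
    """
    # Cumulative log-propensity for category k: sum_{j<=k} (theta - psi_j).
    logits = np.cumsum(theta - psi)
    # Softmax normalization over all m + 1 score categories.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

psi = np.array([0.0, -2.0, -1.0, 1.0, 2.0])  # example thresholds, Psi_0 = 0
probs = pcm_probs(theta=0.0, psi=psi)
print(probs.round(3))  # category probabilities for theta = 0, summing to 1
```

Subtracting the maximum logit before exponentiating is the standard numerical-stability trick for softmax.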
          <p>
            Broader details about the PCM, its variations, interpretations, and parameters’ meanings are outside
the scope of this work and are deferred to the literature [
            <xref ref-type="bibr" rid="ref3 ref12 ref9">3, 12, 9</xref>
            ].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Variational Inference for the PCM</title>
      <p>In real-world scenarios, given an N × I matrix of observed responses, we want to infer the ability of all N
people and the characteristics of all I items. Next, we provide a brief overview of inference in IRT.</p>
      <p>Traditional estimation methods for IRT models, such as Maximum Marginal Likelihood Estimation via
the Expectation-Maximization (EM) algorithm [13] or Markov Chain Monte Carlo (MCMC) methods [14],
have been foundational in psychometrics. However, these approaches face significant computational
challenges when applied to large-scale datasets, complex model structures (e.g., high-dimensional
latent spaces), or polytomous response formats like the PCM. The numerical integration required for
marginalization in EM can become intractable, and MCMC, while providing robust posterior estimates,
can be prohibitively slow, especially for estimating the large number of parameters involved in
large-scale assessments [15].</p>
      <p>
        Variational Inference (VI) has emerged from the machine learning and statistical communities as a
powerful alternative for approximate Bayesian inference [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. VI reframes inference as an optimization
problem, seeking a tractable distribution that best approximates the true, often intractable, posterior
distribution of latent variables. A key advantage of VI, particularly when coupled with amortized
inference (as seen in VAEs [16]), is its remarkable speed and scalability to massive datasets. This
efficiency is achieved by training neural networks to directly map observed data to the parameters of
the approximate posterior, making subsequent inference queries highly efficient. While VAE-based
approaches have shown promise for dichotomous IRT models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], their extension to the PCM, which
inherently models ordered categorical responses with multiple item thresholds, presents novel challenges
and opportunities.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Integrating Respondent Covariates into the Generative Model</title>
        <p>Our approach extends the PCM within a VAE framework to leverage auxiliary respondent information.
In psychometric applications, additional data about respondents, such as age, education level,
socioeconomic status, or gender, are frequently available. These respondent covariates x can provide valuable
insights into the latent abilities θ and enhance the precision of estimation. Rather than treating these
covariates as mere descriptive statistics, our generative model integrates them directly, assuming that
these observed features are, in part, determined by the underlying latent abilities. The idea is that any
association between a respondent’s features and their answers is spurious once the latent ability is known,
arising only through θ. Mathematically, the joint distribution of (x, θ) factorizes as:</p>
        <p>p(x, θ) = p(x | θ) p(θ)</p>
        <p>We can then define the relationships among these variables by means of a probabilistic graphical
model [17], as in Figure 3. This graphical structure implies a joint probability distribution over all
observed and latent variables, which can be factorized according to the local Markov properties:</p>
        <p>p(X, x, θ, Ψ) = p(X | θ, Ψ) p(x | θ) p(θ) p(Ψ)
(2)</p>
        <p>In this factorization:
• p(X | θ, Ψ) represents the PCM itself, describing the likelihood of observing a particular response
X_ni given the respondent’s latent ability θ_n and the item parameters Ψ_i, as previously defined in
Eq. 1.
• p(x | θ) models the relationship between the observed respondent features x_n and the latent
abilities θ_n. This component allows the model to leverage rich covariate information to inform
the estimation of θ_n. For instance, p(x | θ) could be a simple linear model for continuous x or a
logistic regression for categorical x.
• p(θ) is the prior distribution over the latent abilities. A common choice is a standard Gaussian
distribution, N(0, 1), reflecting an initial assumption of abilities centered around zero with unit
variance.
• p(Ψ) is the prior distribution over the item parameters Ψ_i = (Ψ_i0, …, Ψ_im). Standard practice
often uses independent Gaussian priors for each Ψ_ij. A key psychometric characteristic of
the PCM is the expectation that these thresholds should be ordered. Standard independent
Gaussian priors do not enforce this ordering, which can lead to ill-posed or uninterpretable
thresholds during estimation. To address this, we employ a more principled prior based on a
stick-breaking construction using the Dirichlet distribution [18]. Instead of defining a prior
on the thresholds directly, we define a prior on the proportions ΔΨ_i of the latent ability scale
partitioned by the thresholds. A Dirichlet distribution, Dir(α), is a natural choice for this, as
its samples are vectors of positive numbers that sum to one. We can then deterministically
transform these proportions into a set of ordered thresholds Ψ_i on the logit scale, for instance by
applying the inverse CDF of a standard normal distribution to the cumulative proportions,
Ψ_ij = Φ⁻¹(∑_{l≤j} ΔΨ_il), which are monotonically increasing by construction.</p>
        <p>Exact Bayesian inference requires the marginal likelihood p(X, x), which is obtained by marginalizing
out the latent variables from the joint:</p>
        <p>p(X, x) = ∫ p(X, x, θ, Ψ) dθ dΨ</p>
        <p>which is intractable and needs to be approximated.</p>
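The stick-breaking construction of ordered thresholds can be sketched with the Python standard library alone (our illustration; the symmetric Dirichlet concentration and the probit transform are assumptions consistent with the text, and the authors' exact transformation may differ in details):

```python
import random
from statistics import NormalDist

def ordered_thresholds(alpha, seed=0):
    """Sample proportions from Dirichlet(alpha), then map their cumulative
    sums through the standard-normal inverse CDF; the resulting thresholds
    are strictly increasing by construction."""
    rng = random.Random(seed)
    # Dirichlet draw via normalized Gamma variates (standard identity).
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    props = [g / total for g in gammas]
    # Cumulative proportions (excluding the final 1.0, which maps to +inf).
    cums, s = [], 0.0
    for p in props[:-1]:
        s += p
        cums.append(s)
    inv_cdf = NormalDist().inv_cdf  # Phi^{-1}
    return [inv_cdf(c) for c in cums]

psi = ordered_thresholds([1.0] * 5)  # 5 proportions -> 4 free thresholds
assert all(a < b for a, b in zip(psi, psi[1:]))  # ordering holds by construction
```

Because the cumulative proportions are strictly increasing in (0, 1) and Φ⁻¹ is monotone, no explicit ordering constraint is ever needed during optimization.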
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Variational Inference with Mean-Field Approximation</title>
        <p>The Local Markov property states that each random variable Z in a random variable set V is conditionally
independent of its non-descendants given its parent variables. We write:</p>
        <p>Z ⊥ (V ∖ de(Z)) | pa(Z)   for all Z ∈ V</p>
        <p>where de(Z) is the set of descendants, pa(Z) is the set of parents, and V ∖ de(Z) is the set of non-descendants
of Z. In our case, this can be expressed in the following relations:</p>
        <p>θ ⊥ Ψ;   x ⊥ Ψ | θ;   x ⊥ X | θ, Ψ</p>
        <p>In particular, from the latter two, it is clear that:</p>
        <p>p(x | θ, Ψ, X) = p(x | θ)
(3)</p>
        <p>p(X | θ, Ψ, x) = p(X | θ, Ψ)
(4)</p>
        <p>
          In Variational Inference, instead of directly computing the intractable true posterior p(θ, Ψ | X, x),
we introduce a simpler, tractable variational distribution q(θ, Ψ | X, x) that approximates it. The goal
is to find the q distribution that is “closest” to the true posterior, typically measured by minimizing
the Kullback-Leibler (KL) divergence KL(q(θ, Ψ | X, x) || p(θ, Ψ | X, x)). Minimizing the KL divergence is
equivalent to maximizing the Evidence Lower Bound (ELBO) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], which is a lower bound on the log marginal likelihood:
        </p>
        <p>ℒ_ELBO(q) = log p(X, x) − KL(q(θ, Ψ | X, x) || p(θ, Ψ | X, x))
(5)</p>
        <p>To ensure tractability and enable efficient optimization, we employ a mean-field
approximation for the variational distribution. This involves assuming that the approximate posterior factorizes
over the latent variables, which for our model we define as:</p>
        <p>q(θ, Ψ | X, x) = q_Ψ(Ψ | X) ∏_{n=1}^{N} q_θ(θ_n | X_n, x_n)
(6)</p>
        <p>The analogous form without covariates is q(θ, Ψ | X) = q_Ψ(Ψ | X) ∏_{n=1}^{N} q_θ(θ_n | X_n),
where q_Ψ and q_θ are the approximate posteriors for the item parameters and latent abilities, respectively.</p>
        <p>3.2.1. Amortized Inference and ELBO Optimization</p>
        <p>For q_θ and q_Ψ, we choose flexible, tractable distributions whose parameters are determined by neural
networks. This is known as amortized inference [19]. For q_θ(θ_n | X_n, x_n), we can assume that the traits
are normally distributed, and then we can model it as a Gaussian distribution N(μ_θ, Σ_θ), where the
mean and covariance are outputs of a neural network with parameters φ_θ that takes X and x as inputs:
(μ_θ, Σ_θ) = NN_φθ(X, x). For the item parameters q_Ψ(Ψ | X), we mirror the structure of our prior. Instead
of learning the parameters of a distribution over the thresholds directly, the inference network NN_Ψ
learns to output the concentration parameters of a Dirichlet distribution for each item, α_i = NN_Ψ(·).
The approximate posterior for the item-level proportions is thus q(ΔΨ_i) = Dir(α_i). The final ordered
thresholds Ψ_i are then obtained through the same deterministic transformation used in the generative
model. This symmetric design ensures that our inference process respects the crucial ordering property
of the PCM parameters, guiding the model towards psychometrically plausible solutions.</p>
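As a minimal, hypothetical illustration of amortized inference (the paper's NN_θ is a trained neural network; here a single random linear layer stands in for it), the encoder maps a concatenated (response, covariate) vector straight to the mean and log-variance of the Gaussian approximate posterior, and a sample is drawn via the reparameterization trick:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_covariates, latent_dim = 10, 3, 1  # illustrative sizes, not the paper's

# Hypothetical stand-in for NN_theta: one linear layer with random weights.
W = rng.normal(size=(2 * latent_dim, n_items + n_covariates))
b = np.zeros(2 * latent_dim)

def encode(responses: np.ndarray, covariates: np.ndarray):
    """Amortized inference: map observed data directly to posterior parameters."""
    h = W @ np.concatenate([responses, covariates]) + b
    mu, log_var = h[:latent_dim], h[latent_dim:]
    return mu, log_var

def reparameterize(mu, log_var):
    """Reparameterization trick: theta = mu + sigma * eps with eps ~ N(0, I),
    keeping the sample differentiable with respect to (mu, log_var)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

responses = rng.integers(0, 5, n_items).astype(float)
covariates = rng.normal(size=n_covariates)
mu, log_var = encode(responses, covariates)
theta_sample = reparameterize(mu, log_var)
```

Once trained, the same mapping serves any new respondent without re-running inference, which is the source of the speed advantage described above.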
        <p>The parameters of these neural networks, φ_θ and φ_Ψ, are learned by maximizing the ELBO in Eq. 5,
which becomes:</p>
        <p>ℒ_ELBO(q) = E_{q(θ,Ψ|X,x)} [log p(X, x, θ, Ψ) − log q(θ, Ψ | X, x)]
= E_{q_θ(θ|X,x) q_Ψ(Ψ|X)} [log p(X | Ψ, θ) + log p(x | θ) + log p(θ) + log p(Ψ)
− log q_θ(θ | X, x) − log q_Ψ(Ψ | X)]
= E_{q_θ(θ|X,x) q_Ψ(Ψ|X)} [log p(X | Ψ, θ)] + E_{q_θ(θ|X,x)} [log p(x | θ)]
− KL(q_θ(θ | X, x) || p(θ)) − KL(q_Ψ(Ψ | X) || p(Ψ))
(7)</p>
        <p>The ELBO consists of several terms, each with its own interpretation.</p>
        <p>The reconstruction term E_{q_θ(θ|X,x) q_Ψ(Ψ|X)} [log p(X | Ψ, θ)] measures how well the latent variables and item parameters, sampled
from their approximate posteriors, can reconstruct the observed responses, and encourages accuracy
in modeling the PCM. The auxiliary data likelihood term E_{q_θ(θ|X,x)} [log p(x | θ)] ensures that the
latent abilities inferred from responses are consistent with the observed respondent features. This is
how the auxiliary information x informs the latent ability estimates. Finally, the KL divergence terms
−KL(q_θ(θ | X, x) || p(θ)) − KL(q_Ψ(Ψ | X) || p(Ψ)) act as regularization terms, encouraging the approximate
posteriors to mimic their respective priors. Maximizing this ELBO with respect to the neural network
parameters φ_θ and φ_Ψ is achieved through stochastic gradient descent.</p>
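For the Gaussian q_θ against the standard-normal prior p(θ) = N(0, 1), the KL regularizer in Eq. 7 has a well-known closed form; the sketch below is our own illustration, not code from the paper:

```python
import numpy as np

def kl_gaussian_vs_standard_normal(mu: np.ndarray, log_var: np.ndarray) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions:
    0.5 * sum(exp(log_var) + mu^2 - 1 - log_var)."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

# The divergence vanishes exactly when the approximate posterior equals the prior.
print(kl_gaussian_vs_standard_normal(np.zeros(2), np.zeros(2)))  # 0.0
```

Having this term in closed form means only the reconstruction and auxiliary-likelihood expectations need Monte Carlo estimation during training.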
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Implementation</title>
      <p>To empirically validate our proposed Variational Autoencoder for the Partial Credit Model (VA-PCM),
we conducted a series of experiments on synthetic data. The primary objectives of these experiments are
to assess the model’s ability to accurately recover the ground-truth parameters of the generative process
and to detail the specific architectural and training methodologies required for a robust implementation.
This section outlines the implementation details of our model, the hyperparameters used, and the design
choices made to ensure stable and efficient training.</p>
      <sec id="sec-4-1">
        <title>4.1. Implementation Details and Model Architecture</title>
        <p>Our framework is implemented using PyTorch for neural network construction and gradient-based
optimization, alongside Pyro [11], a deep probabilistic programming language, for defining the probabilistic
model and performing variational inference.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.1.1. Role of Respondent Covariates (x)</title>
        <p>A key innovation of our framework is the principled integration of respondent covariates x. These
covariates play a dual role in our methodology: first, as a fundamental component of the generative story
that links latent abilities to observable characteristics, and second, as a critical source of information
for the inference process.</p>
        <p>In the Generative Process and Data Simulation Our generative model (Eq. 2) explicitly assumes
that respondent covariates x are influenced by their latent abilities θ, captured by the conditional
likelihood term p(x | θ). To create synthetic data that faithfully adheres to this assumption, we generate
covariates as a function of the ground-truth latent abilities. Specifically, we model this relationship as a
linear transformation of the latent traits with additive Gaussian noise:</p>
        <p>x_n = W θ_n + ε_n,   where ε_n ∼ N(0, σ²)</p>
        <p>Here, W is a weight matrix that defines the strength and nature of the relationship. This process ensures
that the generated covariates x_n contain a quantifiable signal about the latent traits θ_n.</p>
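This simulation step can be sketched in a few lines of NumPy (dimensions and noise level here are illustrative choices of ours, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(42)
n_respondents, latent_dim, n_covariates = 500, 1, 3
sigma = 0.5  # noise standard deviation

theta = rng.normal(size=(n_respondents, latent_dim))  # ground-truth abilities
W = rng.normal(size=(latent_dim, n_covariates))       # weight matrix
# x_n = W theta_n + eps_n, with Gaussian noise eps_n ~ N(0, sigma^2)
x = theta @ W + sigma * rng.normal(size=(n_respondents, n_covariates))

# Each covariate column carries a linear signal about theta, up to noise.
print(x.shape)  # (500, 3)
```

Scaling sigma up or down controls how informative the simulated covariates are about ability, which is useful for stress-testing the model.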
        <p>Within the VA-PCM’s generative model, this relationship is parameterized by a dedicated decoder
neural network NN_x. This network takes a sampled latent ability θ_n as input and outputs the parameters
(mean and variance) of a Gaussian distribution for x_n. During training, the observed covariates are
scored against this predicted distribution, contributing a likelihood term to the ELBO. This forces the
model to learn a latent representation θ that is not only capable of explaining the observed responses X,
but also the observed covariates x, thereby enforcing a powerful consistency constraint.</p>
        <p>In the Inference Process The primary purpose of integrating covariates is to enhance the precision of
latent ability estimation. This is achieved within the inference model, where the NN_θ network leverages
the covariates as a direct input. The encoder’s input for a given respondent n is a concatenation of their
response vector X_n and their covariate vector x_n.</p>
        <p>This design allows the inference network to fuse information from two distinct modalities: (1)
behavioral response patterns captured in X, and (2) contextual or demographic features contained in
x. By having access to both sources of evidence, the encoder can produce a more robust and precise
estimate of the latent ability posterior, q_θ(θ_n | X_n, x_n). For instance, if a respondent’s answers are ambiguous,
their covariates can provide an additional signal that helps to resolve the uncertainty in their estimated
ability, leading to more accurate inference.</p>
        <p>4.1.2. The Generative Model (Decoder)</p>
        <p>The generative model, or decoder, programmatically defines the joint distribution outlined in Eq. 2. The
priors for the latent variables are specified first. The latent abilities θ for each respondent are drawn
from a standard normal prior, p(θ) = N(0, 1). For the item thresholds Ψ, we adopt the method described
above to enforce the ordering constraint Ψ_i1 &lt; Ψ_i2 &lt; ⋯ &lt; Ψ_im, which is crucial for psychometric
interpretability. Specifically, we sample a vector of proportions for each item from a Dirichlet prior,
p(ΔΨ_i) = Dir(α), and then deterministically transform these proportions into a set of ordered thresholds
on the logit scale using the inverse CDF of a standard normal distribution. This generative process
guarantees that the sampled thresholds are correctly ordered by construction.</p>
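Putting the pieces together, the decoder's response-sampling path (ability and thresholds in, one categorical draw out) can be sketched as follows; this is our own NumPy illustration of Eq. 1's sampling step, not the paper's Pyro code:

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_response(theta: float, psi: np.ndarray) -> int:
    """Draw one PCM response: softmax over cumulative logits (Eq. 1),
    then a categorical draw over the m + 1 score categories."""
    logits = np.cumsum(theta - psi)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(len(psi), p=p))

psi = np.array([0.0, -2.0, -1.0, 1.0, 2.0])  # ordered thresholds, Psi_0 = 0
scores = [sample_response(1.0, psi) for _ in range(1000)]
# Higher-ability respondents tend toward higher categories on this item.
```

In the actual model, the categorical draw would be a Pyro observed sample site so that the likelihood of real responses enters the ELBO.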
        <p>The likelihood functions for the observed data are conditioned on these latent variables. The
relationship between covariates and abilities, ( | ) , is modeled by a dedicated decoder neural network
NN , which takes a sampled latent ability   as input and outputs the parameters (mean and variance) of
a Gaussian distribution from which the observed covariates   are assumed to be drawn. The response
likelihood, (| , Ψ) , is implemented according to the PCM formula (Eq. 1), where the logits for the
categorical distribution of responses are computed from the sampled abilities  and item thresholds Ψ.
4.1.3. The Inference Model (Encoder)
The inference model, or encoder, specifies the variational distribution ( , Ψ|,  ) that approximates
the true posterior. This is achieved through amortized inference, where neural networks learn to map
observed data directly to the parameters of the approximate posterior distributions.
• Ability Inference (θ_n): The NN_θ network infers each respondent’s latent ability. Its input is the
concatenation of the respondent’s covariate vector x_n and an embedding of their full response
vector y_n. To handle the categorical nature of responses, each item’s response is passed through
a dedicated embedding layer before concatenation. The network then outputs the mean and
log-variance of the Gaussian approximate posterior for θ_n.
• Item Parameter Inference (Ψ): The NN_Ψ network infers the parameters for each item. This
network exemplifies one of the key complexities in modeling item-level parameters in an amortized
fashion. Since item parameters are global, the network must summarize the information from
all responses given to a specific item. For each item i, we compute a feature vector consisting of
summary statistics (mean and standard deviation of responses) and the empirical distribution of
response categories. This feature vector is then fed into the network to produce the concentration
parameters of a Dirichlet distribution, which serves as the approximate posterior for the item’s
threshold proportions, q(ΔΨ_i | y) = Dir(α_i). This symmetric design, where the variational posterior
for the proportions mirrors the Dirichlet prior, ensures that the inferred thresholds also adhere to
the ordering constraint.</p>
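The item-level feature vector fed to NN_Ψ can be sketched as follows (a minimal stdlib-Python illustration; the function name and toy responses are ours):

```python
import math
from collections import Counter

def item_features(responses, n_categories):
    """Summarize all responses to one item into a fixed-size input
    for the item inference network: mean, standard deviation, and
    the empirical distribution over response categories."""
    n = len(responses)
    mean = sum(responses) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in responses) / n)
    counts = Counter(responses)
    dist = [counts.get(k, 0) / n for k in range(n_categories)]
    return [mean, std] + dist

feats = item_features([0, 1, 1, 2, 4, 4, 3, 2], n_categories=5)
assert len(feats) == 2 + 5   # summary statistics + category distribution
```

Because the feature vector has a fixed size regardless of the number of respondents, a single network can amortize inference over items with arbitrarily many responses.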
        <p>The training process involves optimizing the ELBO via Pyro’s Stochastic Variational Inference (SVI)
engine, as detailed in Algorithm 1. This framework handles the complexities of the model, such as
computing gradients through stochastic latent variables (via the reparameterization trick) and managing
the interplay between the probabilistic model and the deep neural networks.</p>
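Two of the quantities handled by the SVI engine can be written out explicitly; the following is an illustrative plain-Python sketch of the reparameterized Gaussian sample and its closed-form KL term, not Pyro's internal implementation:

```python
import math
import random

def sample_gaussian_reparam(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1),
    so the sample is a deterministic function of (mu, log_var) and
    gradients of the ELBO can flow through the sampling step."""
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def kl_gaussian_std_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for one dimension:
    0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * (math.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = random.Random(0)
z = sample_gaussian_reparam(mu=0.3, log_var=-1.0, rng=rng)
assert kl_gaussian_std_normal(0.0, 0.0) == 0.0  # q equals the prior
```

In the full model these two pieces correspond to sampling θ_n from the approximate posterior and to the latent-ability KL term of the ELBO.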
      </sec>
      <sec id="sec-4-3">
        <title>4.2. Experimental Setup and Hyperparameters</title>
        <p>Our experiments are conducted on synthetic data generated according to the process described in the
previous section. This allows us to control the ground truth and systematically evaluate our model’s
performance.
4.2.1. Training Procedure
The model is trained end-to-end by maximizing the ELBO using mini-batch stochastic gradient descent.
We employ the ClippedAdam optimizer [20], a variant of Adam with gradient clipping, which we found
to provide additional stability during training. To enhance convergence and prevent common failure
modes in VAEs, we incorporate several best-practice techniques:
• KL Annealing: In the initial phases of training, the KL divergence term in the ELBO can
overwhelm the reconstruction loss, causing the approximate posterior to collapse onto the prior
(i.e., the “posterior collapse” problem). To mitigate this, we use KL annealing, where the KL
terms are multiplied by a coefficient β that is gradually increased from a small initial value (e.g.,
β = 0.01) to its final value of β = 1.0 over a set number of training epochs. This allows the model
to first focus on learning to reconstruct the data before being strongly regularized by the prior.
• Learning Rate Scheduling: We use a learning rate scheduler that gradually decays the learning
rate after each epoch. This helps the optimizer take smaller steps as it approaches a minimum,
leading to finer convergence.
• Early Stopping and Dropout: To prevent overfitting and reduce unnecessary computation, we
employ early stopping (halting training after 50 epochs without improvement) and dropout
(p = 0.1).
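The annealing coefficient can be sketched as a simple schedule; the linear shape below is an illustrative assumption, since only the start/end values and a gradual increase are specified above:

```python
def kl_annealing_beta(epoch, warmup_epochs, beta_start=0.01, beta_end=1.0):
    """Linear KL-annealing schedule: the KL weight grows from beta_start
    to beta_end over warmup_epochs, then stays at beta_end."""
    if epoch >= warmup_epochs:
        return beta_end
    frac = epoch / warmup_epochs
    return beta_start + frac * (beta_end - beta_start)

# the KL terms of the ELBO are multiplied by this coefficient each epoch
betas = [kl_annealing_beta(e, warmup_epochs=100) for e in range(201)]
assert betas[0] == 0.01 and betas[100] == 1.0 and betas[200] == 1.0
```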
4.2.2. Hyperparameter Configuration
The specific hyperparameters for our neural network architectures and training procedure were selected
based on common practices for VAEs and preliminary experimentation. The key settings used in our
experiments are summarized in Table 1.</p>
        <p>Algorithm 1 VI-PCM Forward Pass
Require: Observed responses Y ∈ ℝ^(N×I), observed respondent features X ∈ ℝ^(N×dim(x))
Initialize overall ELBO: ℒ_ELBO ← 0
Initialize neural networks: NN_θ, NN_Ψ, and the covariate decoder NN_x
{— Part 1: Infer Item Parameters from Responses —}
Initialize item parameter KL divergence: KL_Ψ ← 0
for i = 1 … I do
  Extract all responses for item i: y_·i = (y_1i, … , y_Ni)
  Compute variational parameters for item i’s threshold proportions: α_i = NN_Ψ(y_·i)
  Sample item proportions ΔΨ_i ∼ Dirichlet(α_i) {Sample from q_Ψ}
  Deterministically transform proportions ΔΨ_i to ordered thresholds Ψ_i
  Compute KL term for item i: KL_Ψ,i = KL(Dirichlet(α_i) || p(ΔΨ_i))
  KL_Ψ ← KL_Ψ + KL_Ψ,i
end for
ℒ_ELBO ← ℒ_ELBO − KL_Ψ
{— Part 2: Infer Latent Abilities and Compute Likelihoods for each Respondent —}
for n = 1 … N do
  Extract data for respondent n: response vector y_n and covariate vector x_n
  {Use inference network NN_θ to get parameters for q(θ_n | y_n, x_n)}
  Compute variational parameters for latent ability θ_n: (μ_n, Σ_n) = NN_θ(y_n, x_n)
  Sample latent ability θ_n ∼ N(μ_n, Σ_n) {Sample from q_θ}
  {Compute the three components of the ELBO for respondent n}
  Response Reconstruction Term: ℒ_Recon,n = Σ_{i=1}^{I} log p_PCM(y_ni | θ_n, Ψ_i) {Using the sampled θ_n and Ψ_i}
  Auxiliary Data Likelihood Term: (μ_x, Σ_x) = NN_x(θ_n); ℒ_Aux,n = log N(x_n | μ_x, Σ_x) {Use the covariate decoder NN_x to parameterize the likelihood p(x_n | θ_n) and score the observed x_n under it}
  Latent Ability KL Divergence: KL_θ,n = KL(N(μ_n, Σ_n) || N(0, 1))
  {Update total ELBO with respondent’s contribution}
  ℒ_ELBO ← ℒ_ELBO + ℒ_Recon,n + ℒ_Aux,n − KL_θ,n
end for
return ℒ_ELBO and backpropagate</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.3. Preliminary Results on Synthetic Data</title>
        <p>To provide an initial proof-of-concept and assess the fundamental viability of our proposed VA-PCM
framework, we conducted a preliminary experiment using synthetic data. The data were generated
according to the process detailed in Section 4, with a configuration designed to represent a common,
moderately-sized scenario: N = 1000 respondents, I = 10 items, a single latent dimension, a single
covariate dimension, and K = 5 response categories per item.</p>
        <p>After training the VA-PCM model on this dataset, we evaluated its ability to recover the ground-truth
latent abilities θ and item threshold parameters Ψ. The results of this parameter recovery analysis are
summarized in Table 2.</p>
        <p>The analysis of the item parameter recovery is highly encouraging. We observe a strong positive
correlation of 0.750 between the estimated and true item thresholds. This indicates that the model is
successfully capturing the relative ordering and difficulty of the item steps. Furthermore, the R² score
of 0.498 suggests that our model can account for approximately 50% of the variance in the true item
parameters, a respectable result given the complexity of the model and the size of the dataset.</p>
        <p>In contrast, the recovery of the respondent latent abilities presents a notable challenge in this initial
experiment. The correlation between the estimated and true abilities is modest at 0.344, indicating a
positive but weaker association. More significantly, the negative R² score of -0.007 reveals that the
model’s predictions for latent ability are, on average, less accurate than simply using the mean of the
true abilities as a prediction. This highlights the inherent difficulty of person parameter estimation,
particularly with shorter test lengths (only 10 items), where the amount of data available for any
single respondent is limited. Nonetheless, these preliminary findings serve as a crucial benchmark and
motivate more extensive experimental plans.</p>
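The recovery metrics reported above follow their standard definitions; a minimal sketch (the toy numbers are illustrative, not our experimental data):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between estimated and true parameters."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; negative when the predictions are
    worse than always predicting the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

truth = [-1.2, -0.4, 0.1, 0.8, 1.5]
assert r2_score(truth, truth) == 1.0                # perfect recovery
mean_baseline = [sum(truth) / len(truth)] * len(truth)
assert abs(r2_score(truth, mean_baseline)) < 1e-12  # R^2 = 0 at the mean
```

This makes the reported -0.007 concrete: an R² just below zero means the ability estimates perform marginally worse than the constant mean-of-truth baseline.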
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper, we have introduced a novel variational inference framework for the Partial Credit
Model (VA-PCM), designed to address the scalability limitations of traditional psychometric estimation
methods while providing a principled mechanism for integrating respondent covariates. We established
a solid probabilistic framework, formalizing the relationships between responses, covariates, and latent
variables within a generative graphical model. By leveraging amortized inference via neural networks
and employing psychometrically-informed priors and variational families that respect the ordered
nature of PCM parameters, we have outlined a complete and feasible methodology for applying modern
deep generative modeling to polytomous response data.</p>
      <p>Our preliminary experiments on synthetic data offer a promising, albeit mixed, proof-of-concept.
The model demonstrates a strong capacity to recover the underlying structure of item parameters,
which is a critical requirement for any psychometric model. However, the results also underscore the
well-known challenges of accurately estimating individual-level abilities from limited data, a task that
remains difficult even with the inclusion of covariates.</p>
      <p>This work is primarily a theoretical and methodological contribution, intended to lay the groundwork
for future research. We acknowledge that our experimental evaluation is preliminary. A comprehensive
validation of the VA-PCM will require a more extensive set of experiments, which, while essential, are
both computationally and temporally demanding. Key directions for future research include carrying
out more experiments in different settings, with a particular focus on testing the model on larger datasets
and under varying conditions of missing data. It is also essential to conduct a rigorous comparison of the
VA-PCM against established psychometric baselines, such as MML-EM and MCMC-based methods, to
benchmark its performance in terms of both accuracy and computational efficiency. Another important
direction is the application of the framework to a large-scale, real-world dataset, such as PISA or
TIMSS, to demonstrate its practical utility, scalability, and the interpretability of its findings in authentic
educational contexts. Finally, given the flexibility of the proposed framework, future work could explore
model extensions, such as incorporating multi-dimensional latent traits to capture more complex
cognitive structures, or experimenting with advanced neural architectures and prior distributions to
enhance estimation accuracy, such as normalizing flows.</p>
      <p>Ultimately, this work aims to bridge the gap between traditional psychometrics and modern deep
generative modeling. By demonstrating how the Partial Credit Model can be robustly integrated into
a VAE framework, complete with covariates and principled handling of ordered data, we hope to
foster further interdisciplinary research and pave the way for more scalable, nuanced, and data-rich
psychometric analyses.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank Prof. Barbara Hammer, Prof. Benjamin Paassen, and Amajor S.p.A. for the
valuable support and inspiration to conduct this research. We acknowledge the support of the PNRR
project FAIR - Future AI Research (PE00000013), Concession Decree No. 1555 of October 11, 2022, CUP
C63C22000770006 and of the project “Future AI Research (FAIR) - Spoke 2 Integrative AI - Symbolic
conditioning of Graph Generative Models (SymboliG)” funded by the European Union under the National
Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.3 - Call for tender No.
341 of March 15, 2022 of the Italian Ministry of University and Research – NextGenerationEU, Code
PE0000013, Concession Decree No. 1555 of October 11, 2022, CUP C63C22000770006.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have employed Generative AI tools for proofreading and improving the readability of
figures and tables.</p>
      <p>[9] R. J. Adams, M. L. Wu, M. Wilson, The Rasch rating model and the disordered threshold controversy,
Educational and Psychological Measurement 72 (2012) 547–573.
[10] G. Rasch, Probabilistic models for some intelligence and attainment tests., ERIC, 1993.
[11] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip,
P. Horsfall, N. D. Goodman, Pyro: Deep universal probabilistic programming, Journal of Machine
Learning Research (2018). See also the ordinal regression tutorial: https://num.pyro.ai/en/stable/
tutorials/ordinal_regression.html.
[12] R. J. Adams, M. L. Wu, The Mixed-Coefficients Multinomial Logit Model: A Generalized Form
of the Rasch Model, in: Multivariate and Mixture Distribution Rasch Models: Extensions and
Applications, Statistics for Social and Behavioral Sciences, Springer, 2007, pp. 57–75. doi:10.1007/978-0-387-49839-3_4.
[13] R. D. Bock, M. Aitkin, Marginal maximum likelihood estimation of item parameters: Application
of an EM algorithm, Psychometrika 46 (1981) 443–459.
[14] J.-S. Kim, D. M. Bolt, Estimating item response theory models using Markov chain Monte Carlo
methods, Educational Measurement: Issues and Practice 26 (2007) 38–51.
[15] L. Cai, A two-tier full-information item factor analysis model with applications, Psychometrika
75 (2010) 581–612.
[16] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, M. Welling, Semi-supervised learning with deep
generative models, Advances in neural information processing systems 27 (2014).
[17] D. Koller, N. Friedman, Probabilistic graphical models: principles and techniques, MIT Press, 2009.
[18] M. Betancourt, Ordinal regression case study, 2019. URL: https://betanalpha.github.io/assets/case_
studies/ordinal_regression.html, section 2.2.
[19] S. Gershman, N. Goodman, Amortized inference in probabilistic reasoning, in: Proceedings of the
annual meeting of the cognitive science society, volume 36, 2014.
[20] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980
(2014).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Domingue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Piech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <article-title>Variational item response theory: Fast, accurate, and expressive</article-title>
          , arXiv preprint arXiv:2002.00276 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G. N.</given-names>
            <surname>Masters</surname>
          </string-name>
          ,
          <article-title>A rasch model for partial credit scoring</article-title>
          ,
          <source>Psychometrika</source>
          <volume>47</volume>
          (
          <year>1982</year>
          )
          <fpage>149</fpage>
          -
          <lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Tam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-H.</given-names>
            <surname>Jen</surname>
          </string-name>
          , Partial Credit Model, in: M. Wu, H. P. Tam, T.-H. Jen (Eds.),
          <source>Educational Measurement for Applied Researchers: Theory into Practice</source>
          , Springer, Singapore,
          <year>2016</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>185</lpage>
          . doi:10.1007/978-981-10-3302-5_9.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Curi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Converse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hajewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <article-title>Interpretable variational autoencoders for cognitive models</article-title>
          , in: 2019
          <source>international joint conference on neural networks (ijcnn)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          , G. Xu,
          <article-title>Variational estimation for multidimensional generalized partial credit model</article-title>
          , Psychometrika
          <volume>89</volume>
          (
          <year>2024</year>
          )
          <fpage>929</fpage>
          -
          <lpage>957</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kucukelbir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>McAuliffe</surname>
          </string-name>
          ,
          <article-title>Variational inference: A review for statisticians</article-title>
          ,
          <source>Journal of the American statistical Association</source>
          <volume>112</volume>
          (
          <year>2017</year>
          )
          <fpage>859</fpage>
          -
          <lpage>877</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Paaßen</surname>
          </string-name>
          , M. Dywel,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fleckenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pinkwart</surname>
          </string-name>
          ,
          <article-title>Sparse factor autoencoders for item response theory</article-title>
          .,
          <source>International Educational Data Mining Society</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Muraki</surname>
          </string-name>
          ,
          <article-title>A generalized partial credit model: Application of an EM algorithm</article-title>
          ,
          <source>Applied Psychological Measurement</source>
          <volume>16</volume>
          (
          <year>1992</year>
          )
          <fpage>159</fpage>
          -
          <lpage>176</lpage>
          . doi:10.1177/014662169201600206.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>