<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adaptive Learning Control via Proximal Policy Optimization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Natalia Axak</string-name>
          <email>nataliia.axak@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maksym Kushnaryov</string-name>
          <email>maksym.kushnarov@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrii Tatarnykov</string-name>
          <email>andrii.tatarnykov@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>Nauky Ave. 14, Kharkiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents a reinforcement-learning (RL) framework in which an intelligent tutoring system (ITS) acts as the agent and the student is modelled as the environment. A custom OpenAI Gym simulation captures key cognitive and behavioral parameters (decision time, help-request frequency, task accuracy, etc.). Three instructional strategies are compared under identical conditions: a rule-based tutor, Deep Q-Network (DQN), and Proximal Policy Optimization (PPO). PPO converges within 10-15 iterations and attains up to 12 times higher cumulative reward than DQN. Relative to the rule-based tutor (help-request rate = 0.40 req/task, task accuracy = 0.70), PPO lowers the help-request rate and raises task accuracy to 0.83 (+18 %). To verify that these simulated gains transfer to authentic data, we replayed the learned policies on 0.9 million interaction logs from the public ASSISTments-2017 dataset. PPO achieved a +17 % improvement in NDCG for post-test accuracy and a +4.4 % increase in inverse-propensity-scored reward over the same rule-based baseline, corroborating the simulation results. These findings demonstrate that PPO enables robust, data-efficient personalization and can overcome the limitations of static e-learning courses, paving the way for next-generation adaptive tutoring systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Reinforcement learning</kwd>
        <kwd>Proximal Policy Optimization</kwd>
        <kwd>Deep Q-Network</kwd>
        <kwd>adaptive learning</kwd>
        <kwd>agent-based modeling</kwd>
        <kwd>intelligent tutoring systems</kwd>
        <kwd>personalized education</kwd>
        <kwd>decision-making models</kwd>
        <kwd>e-learning environments</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Digital education systems, including MOOCs, face persistent challenges such as low engagement and
completion rates, often below 20%. A major cause is the uniform design of e-learning courses, which
fails to meet individual learner needs, leading to reduced motivation and early dropout.</p>
      <p>Personalization through adaptive control powered by Reinforcement Learning (RL) offers a
solution. Unlike rigid rule-based methods, RL dynamically adjusts instructional interventions based
on continuous learner feedback. Modeling the learner-tutor interaction as a Markov Decision Process
(MDP), RL agents optimize task difficulty, assistance timing, and feedback to enhance engagement
and performance.</p>
      <p>Conventional algorithms like Deep Q-Network (DQN) require large datasets and often exhibit
instability, limiting their use in real educational contexts. To overcome these drawbacks, we apply
Proximal Policy Optimization (PPO), an advanced policy-gradient method known for stable learning
under sparse data conditions.</p>
      <p>The goal of this study is to evaluate the effectiveness of a PPO-driven adaptive tutor in improving
learners' decision-making skills within a simulated environment. The research contributes by:
formalizing an MDP tailored to adaptive tutoring; implementing a PPO-based agent optimized for
stable learning with limited data; and comparing the approaches in terms of accuracy, speed, and
learning quality.</p>
      <p>This work advances intelligent educational control technologies by addressing the limitations of
static e-learning systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and related works</title>
      <p>AI agents such as intelligent tutoring systems (ITS), chatbots, and virtual assistants are becoming
more common in universities and online education. Their rise is driven by rapid AI advances and
broader adoption of tools like Duolingo and Khan Academy, which use AI to tailor learning for
millions [1]. Colleges and online programs now use AI to offer 24/7 support, boost instructor
presence, and provide personalized feedback without increasing workload [2]. This review
summarizes studies from 2019 to 2024, focusing on: (1) theoretical models for designing AI agents,
(2) practical uses in education, and (3) research on their impact on learning, engagement, and
perception.</p>
      <sec id="sec-2-1">
        <title>2.1. Theoretical Models and Agent-Based Systems</title>
        <p>Scholars have proposed several frameworks for AI in education. One model distinguishes three roles:
AI-directed (behaviorism), where AI leads instruction; AI-supported (cognitivism), where AI assists
teachers; and AI-empowered (constructivism), where students drive learning with AI support [3].
Broader technology theories also apply: Tarisayi combines TAM, Diffusion of Innovation, and
TPACK to analyze AI adoption [4]. The concept of human AI hybrid adaptivity emphasizes shared
responsibility between teachers and AI, where AI personalizes content while teachers provide
motivation [5].</p>
        <p>Agent-based learning platforms further enhance personalization. Examples include systems that
adapt to learner traits and decision patterns [6], models evaluating strategies under information
overload [7], and a 2024 monograph on autonomous agents that track progress and adjust the
environment in real time [8]. Data-driven feedback loops also optimize learning paths; for instance,
RL agents that adjust task difficulty based on accuracy (Pc in Eq. 4, 6) and help requests (Fh in Eq. 4,
6) improve completion rates by 22% over static rules [9]. However, many rely on simplistic rewards
(e.g., quiz scores), neglecting long-term skill retention.</p>
        <p>Our approach extends these efforts by integrating cognitive skill tracking (critical thinking Ct,
risk assessment Ra) into the state space and by applying PPO for stable policy learning with limited
data.</p>
        <p>This addresses the limitation noted by [12], where DQN-based tutors failed to scale beyond binary
feedback.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Practical Applications and Student Perceptions</title>
        <p>AI agents are widely used in higher and online education, supporting tutoring, feedback, and
assistance roles. Intelligent Tutoring Systems (ITS) provide personalized guidance and instant
feedback, with apps like Duolingo and Khan Academy adapting content for learners of all ages [1].
Conversational agents, such as Jill Watson at Georgia Tech, answer questions and automate
announcements, reducing instructor workload [10]. Many universities employ chatbots for 24/7
support and interactive dialogue, delivering feedback similar to one-on-one sessions [2], [11].</p>
        <p>Some chatbots also act as learning coaches, prompting study planning, encouraging reflection,
and detecting when help is needed. Studies report positive effects: AI tutors improve practice and
classroom performance [1], while AI-supported learners, including those with learning difficulties,
demonstrate greater use of self-regulated strategies and significant gains [12]. Engagement data
shows students often use AI tutors in bursts, particularly before exams, which correlates with
improved outcomes [1]. Student feedback is generally favorable; learners find chatbots helpful and
report that they support self-regulated learning habits [14].</p>
        <p>Instructors value time savings but emphasize the need for accurate, reliable AI [11].</p>
        <p>However, results are mixed. Some research shows minimal improvement in perceived instructor
presence after adding a virtual TA [15], and concerns remain about AI errors, transparency,
overreliance, and data privacy [11]. These findings suggest AI agents can enhance learning, but effective
design and careful implementation are essential.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Reinforcement Learning and Decision Models</title>
        <p>Reinforcement learning (RL) is a framework for modeling sequential decision-making where agents
learn policies that maximize cumulative reward through trial-and-error, a principle applied in both
artificial and biological systems [16], [17]. Recent advances address challenges such as value function
approximation, unstable training, and exploration exploitation trade-offs. Deep learning enhances
generalization but reduces theoretical guarantees, prompting hybrid approaches that maintain
stability under data constraints [18], [19].</p>
        <p>Efficient exploration is crucial in sparse-reward domains like education. Algorithms such as
Upper Confidence Bound (UCB) [20], Thompson sampling [21], and Bayesian optimization [22]
balance exploration and exploitation. Hierarchical and meta-RL introduce temporal abstraction and
rapid adaptation [23], with evidence linking these mechanisms to orbitofrontal cortex functions [24], [25].</p>
        <p>RL also integrates with Bayesian inference. Bayesian RL improves uncertainty handling by
combining explicit belief modeling with model-free value learning [26]-[29]. Resource-rational RL
models cognitive heuristics (e.g., Win-Stay-Lose-Shift) as efficient approximations under limited
resources, constraining policy complexity to mirror real-world decision-making [28], [30], [31].</p>
        <p>Modern RL thus blends insights from cognitive science, neuroscience, and probabilistic modeling
to create adaptive agents capable of efficient learning and generalization essential for intelligent
educational systems, as demonstrated in our PPO-based tutoring framework.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Hybrid Decision Models and Integration</title>
        <p>Reinforcement learning (RL) and evidence accumulation models, such as the drift-diffusion model
(DDM), offer complementary views of decision-making. RL explains how agents learn action values
from rewards, while DDM simulates how noisy evidence accumulates until reaching a decision
threshold. Integrating these models improves understanding of both learning and real-time
decisions.</p>
        <p>The Reinforcement Learning Drift-Diffusion Model (RLDDM) combines Q-learning with a DDM
mechanism, where larger value differences lead to faster, more confident choices, outperforming
standalone RL or DDM [32]. Dual-system models extend this by integrating habitual, model-free RL
with deliberative, DDM-like processes. Evidence shows the brain shifts reliance between systems
depending on context [33], explaining differences in decision styles.</p>
        <p>Hybrid RL also merges model-free and model-based strategies, using Bayesian arbitration or
meta-control to switch adaptively. Lei and Solway [33] note that strong habits can suppress planning,
highlighting system competition. In AI, systems like AlphaGo combine deep RL with planning
(Monte Carlo Tree Search), reflecting bounded rationality and aligning with the expected value of
control theory.</p>
        <p>RL also integrates with probabilistic inference. Bayesian RL maintains beliefs over models and
updates them as data arrive, enabling exploration via Bayes-adaptive MDPs. Approximate methods
such as particle filtering and variational inference, as well as active inference, support this
integration. Practical algorithms like Thompson sampling, Variational RL (VAR), and BEAR enhance
data efficiency and robustness under uncertainty [34].</p>
        <p>Studies show that the choice of algorithm depends on the specifics of the task [37]. DQN
demonstrates better performance in controlled environments with discrete decisions [38], while PPO
proves to be more versatile across diverse educational scenarios. Experimental results indicate that
PPO achieves higher training stability (95.1% vs. 91.6% for A3C in complex environments) [37],
whereas A3C exhibits the fastest convergence due to parallel learning [39].</p>
        <p>In summary, combining RL with evidence accumulation, probabilistic reasoning, and cognitive
control advances AI performance and explains adaptive behavior. These models inform educational
technologies, where our framework addresses key challenges such as oversimplified state
representations and unstable learning.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology and Learning Environment Modeling</title>
      <p>Our framework follows the standard reinforcement learning paradigm [35], where the agent is the
intelligent tutoring system (PPO/DQN algorithm) that selects instructional actions, and the
environment is the simulated student whose behavior generates states and rewards.</p>
      <p>This distinction ensures proper alignment with RL theory, where the agent actively learns while
the environment reacts to its actions.</p>
      <sec id="sec-3-1">
        <title>3.1. Objective of the Study</title>
        <p>This study aims to design and validate an agent-based reinforcement learning (RL) framework for
adaptive e-learning systems, focusing on optimizing personalized learning trajectories. Specifically,
we compare the effectiveness of Proximal Policy Optimization (PPO) and Deep Q-Network (DQN)
algorithms in dynamically adjusting task difficulty, feedback timing, and instructional strategies to
maximize student engagement (measured by help-request frequency) and knowledge retention
(measured by post-test accuracy). The proposed approach addresses limitations of static tutoring
systems by enabling real-time adaptation to individual cognitive profiles, as demonstrated in our
simulated environment.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. A model for acquiring decision-making skills in education using intelligent agents</title>
        <p>Reinforcement learning makes it possible to build an agent that adapts its strategy through
interaction with the environment. Figure 1 illustrates the flow of information in the RL framework:
the tutor agent selects actions (task difficulty, hints, motivation), the student environment responds
by generating states and rewards, and the state transition function updates the student's parameters.
Transitions between states follow three rules: (1) if the student completes the task, the student's
knowledge and skills increase; (2) the task difficulty decreases if the student asks for help frequently;
(3) if the student uses structured methods, the validity of the student's choices increases.</p>
        <p>The environment imposes constraints such as a limited number of hints or limited time to
complete tasks.</p>
        <p>Although the student makes decisions (e.g., whether to request help), these decisions are part of
the environment's dynamics; the agent's task is to learn how to influence them.</p>
        <p>The difficulty of the tasks varies depending on the student's performance.</p>
        <p>The environment is defined by a tuple</p>
        <p>E = (S, A, P, R), (1)
where S is the set of environment states; A is the set of possible actions of the agent;
P: S × A × S → [0, 1] is the transition probability function between states; and R: S × A → ℝ is the
reward function.</p>
        <p>Our framework (Eq. 1-2) formalizes the tutor-student interaction as an RL problem in which the
agent (tutor) selects instructional actions.</p>
        <p>The environment (student) generates states (e.g., skill levels) and rewards (e.g., accuracy
improvements).</p>
        <p>This separation mirrors established RL benchmarks where the environment (e.g., game physics
in Atari or robot dynamics in MuJoCo) responds to the agent's actions while remaining distinct from
the decision-making policy [35].</p>
        <p>Unlike rule-based systems, this approach enables adaptive decision-making under uncertainty.
The transition function P determines the probability that the environment moves from a state s ∈ S
into a new state s′ ∈ S after the agent performs an action a ∈ A(s):</p>
        <p>P: S × A × S → [0, 1], (3)
or, as a conditional probability, P(s′ | s, a) = Pr(S_{t+1} = s′ | S_t = s, A_t = a), where Pr(·) denotes the
probability that the corresponding event occurs.</p>
        <p>The transition function models the dynamics of changes in the student's educational state in
response to the actions of the learning agent.</p>
        <p>The logic of the transitions is as follows:</p>
        <p>If the action a = "provide a hint" is taken when Fh &gt; Fth (the threshold for help requests), the
probability of reducing the complexity Lc (e.g., from heavy to medium or light) in the following state
increases. If the action a = "switch to a structured method" is taken at low Qv (choice-validity index),
the likelihood of Qv increasing in the following state increases. If the action a = "motivational
support" is taken when Td (decision time) is high, the likelihood of Td decreasing in the following
state increases.</p>
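        <p>To make this transition logic concrete, the sketch below shows one way such state-dependent
transitions could be implemented in Python. It is illustrative only; the threshold F_TH, the probability
increment, and the action names are assumptions rather than values fixed by the model.</p>
        <preformat>
import random

# Illustrative threshold and probability increment; the model does not fix these values.
F_TH = 2          # help-request threshold Fth
PROB_STEP = 0.2   # how strongly an action shifts a transition probability

def transition(state, action):
    """One stochastic transition of the student environment (sketch).

    state: dict with keys 'Td' (seconds), 'Qv' (1..3), 'Lc' (1..3), 'Fh' (count)
    action: one of 'hint', 'structured_method', 'motivation'
    """
    next_state = dict(state)

    # Rule 1: a hint given when help requests exceed Fth tends to lower difficulty Lc.
    if action == 'hint' and state['Fh'] &gt; F_TH:
        if random.random() &lt; 0.5 + PROB_STEP:
            next_state['Lc'] = max(1, state['Lc'] - 1)

    # Rule 2: switching to a structured method at low choice validity Qv tends to raise Qv.
    if action == 'structured_method' and state['Qv'] == 1:
        if random.random() &lt; 0.5 + PROB_STEP:
            next_state['Qv'] = min(3, state['Qv'] + 1)

    # Rule 3: motivational support at high decision time Td tends to reduce Td.
    if action == 'motivation' and state['Td'] &gt; 30:
        if random.random() &lt; 0.5 + PROB_STEP:
            next_state['Td'] = max(5, state['Td'] - 5)

    return next_state
        </preformat>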
        <p>State. s ∈ S is a formal representation of the environment's current state, which integrates
temporal (decision time), cognitive (accuracy, validity), and behavioral (help requests, processing
method) parameters. These parameters are formally defined in Eq. (4), where the level of task
complexity is denoted by Lc (light, medium, heavy; encoded as 1, 2, 3 respectively).</p>
        <p>The state is formally defined as</p>
        <p>s = {Td, Tr, Pc, Qv, Lc, Fh, Mp, Ct, Da, Ra}, (4)
where Td is the average decision-making time (seconds); Tr is the reaction time to problem situations
(fast, medium, slow); Pc is the accuracy of solutions (0 ≤ Pc ≤ 1), i.e., the share of correct answers
over the last N attempts; Qv is the level of validity of the choice (low, medium, high); Lc is the level
of complexity of the problem situation (light = 1, medium = 2, heavy = 3), and if the student
frequently requests help, the complexity decreases, Lc = max(Lc − ΔLc, 1); Fh is the frequency of
requests for help (the number of requests for hints recently); Mp is the method of information
processing (structured, intuitive, algorithmic); and the skills profile comprises Ct ∈ [0, 100], the level
of critical thinking, which increases when the student successfully completes a task, Ct = Ct + ΔCt;
Da ∈ [0, 100], data analysis; and Ra ∈ [0, 100], risk assessment.</p>
        <p>Action. a ∈ A(s) is the set of possible actions from which the RL agent can choose:</p>
        <p>a ∈ A(s) = {select task difficulty level; provide hints or explanations; change the method of
information processing; motivational support; pause}, (5)
taking into account the following parameters: the task difficulty level is light, medium, or heavy;
hints or explanations are provided (yes/no); the method of information processing can be changed by
(1) offering a structured method (e.g., an analysis algorithm), (2) offering an intuitive approach, or
(3) using forecasting algorithms; motivational support provides positive feedback or a motivational
message; and a pause offers a break to reduce cognitive load.</p>
        <p>The reward function R: S × A → ℝ determines the effectiveness of the choice:</p>
        <p>R(s, a) = w1·ΔTd + w2·ΔPc + w3·ΔQv + w4·ΔFh + w5·ΔSk, (6)
where ΔTd is the change in the average time to process information; ΔPc is the change in decision
accuracy; ΔQv is the change in the validity of the choice; ΔFh is the change in the frequency of
assistance requests; ΔSk is the change in the skills profile (Ct, Da, Ra); and w1, ..., w5 are weighting
factors that determine the importance of each parameter in the reward function. The overall reward
is composed of a positive component r+ and a negative component r−, defined below.</p>
          <p>To determine the weighting coefficients, a genetic algorithm was chosen, which finds the optimal
values of w_i by running an evolutionary search on the simulation environment.</p>
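          <p>As an illustration, the sketch below shows one way such a search could be organized. The
population size, mutation rate, and the stub fitness function are assumptions; in the actual study the
fitness of a weight vector would be obtained by running the tutoring simulation.</p>
          <preformat>
import random

def evaluate_weights(w):
    """Stub fitness: in the study this would run the simulation with the reward of
    Eq. (6) parameterized by w and return the resulting learning performance.
    Here it is a toy placeholder objective."""
    return -sum((wi - 1.0) ** 2 for wi in w)

def genetic_search(pop_size=20, generations=50, mutation_rate=0.1):
    # Random initial population of 5-dimensional weight vectors.
    population = [[random.uniform(0, 2) for _ in range(5)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=evaluate_weights, reverse=True)
        parents = scored[: pop_size // 2]                      # selection
        children = []
        while len(children) &lt; pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, 5)
            child = a[:cut] + b[cut:]                          # one-point crossover
            if random.random() &lt; mutation_rate:                # mutation
                idx = random.randrange(5)
                child[idx] += random.gauss(0, 0.1)
            children.append(child)
        population = parents + children
    return max(population, key=evaluate_weights)

best_w = genetic_search()
          </preformat>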
          <p>The reward is based on the effectiveness of solutions, speed, and cognitive load:</p>
          <p>r+ = { +5 if ΔTd &lt; 0; +10 if ΔPc &gt; 0; +7 if ΔQv &gt; 0; +5 if ΔFh &lt; 0; +10 if ΔSk &gt; 0 }.</p>
          <p>r− = { −5 if ΔTd &gt; 0; −10 if ΔPc &lt; 0; −7 if ΔQv &lt; 0; −5 if ΔFh &gt; 0; −10 if ΔSk &lt; 0 }.</p>
          <p>Reduced average time to process information (+5 points); increased number of correct answers
(+10 points); increased validity of choices (+7 points); reduced frequency of requests for assistance
(+5 points); improved skills profile (critical thinking, data analysis, risk assessment) (+10 points if
the skills profile is improved   ,   ,   ).</p>
          <p>Increase in average time to process information (-5 points); decrease in the number of correct
answers (-10 points); decrease in the validity of choices (-7 points); increase in the frequency of
requests for assistance (-5 points); deterioration of the skills profile (-10 points).</p>
          <p>The selected reward values are based on the impact of the relevant parameters on the quality of
decision-making in the learning context. Their weights are determined based on the following
considerations:</p>
          <p>Positive rewards</p>
          <p>Reduced average time to process information (+5 points) ⟶ Shorter decision-making time
indicates improved information processing skills. However, an excessive reduction in time
may not always be positive, so the weight is medium.</p>
          <p>Increase in the number of correct answers (+10 points) ⟶ The most important indicator of
learning effectiveness. Correct answers are a direct indication of the quality of learning and
therefore receive the highest reward.</p>
          <p>Increase in choice validity (+7 points) ⟶ High choice validity indicates improved critical
thinking and analysis, which is an important aspect of decision-making.</p>
          <p>Reduced frequency of requests for help (+5 points) ⟶ Less need for help indicates increased
independence and confidence. However, an excessive decrease may indicate an unwillingness to seek
the necessary support.</p>
          <p>Improved skill profile (critical thinking, data analysis, risk assessment) (+10 points) ⟶ These
are key cognitive skills that directly affect student performance, so improving them has a
high reward factor.</p>
          <p>Negative rewards</p>
          <p>Increased average time to process information (-5 points) ⟶ Indicates a deterioration in
thinking speed or excessive confusion.</p>
          <p>Decrease in the number of correct answers (-10 points) ⟶ This is a direct negative indicator
of learning effectiveness, so the penalty is maximum.</p>
          <p>Decrease in validity of choices (-7 points) ⟶ Indicates rash or unjustified decisions that may
negatively impact the learning process.</p>
          <p>Increased frequency of help-seeking (-5 points) ⟶ Indicates a decrease in independence but
is not a critical negative factor, as a certain level of support is natural.</p>
          <p>Deterioration in skill profile (-10 points) ⟶ The most undesirable outcome, as it indicates a
regression in learning.</p>
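          <p>Taken together, the weighted formulation of Eq. (6) and the piecewise bonuses and penalties
above can be combined into a single shaped reward. The sketch below is a minimal illustration; the
weight values and dictionary keys are assumptions, not values from the study.</p>
          <preformat>
# Illustrative weights w1..w5 (in the study they are tuned by a genetic algorithm);
# weights for Td and Fh would typically be negative, since a decrease is desirable.
W = [1.0, 1.0, 1.0, 1.0, 1.0]

def shaped_reward(prev, curr):
    """Weighted deltas of Eq. (6) plus the piecewise bonuses/penalties defined above."""
    d_td = curr['Td'] - prev['Td']
    d_pc = curr['Pc'] - prev['Pc']
    d_qv = curr['Qv'] - prev['Qv']
    d_fh = curr['Fh'] - prev['Fh']
    d_sk = (curr['Ct'] + curr['Da'] + curr['Ra']) - (prev['Ct'] + prev['Da'] + prev['Ra'])

    reward = W[0]*d_td + W[1]*d_pc + W[2]*d_qv + W[3]*d_fh + W[4]*d_sk

    # Piecewise components r+ and r-.
    reward += 5 if d_td &lt; 0 else (-5 if d_td &gt; 0 else 0)
    reward += 10 if d_pc &gt; 0 else (-10 if d_pc &lt; 0 else 0)
    reward += 7 if d_qv &gt; 0 else (-7 if d_qv &lt; 0 else 0)
    reward += 5 if d_fh &lt; 0 else (-5 if d_fh &gt; 0 else 0)
    reward += 10 if d_sk &gt; 0 else (-10 if d_sk &lt; 0 else 0)
    return reward
          </preformat>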
          <p>The agent's objective is to maximize the expected discounted return E[Σ_{t=0}^{∞} γ^t R(s_t, a_t)],
where γ is the discount factor, which determines the importance of future reward, R(s_t, a_t) is the
reward for action a_t in state s_t, and E[·] denotes the mathematical expectation.</p>
          <p>The policy π determines an adaptive strategy for choosing actions to optimize the learning
process. It takes into account the following factors:</p>
          <p>Selection of the task difficulty level Lc ∈ {light, medium, heavy}:
π(s) = heavy if Pc ≥ 0.8; medium if 0.6 ≤ Pc &lt; 0.8; light if Pc &lt; 0.6, where Pc is the accuracy of
solutions.</p>
          <p>Such values allow for a balanced stimulation of students to make quick, informed, and accurate
decisions while supporting the development of cognitive skills.</p>
          <p>Policy. π: S → A is a strategy for choosing actions depending on the current state. The policy can
be deterministic, a = π(s), where each state corresponds to one specific action a, or stochastic,
π(a | s), where each state s corresponds to a probability distribution over the choice of an action a.</p>
          <p>The optimal policy π* maximizes the expected sum of rewards over all future actions:</p>
          <p>π* = argmax_π E[Σ_t γ^t R(s_t, a_t) | s_0 = s]. (7)</p>
          <p>Providing hints: a hint is offered (value 1) when Fh &gt; Fth, where Fth is the help-request
threshold. The method of information processing Mp is switched (e.g., to a structured method) when
the validity of choices Qv is low, and motivational support is provided when the decision time Td
approaches the deadline for decision-making.</p>
          <p>The policy π thus adjusts the learning process adaptively, ensuring an optimal balance between
the complexity of tasks, the level of support, and the cognitive load.</p>
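          <p>The deterministic rules described in this subsection can be collected into a simple rule-based
policy. The sketch below is illustrative only; the threshold F_TH and the priority order of the rules are
assumptions, not the exact baseline configuration of the experiments.</p>
          <preformat>
F_TH = 2  # assumed help-request threshold Fth

def select_difficulty(pc):
    """Piecewise difficulty rule: heavy if Pc &gt;= 0.8, medium if 0.6 &lt;= Pc &lt; 0.8, light otherwise."""
    if pc &gt;= 0.8:
        return 'heavy'
    if pc &gt;= 0.6:
        return 'medium'
    return 'light'

def rule_based_tutor(state):
    """Deterministic policy pi(s) combining the rules of Section 3.2 (sketch)."""
    if state['Fh'] &gt; F_TH:
        return 'hint'                      # provide a hint when help requests exceed Fth
    if state['Qv'] == 1:                   # low choice validity
        return 'structured_method'
    if state['Td'] &gt; 30:                   # long decision times
        return 'motivation'
    return ('set_difficulty', select_difficulty(state['Pc']))
          </preformat>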
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Implementation of the simulation environment</title>
        <p>The simulation environment models the learning process in which the simulated student makes
decisions, and the tutor agent adapts the complexity of tasks and provides hints depending on the
student's performance.</p>
        <sec id="sec-3-3-1">
          <title>Lt, hints Hp).</title>
          <p>Consistent with Sutton &amp; Barto's RL framework [35], our implementation separates:
Agent (tutor policy): Implemented as PPO/DQN, selects instructional actions (e.g., task difficulty
Environment (student simulator): Generates new states st+1 and rewards rt based on actions at,
following predefined rules (e.g., if Hp=1, help requests Fh decrease).</p>
          <p>This distinction ensures the student's behavior is part of the environment's dynamics, while the
tutor (agent) learns to optimize interventions.</p>
          <p>The simulation environment was implemented using Python 3.10 and the OpenAI Gym
framework, with training conducted using the Stable-Baselines3 library. All experiments were run
on a workstation with an Intel Core i7 CPU, 32 GB RAM, and no GPU acceleration. To ensure
reproducibility, a fixed random seed was used across all runs. Training for each agent was conducted
over 50,000 timesteps using the following hyperparameters:</p>
          <p>PPO: learning_rate = 0.0003, gamma = 0.99, clip_range = 0.2, n_steps = 2048, batch_size = 64,
ent_coef = 0.01</p>
          <p>DQN: learning_rate = 0.001, batch_size = 32, gamma = 0.99, train_freq = 4,
target_update_interval = 500, exploration_fraction = 0.1, exploration_final_eps = 0.05.</p>
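          <p>Assuming the DecisionMakingEnv environment described later in this section is available as a
Gym environment, training runs with these hyperparameters can be set up with Stable-Baselines3
roughly as in the sketch below (the seed and verbosity values are assumptions):</p>
          <preformat>
from stable_baselines3 import PPO, DQN

env = DecisionMakingEnv()  # custom gym.Env described in this section

ppo_model = PPO(
    "MlpPolicy", env,
    learning_rate=0.0003, gamma=0.99, clip_range=0.2,
    n_steps=2048, batch_size=64, ent_coef=0.01,
    seed=42, verbose=1,            # fixed seed for reproducibility (assumed value)
)
ppo_model.learn(total_timesteps=50_000)

dqn_model = DQN(
    "MlpPolicy", env,
    learning_rate=0.001, batch_size=32, gamma=0.99,
    train_freq=4, target_update_interval=500,
    exploration_fraction=0.1, exploration_final_eps=0.05,
    seed=42, verbose=1,
)
dqn_model.learn(total_timesteps=50_000)
          </preformat>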
          <p>The initial state of the agent is s0 = {Td, Pc, Qv, Lc, Fh, Mp, Ct, Da, Ra}, where Td (time for
decision) = 30 s; Pc (decision accuracy) = 0.7 (70%); Qv (justification quality) = 1, i.e., medium;
Lc (task complexity level) = 2, i.e., medium (light = 1, medium = 2, heavy = 3); Fh (help-request
frequency) = 3 times per 5 tasks; Mp (processing method) = 1; Ct (critical thinking) = 60 out of 100;
Da (data analysis skills) = 50 out of 100; Ra (risk assessment skills) = 55 out of 100.</p>
          <p>All values at the start have average values - the agent starts training with standard characteristics.</p>
          <p>The agent's actions are drawn from the action set A(s) defined in Eq. (5). First, the agent chooses
actions randomly (Random Actions).</p>
          <p>The reward function is determined by the formula
R(s, a) = 10·(30 − Td) + 20·Pc + 15·Qv − 8·Fh + 15·(Ct + Da + Ra)/300,
where incentives are introduced for speed of decision-making, high accuracy, soundness of choice,
and skill development, and a penalty is applied for excessive requests for assistance.</p>
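          <p>For illustration, this reward can be computed directly from the state vector; the sketch below
uses dictionary keys matching the state symbols defined earlier (the worked value for the initial
state follows from the numbers given above):</p>
          <preformat>
def simulation_reward(s):
    """Reward of the simulation environment: speed, accuracy, validity and skills
    are rewarded, frequent help requests are penalized."""
    return (10 * (30 - s['Td'])
            + 20 * s['Pc']
            + 15 * s['Qv']
            - 8 * s['Fh']
            + 15 * (s['Ct'] + s['Da'] + s['Ra']) / 300)

# Example for the initial state s0:
# 10*(30-30) + 20*0.7 + 15*1 - 8*3 + 15*(60+50+55)/300 = 13.25
s0 = dict(Td=30, Pc=0.7, Qv=1, Fh=3, Ct=60, Da=50, Ra=55)
print(simulation_reward(s0))
          </preformat>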
          <p>The initial reward depends on the balance between speed, accuracy, and skill development.
Analysis of the Random Actions training results (Fig. 2) shows the following:</p>
          <p>The dynamics of the reward are growing, which indicates effective accumulation of useful
strategies by the agent. There are jumps in reward values: the agent finds profitable actions and
optimizes its behavior. There are no sharp drops in reward, which means that the agent does not
make critical mistakes in choosing actions. The maximum reward reached is 190 at the last step,
which is comparable to the models trained subsequently. The final state of the agent shows that the
agent has changed its characteristics and improved its skills.</p>
          <p>While random actions allowed the agent to make some progress, the lack of a directed strategy
limited its potential. Therefore, the next step was to implement reinforcement learning algorithms,
such as DQN and PPO, which allow the agent to adaptively improve its action policy.</p>
          <p>We use our own environment, DecisionMakingEnv, derived from gym.Env, with the initial state
vector {Td = 30, Pc = 0.7, Qv = 1, Lc = 1, Fh = 3, Mp = 1, Ct = 60, Da = 50, Ra = 55}. During
training, the state changes depending on the agent's actions.</p>
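          <p>A skeleton of what DecisionMakingEnv might look like is shown below, written against the
classic Gym reset/step API. The observation and action encodings and the update rules inside step()
are simplified assumptions for illustration, not the full environment used in the experiments.</p>
          <preformat>
import numpy as np
import gym
from gym import spaces

class DecisionMakingEnv(gym.Env):
    """Simplified sketch of the student-simulator environment."""

    def __init__(self):
        super().__init__()
        # 9 state variables: Td, Pc, Qv, Lc, Fh, Mp, Ct, Da, Ra
        self.observation_space = spaces.Box(low=0.0, high=100.0, shape=(9,), dtype=np.float32)
        # 5 actions: set difficulty, hint, change method, motivation, pause
        self.action_space = spaces.Discrete(5)
        self.reset()

    def reset(self):
        self.state = np.array([30, 0.7, 1, 1, 3, 1, 60, 50, 55], dtype=np.float32)
        return self.state.copy()

    def step(self, action):
        td, pc, qv, lc, fh, mp, ct, da, ra = self.state
        if action == 1:          # hint: fewer future help requests, slightly easier task
            fh = max(0, fh - 1)
            lc = max(1, lc - 1)
        elif action == 2:        # structured method: validity and accuracy improve
            qv = min(3, qv + 1)
            pc = min(1.0, pc + 0.02)
        elif action == 3:        # motivation: faster decisions
            td = max(5, td - 2)
        # actions 0 (set difficulty) and 4 (pause) omitted in this sketch
        ct, da, ra = min(100, ct + 1), min(100, da + 1), min(100, ra + 1)
        self.state = np.array([td, pc, qv, lc, fh, mp, ct, da, ra], dtype=np.float32)
        reward = 10 * (30 - td) + 20 * pc + 15 * qv - 8 * fh + 15 * (ct + da + ra) / 300
        done = bool(ct &gt;= 100 and da &gt;= 100 and ra &gt;= 100)
        return self.state.copy(), float(reward), done, {}
          </preformat>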
          <p>Training parameters DQN (Deep Q-Network): policy="MlpPolicy" (multilayer neural network);
learning_rate=0.001; batch_size=32; gamma=0.99 (discounting future awards); train_freq=4 (update
frequency); target_update_interval=500; exploration_fraction=0.1, exploration_final_eps=0.05;
total_timesteps=50_000.</p>
          <p>DQN (Deep Q-Network) uses Q-learning to update the Q-function. At each step, it updates the
Q-value using the formula
Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)], (8)
where r is the reward received, s′ is the next state, and α is the learning rate.</p>
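          <p>In tabular form, the update of Eq. (8) amounts to a one-line assignment; the sketch below is a
didactic illustration with assumed values for the learning rate, discount factor, and a discretized state
space (the actual DQN approximates Q with a neural network):</p>
          <preformat>
import numpy as np

alpha, gamma = 0.1, 0.99          # learning rate and discount factor (assumed values)
n_states, n_actions = 100, 5      # illustrative sizes for a discretized state space
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward the TD target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
          </preformat>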
          <p>DQN uses the experience replay mechanism to reduce the correlation between samples. Agent
training with DQN demonstrates a positive reward growth rate: at the beginning of training the
reward values are small, but over time the agent optimizes its actions and the reward stabilizes at a
high level.</p>
          <p>Initial stage (0-10 steps) (Fig. 3): the reward grows; there is an increase in Pc (accuracy of
solutions) and in Ct, Da, Ra (cognitive skills); the agent experiments with different actions and gets
mixed results.</p>
          <p>Thus, already at the early stages of learning, there is an increase in reward, which indicates the
agent's potential for adaptation. Next, let's see how the agent's behavior changes in the following
steps of learning.</p>
          <p>Middle stage (10-30 steps) (Fig. 4): the reward increases from 20 to 25 points.</p>
          <p>The agent starts choosing more effective strategies.</p>
          <p>There is a decrease in the frequency of requests for assistance (Fh).</p>
          <p>The agent starts to hold high values of Pc (accuracy), which indicates the right choice of
solutions.</p>
          <p>Thus, at the final stage of training with DQN, the agent reaches a stable level of reward, which
indicates the formed optimal policy (Fig. 6). For a deeper analysis, let's look at the overall reward
dynamics throughout the training.</p>
          <p>The graph "Agent reward dynamics with DQN" shows a smooth increase in reward, stabilizing at
26, which indicates successful training. The agent has optimized its policy and stopped exploring
after step 30, acting consistently according to the learned strategy. The use of a neural network
enables effective generalization and decision-making, confirming that the DQN agent has learned to
act optimally.</p>
          <p>During this period, the agent demonstrates a gradual improvement in behavior. However, further
stabilization is required for the strategy to be fully formed, which occurs at the following stage.</p>
          <p>Stabilization (30-50 steps) (Fig. 5): the reward reaches 26.0 and remains constant.</p>
          <p>The agent has learned the optimal actions and now acts almost without error.</p>
          <p>The parameters Ct, Da, Ra are close to 100, which indicates maximum skill development.</p>
          <p>The agent no longer changes its strategy because it has found the optimal policy.</p>
          <p>However, DQN has limitations in adaptation speed. To compare efficiency, Proximal Policy
Optimization (PPO) is also evaluated; PPO demonstrates improved dynamics in several tasks compared
with DQN, along with effective generalization of learned strategies.</p>
          <p>PPO updates its policy via stochastic gradient ascent, maximizing the expected reward. It employs a
clipped surrogate objective to stabilize learning and prevent performance degradation:
L^CLIP(θ) = E_t[min(r_t(θ)·Â_t, clip(r_t(θ), 1 − ε, 1 + ε)·Â_t)],
where r_t(θ) is the ratio of the new policy to the old one, Â_t is the estimate of the advantage of the
action, and ε is the clipping parameter.</p>
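          <p>The clipped objective can be evaluated directly for a batch of probability ratios and advantage
estimates; the sketch below uses NumPy and an assumed clipping parameter ε = 0.2:</p>
          <preformat>
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate objective: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.minimum(unclipped, clipped).mean()

# Example: ratios and advantages for a small batch
r = np.array([1.3, 0.7, 1.05])
A = np.array([2.0, -1.0, 0.5])
print(clipped_surrogate(r, A))   # the objective PPO maximizes (its negative is the loss)
          </preformat>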
          <p>When training an agent using Proximal Policy Optimization (PPO), there is a very rapid increase
in reward in the initial iterations, after which the graph reaches a plateau.</p>
          <p>The reward graph shows that the training is more efficient than in the case of DQN, as the agent
achieves consistently high rewards after 10-15 iterations (fig.7).</p>
          <p>Figure 6 shows that reward stabilizes after ~10 iterations.</p>
          <p>Compared with the rule-based tutor (Fh = 0.40, Pc = 0.70), PPO lowered the help-request rate
and raised task accuracy to Pc = 0.83 (+18 %).</p>
          <p>This quantitative gain confirms the qualitative trend in the reward curves.</p>
          <p>Offline validation on real learner logs. Although the core experiments were run in simulation, we
also performed an offline evaluation on the public ASSISTments-2017 dataset, which contains
942,816 anonymized student-task interactions from 10,425 learners covering 26 skills.</p>
          <p>Following standard off-policy evaluation protocols [36], we replayed the logged trajectories
through the learned policies and computed three metrics:</p>
          <p>Normalized Discounted Cumulative Gain (NDCG) for task accuracy.</p>
          <p>Inverse-Propensity-Scored (IPS) reward for help-request reduction.</p>
          <p>Doubly-Robust (DR) estimator for overall policy value.</p>
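          <p>As a sketch of how such off-policy estimates are formed, the snippet below computes single-step
(bandit-style) IPS and DR values for a logged batch; the array names and the reward-model values
are assumptions about the replay format, and the full sequential estimator of [36] recurses over
trajectories rather than single interactions.</p>
          <preformat>
import numpy as np

def ips_estimate(rewards, target_probs, logging_probs):
    """Inverse-propensity-scored reward: mean of (pi_e(a|s) / pi_b(a|s)) * r."""
    weights = target_probs / logging_probs
    return float(np.mean(weights * rewards))

def dr_estimate(rewards, target_probs, logging_probs, q_pi_e, q_logged):
    """Single-step doubly-robust estimate:
    mean of q_hat(s, pi_e) + (pi_e/pi_b) * (r - q_hat(s, a_logged))."""
    weights = target_probs / logging_probs
    return float(np.mean(q_pi_e + weights * (rewards - q_logged)))

# Toy logged batch: observed rewards, behaviour- and target-policy action probabilities
r        = np.array([1.0, 0.0, 1.0, 1.0])
pi_b     = np.array([0.5, 0.25, 0.5, 0.5])
pi_e     = np.array([0.8, 0.10, 0.6, 0.7])
q_pi_e   = np.array([0.7, 0.2, 0.6, 0.6])   # model reward under the target policy's action
q_logged = np.array([0.6, 0.3, 0.5, 0.6])   # model reward for the logged action

print(ips_estimate(r, pi_e, pi_b), dr_estimate(r, pi_e, pi_b, q_pi_e, q_logged))
          </preformat>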
          <p>The comparative results of the offline evaluation are summarized in Table 1.</p>
          <p>The PPO policy improves offline task-accuracy ranking (NDCG) by 17 % over the rule-based
baseline and increases the inverse-propensity-scored reward by 4.4 %. These findings suggest that the
simulated gains transfer to authentic learner data and reinforce the practical relevance of our
approach.</p>
          <p>PPO learns much faster than DQN and achieves significantly higher rewards. The PPO agent
adapts to the environment faster and finds the optimal strategy earlier. To summarize the results of
the experiments, we compare the key performance indicators of the agents for each approach. The
data are shown in Table 2.</p>
          <p>Based on the results of the experiments, the following conclusions can be drawn:</p>
          <p>Random Actions are not suitable for solving the problem because the agent has no mechanism to
optimize its decisions. DQN improves performance but requires more iterations to stabilize. PPO
provides the best performance and the fastest adaptation of the agent, making it the most efficient
method in this environment. For complex scenarios, PPO is the better choice, while DQN can be
useful in cases where stability and predictability are more important than learning speed.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>[Table 2 compares the three approaches (Random Actions, DQN, PPO) by reward dynamics,
stability, and convergence speed: Random Actions show chaotic reward growth and no convergence;
DQN stabilizes at a reward of about 26 after roughly 30-40 steps; PPO converges within roughly
10-15 steps and reaches the highest reward.]</p>
        <p>This study successfully demonstrated the effectiveness of Proximal Policy Optimization (PPO) in
enhancing adaptive tutoring systems aimed at improving learners' decision-making skills. The
proposed PPO-driven reinforcement learning framework significantly outperformed alternative
approaches (random actions and Deep Q-Network) by dynamically adapting instructional strategies
in response to real-time learner interactions. Specifically, PPO achieved approximately 12 times
higher cumulative rewards compared to DQN, optimizing factors such as hint delivery frequency,
task sequencing, and instructional complexity, as illustrated in Table 1, while converging to an
optimal adaptive policy in fewer iterations than DQN. This rapid convergence
translated directly into improved learner outcomes, including faster decision-making, greater task
accuracy, and enhanced cognitive skill development.</p>
        <p>Thus, the findings validate the potential of PPO-based reinforcement learning models for
personalized education, addressing the fundamental limitations of traditional, static e-learning
systems. Future research will focus on deploying this approach in authentic educational settings,
integrating multimodal data sources such as eye-tracking and emotion recognition, and exploring
long-term impacts on real-world learner cohorts. All code and experimental configurations will be
made publicly available to support reproducibility and further research.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Grammarly in order to check grammar and
spelling.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[7] N. Axak, A. Tatarnykov, The Behavior Model of the Computer User, in: Proc. IEEE 17th Int. Conf. Comput. Sci. Inf. Technol. (CSIT), 2022, pp. 458-461. doi:10.1109/CSIT56902.2022.10000499.</p>
      <p>[8] N. Axak, M. Kushnaryov, A. Tatarnykov, Agent-driven approach to enhancing e-learning efficiency, in: V. Vychuzhanin (Ed.), Advances in Information Control Systems and Technologies, Liha-Pres, Lviv, 380. doi:10.36059/978-966-397-422-4.</p>
      <p>[9] N. Axak, A. Tatarnykov, M. Kushnaryov, Agent-based method of improving the efficiency of the e-learning, in: Proc. 12th Int. Sci. Pract. Conf. Inf. Control Syst. Technol., CEUR Workshop Proc., vol. 3790, 2024, pp. 63-75.</p>
      <p>[10] A. K. Goel, L. Polepeddi, Jill Watson: A virtual teaching assistant for online education, Georgia Tech Tech. Rep. (2016).</p>
      <p>[11] L. Labadze, M. Grigolia, L. Machaidze, Role of AI chatbots in education: Systematic literature review, Int. J. Educ. Technol. High. Educ. 20 (2023) 56. doi:10.1186/s41239-023-00426-1.</p>
      <p>[12] R. Cerezo, et al., Differential efficacy of an intelligent tutoring system for university students: A case study with learning disabilities, Sustainability 12 (21) (2020) 9184.</p>
      <p>[13] J. Belda-Medina, et al., Chatbot-Human Interaction Satisfaction Model (CHISM), Int. J. Educ. Technol. High. Educ. 20 (2023) 62. doi:10.1186/s41239-023-00432-3.</p>
      <p>[14] I. González Díez, et al., Perceived satisfaction of university students with using chatbots as a tool for self-regulated learning, Educ. Inf. Technol. 28 (2023) 7665-7692.</p>
      <p>[15] R. Lindgren, S. Kakar, P. Maiti, K. Taneja, A. Goel, Does Jill Watson Increase Teaching Presence?, in: Proc. 11th ACM Conf. Learn. Scale, 2024, pp. 269-273.</p>
      <p>[16] M. Janssen, C. LeWarne, D. Burk, B. B. Averbeck, Hierarchical reinforcement learning, sequential behavior, and the dorsal frontostriatal system, J. Cogn. Neurosci. 34 (2022) 1307-1325. doi:10.1162/jocn_a_01869.</p>
      <p>[17] Y. Lei, A. Solway, Conflict and competition between model-based and model-free control, PLoS Comput. Biol. 18 (2022) e1010047. doi:10.1371/journal.pcbi.1010047.</p>
      <p>[18] J. Jih, Reinforcement Learning with Function Approximation: From Linear to Nonlinear, J. Mach. Learn. 2 (3) (2022) 161-193. doi:10.4208/jml.230105.</p>
      <p>[19] A. Triche, A. S. Maida, A. Kumar, Exploration in neo-Hebbian reinforcement learning: Computational approaches to the exploration-exploitation balance with bio-inspired neural networks, Neural Netw. 151 (2022) 16-33.</p>
      <p>[20] S. Flore, L. Albin, S. Csaba, Balancing optimism and pessimism in offline-to-online learning, arXiv:2502.08259 (2025).</p>
      <p>[21] C. Wu, T. Li, Z. Zhang, Y. Yu, Bayesian optimistic optimization: Optimistic exploration for model-based reinforcement learning, Adv. Neural Inf. Process. Syst. 35 (2022) 14210-14223.</p>
      <p>[22] J. Bayrooti, C. H. Ek, A. Prorok, Efficient Model-Based Reinforcement Learning Through Optimistic Thompson Sampling, arXiv:2410.04988 (2024). doi:10.48550/arXiv.2410.04988.</p>
      <p>[23] J. Beck, R. Vuorio, E. Z. Liu, Z. Xiong, L. Zintgraf, C. Finn, S. Whiteson, A survey of meta-reinforcement learning, arXiv:2301.08028 (2023). doi:10.48550/arXiv.2301.08028.</p>
      <p>[24] F. Robertazzi, M. Vissani, G. Schillaci, E. Falotico, Brain-inspired meta-reinforcement learning cognitive control in conflictual inhibition decision-making task for artificial agents, Neural Netw. 154 (2022) 283-302. doi:10.1016/j.neunet.2022.06.020.</p>
      <p>[25] R. Hattori, N. G. Hedrick, A. Jain, et al., Meta-reinforcement learning via orbitofrontal cortex, Nat. Neurosci. 26 (2023) 2182-2191. doi:10.1038/s41593-023-01485-3.</p>
      <p>[26] D. Arumugam, M. K. Ho, N. D. Goodman, B. Van Roy, Bayesian Reinforcement Learning with Limited Cognitive Load, Open Mind 8 (2024) 395-438. doi:10.1162/opmi_a_00132.</p>
      <p>[27] M. Binz, E. Schulz, Modeling human exploration through resource-rational reinforcement learning, Adv. Neural Inf. Process. Syst. 35 (2022) 31755-31768.</p>
      <p>[28] C. Wang, Y. Chen, K. P. Murphy, Model-based policy optimization under approximate Bayesian inference, in: ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023.</p>
      <p>[29] M. K. Eckstein, S. L. Master, R. E. Dahl, L. Wilbrecht, A. G. Collins, Reinforcement learning and Bayesian inference provide complementary models for the unique advantage of adolescents in stochastic reversal, Dev. Cogn. Neurosci. 55 (2022) 101106. doi:10.1016/j.dcn.2022.101106.</p>
      <p>[30] P. Kang, P. N. Tobler, P. Dayan, Bayesian reinforcement learning: A basic overview, Neurobiol. Learn. Mem. (2024) 107924.</p>
      <p>[31] T. L. Griffiths, N. Chater, J. B. Tenenbaum (Eds.), Bayesian Models of Cognition: Reverse Engineering the Mind, MIT Press, 2024.</p>
      <p>[32] D. G. Dillon, E. L. Belleau, J. Origlio, M. McKee, A. Jahan, A. Meyer, D. A. Pizzagalli, Using Drift Diffusion and RL Models to Disentangle Effects of Depression on Decision-Making vs. Learning in the Probabilistic Reward Task, Comput. Psychiatry 8 (1) (2024) 46. doi:10.5334/cpsy.108.</p>
      <p>[33] Y. Lei, A. Solway, Conflict and competition between model-based and model-free control, PLoS Comput. Biol. 18 (5) (2022) e1010047. doi:10.1371/journal.pcbi.1010047.</p>
      <p>[34] R. F. Prudencio, M. R. O. A. Maximo, E. L. Colombini, A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems, IEEE Trans. Neural Netw. Learn. Syst. 35 (8) (2024) 10237-10257. doi:10.1109/TNNLS.2023.3250269.</p>
      <p>[35] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, Cambridge, MA, 2018.</p>
      <p>[36] N. Jiang, L. Li, Doubly robust off-policy value evaluation for reinforcement learning, in: Proc. 33rd Int. Conf. Mach. Learn. (ICML), PMLR, 2016, pp. 652-661.</p>
      <p>[37] N. De La Fuente, D. A. V. Guerra, A comparative study of deep reinforcement learning models: DQN vs PPO vs A2C, arXiv:2407.14151 (2024).</p>
      <p>[38] L. L. Scientific, Performance comparison of reinforcement learning algorithms in the CartPole game using Unity ML-Agents, J. Theor. Appl. Inf. Technol. 102 (16) (2024).</p>
      <p>[39] A. Moltajaei Farid, J. Roshanian, M. Mouhoub, On-policy Actor-Critic reinforcement learning for multi-UAV exploration, arXiv:2409.XXXX (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>MacLellan, Intelligent tutors beyond K-12: An observational study of adult learner engagement and academic impact, under review (</article-title>
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Taneja</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Jill Watson</surname>
          </string-name>
          :
          <article-title>A Virtual Teaching Assistant powered by ChatGPT</article-title>
          , arXiv:
          <fpage>2405</fpage>
          .11070 (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence in education: The three paradigms</article-title>
          ,
          <source>Comput. Educ.: Artif. Intell</source>
          .
          <volume>2</volume>
          (
          <year>2021</year>
          )
          <article-title>100020</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.caeai.
          <year>2021</year>
          .
          <volume>100020</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Tarisayi</surname>
          </string-name>
          ,
          <article-title>A theoretical framework for interrogating the integration of AI in education</article-title>
          ,
          <source>Res. Educ. Media</source>
          <volume>16</volume>
          (
          <issue>1</issue>
          ) (
          <year>2024</year>
          )
          <fpage>38</fpage>
          44. doi:
          <volume>10</volume>
          .2478/rem-2024-0006.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Holstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>McLaren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Aleven</surname>
          </string-name>
          ,
          <article-title>A conceptual framework for human-AI hybrid adaptivity in education</article-title>
          ,
          <source>in: Proc. Int. Conf. Artif. Intell. Educ. (AIED)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>240</fpage>
          <lpage>251</lpage>
          . doi:
          <volume>10</volume>
          .1007/978- 3-
          <fpage>030</fpage>
          -52237-7_
          <fpage>20</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Axak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kushnaryov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tatarnykov</surname>
          </string-name>
          ,
          <article-title>The Agent-Based Learning Platform</article-title>
          ,
          <source>in: Proc. XI Int. Sci. Pract</source>
          .
          <source>Conf. Inf. Control Syst. Technol., CEUR Workshop Proc.</source>
          , vol.
          <volume>3513</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>263</fpage>
          <lpage>275</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>