=Paper=
{{Paper
|id=Vol-2872/paper02
|storemode=property
|title=A comparison of exploration strategies used in reinforcement learning for building
an intelligent tutoring system
|pdfUrl=https://ceur-ws.org/Vol-2872/paper02.pdf
|volume=Vol-2872
|authors=Jezuina Koroveshi,Ana Ktona
|dblpUrl=https://dblp.org/rec/conf/rtacsit/KoroveshiK21
}}
==A comparison of exploration strategies used in reinforcement learning for building an intelligent tutoring system==
Jezuina Koroveshi, Ana Ktona
University of Tirana, Faculty of Natural Sciences, Tirana, Albania
Abstract
Reinforcement learning is a form of machine learning where an intelligent agent learns to make decisions by interacting with some environment. The agent may have no prior knowledge of the environment and discovers it through interaction. For every action that the agent takes, the environment gives a reward signal that is used to measure how good or bad that action was. In this way, the agent learns which actions are more favorable to take in each state of the environment. There are different approaches to solving a reinforcement learning problem, but one challenge that arises during this process is the tradeoff between exploration and exploitation. In this work we focus on studying different exploration strategies and compare their effect on the performance of an intelligent tutoring system that is modeled as a reinforcement learning problem. An intelligent tutoring system is a system that supports the process of teaching and learning by adapting to student needs and behaving differently for each student. We train this system using reinforcement learning with different exploration strategies and compare training and testing performance to find the best strategy.
Keywords
Reinforcement learning, exploration strategies, intelligent tutoring system
1. Introduction

Intelligent educational systems are systems that apply techniques from the field of Artificial Intelligence to provide better support for the users of the system [1]. Web-based Adaptive and Intelligent Educational Systems provide intelligence and student adaptability, inheriting properties from Intelligent Tutoring Systems (ITS) and Adaptive Hypermedia Systems (AHS) [2]. [3] defines an Intelligent Tutoring System (ITS) as a computer-aided instructional system with models of instructional content that specify what to teach, and teaching strategies that specify how to teach.
Traditional tutoring systems use the one-to-many way of presenting the learning materials to the students. In this approach every student is given the same materials to learn regardless of his/her needs and preferences. These systems are not well suited for all students because they may come from different backgrounds, may have different learning styles and do not absorb the lessons at the same pace. An intelligent tutoring system customizes the learning experience that the student perceives by taking into consideration factors such as pre-existing knowledge, learning style and student progress. According to [4] an intelligent tutoring system usually has the following modules: the student module, which manages all the information related to the student during the learning process; the domain module, which contains all the information related to the knowledge to teach, such as topics, tasks, the relations between them and their difficulty; the pedagogical module, also called the tutor module, which decides what, how and when to teach the learning materials; and the graphical user interface module, which facilitates the communication between the system and the student.
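Purely as an illustration of this modular structure, the four modules could be organized along the following lines; the names and fields are hypothetical and are not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class StudentModule:
    """Tracks what the student knows during the learning process."""
    known_concepts: set = field(default_factory=set)

@dataclass
class DomainModule:
    """Holds the material to teach: lessons, the concepts each one covers, prerequisites, difficulty."""
    lessons: dict = field(default_factory=dict)        # lesson id -> concepts taught
    prerequisites: dict = field(default_factory=dict)  # lesson id -> concepts required beforehand

class PedagogicalModule:
    """Decides what, how and when to teach next (the part trained with RL in this work)."""
    def next_lesson(self, student: StudentModule, domain: DomainModule) -> int:
        raise NotImplementedError

class UserInterfaceModule:
    """Mediates communication between the system and the student."""
    def present(self, lesson_id: int) -> None:
        print(f"Presenting lesson {lesson_id}")
```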
Different techniques from artificial intelligence can be applied in order to make these systems more "intelligent", but our study is focused on the use of reinforcement learning (RL). Reinforcement learning is a form of machine learning that is based on learning from experience. The learner is exposed to some environment, for which it may or may not have information, starts making decisions, and gets some feedback that tells how good or bad each decision was. Based on the feedback from the environment the learner learns which decisions are more favorable to take. This class of machine learning has been used in modeling and building intelligent tutoring systems such as in the works from [5], [6], [7], [8], [9], [10].
The remainder of this paper is organized as follows: in section 2 we give an overview of reinforcement learning, in section 3 we describe the model that we have used to build an intelligent tutoring system, in section 4 we give the experimental results of training the model using different exploration strategies, and in section 5 we give the conclusions of our work.

2. Reinforcement learning

Reinforcement learning is a form of machine learning in which the learner learns some sequence of actions by interacting with the environment. The learner is in a state of the environment, takes some action that moves it from that state to another, and after each action the environment gives a reward signal. This reward signal is used to learn which are the best states to be in, and therefore which action to take in order to reach those states. A reinforcement learning problem can be modeled as a Markov Decision Process (MDP). An MDP is a stochastic process that satisfies the Markov property. In a finite MDP, the sets of states, actions and rewards have a finite number of elements. Formally, a finite MDP can be defined as a tuple M = (S, A, P, R, γ), where:

• S is the set of states: S = (s1, s2, …, sn).
• A is the set of actions: A = (a1, a2, …, an).
• γ ∈ [0,1] is the discount factor and is used to control the weight of future reward in comparison to immediate rewards.
• P defines the probability of a transition from s to s' when taking action a in state s: Pss' = Pr{st+1 = s' | st = s, at = a}.
• R defines the reward function for each of the transitions, the reward we get if we take action a in state s and end up in state s': Rss' = E{rt+1 | st = s, at = a, st+1 = s'}.

The goal of the agent is to maximize the total reward it receives. The agent should maximize the total cumulative reward it receives in the long run, not just the immediate reward [11]. The expected discounted return is defined as follows by [11]:

Gt = Rt+1 + γ Rt+2 + γ^2 Rt+3 + … = Σk γ^k Rt+k+1

The sequence of states that ends up in a terminal state is called an episode. The general process of RL may be defined as follows:

1. At each time step t, the agent is in a state s(t).
2. The agent chooses one of the possible actions in this state, a(t), and applies that action.
3. After applying the action, the agent transitions to a new state s(t+1) and gets a numerical reward r(t) from the environment.
4. If the new state is not terminal, the agent repeats step 2; otherwise the episode is finished.
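This interaction loop can be written down directly. The following minimal Python sketch assumes a hypothetical environment object with `reset()` and `step(action)` methods and accumulates the discounted return Gt defined above; it illustrates the general process, not the specific system described later in this paper:

```python
import random

def run_episode(env, policy, gamma=0.99, max_steps=100):
    """Run one episode following `policy` and return the discounted return G0."""
    state = env.reset()                               # 1. start in some state s(t)
    g, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                        # 2. choose an action a(t)
        next_state, reward, done = env.step(action)   # 3. transition and receive r(t)
        g += discount * reward                        # add gamma^k * R(t+k+1) to the return
        discount *= gamma
        state = next_state
        if done:                                      # 4. stop when a terminal state is reached
            break
    return g

# Example usage with a purely random policy (hypothetical `env` with an `actions()` method):
# total = run_episode(env, policy=lambda s: random.choice(env.actions()))
```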
2.1 Exploration and exploitation dilemma

One challenge of reinforcement learning is the tradeoff between exploration and exploitation [11]. As given by [11]: "To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future". There are different strategies that can be used to handle this problem:
1. Random policy: during the training process the agent always chooses random actions. This means that it always explores and does not exploit what it has already learned.
2. Greedy policy: during the training process the agent always chooses the action that gives the best reward. In this way, it is always exploiting the knowledge it has gained and uses it to choose the action that gives the best reward.
3. Epsilon-greedy: this method balances the tradeoff between exploration and exploitation. With probability ε it chooses a random action, and with probability 1 − ε it chooses the best action. The epsilon value decreases with time, reducing exploration and increasing exploitation in order to make use of the knowledge that has been gained.
4. Boltzmann (soft-max) exploration: one problem of the epsilon-greedy method is that the exploration action is selected uniformly at random from the set of actions. This means that it is equally likely to choose the worst-appearing action and the second-best-appearing action. The Boltzmann exploration uses the Boltzmann distribution [12] to assign a probability Pt(a) to each action a based on its current value estimate Qt(a): Pt(a) = e^(Qt(a)/T) / Σb e^(Qt(b)/T). T is a temperature parameter. When T = 0 the agent does not explore at all, and when T → ∞ the agent selects actions randomly.
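As a rough sketch of how these four strategies pick among actions, assuming the agent keeps a value estimate Qt(a) for every action (an illustration only, not the exact implementation used in the experiments):

```python
import math
import random

def select_action(q_values, strategy="epsilon_greedy", epsilon=0.1, temperature=1.0):
    """Pick an action index from a list of estimated action values Q(a)."""
    actions = list(range(len(q_values)))
    if strategy == "random":
        # always explore: ignore the value estimates
        return random.choice(actions)
    if strategy == "greedy":
        # always exploit: take the currently best-looking action
        return max(actions, key=lambda a: q_values[a])
    if strategy == "epsilon_greedy":
        # explore with probability epsilon, otherwise exploit
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q_values[a])
    if strategy == "boltzmann":
        # soft-max: P(a) = exp(Q(a)/T) / sum_b exp(Q(b)/T)
        prefs = [math.exp(q / temperature) for q in q_values]
        total = sum(prefs)
        return random.choices(actions, weights=[p / total for p in prefs])[0]
    raise ValueError(f"unknown strategy: {strategy}")
```

With a decaying epsilon or temperature, the same selection function gradually shifts from exploration towards exploitation as training progresses.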
3. Proposed model

The model that we propose focuses on the pedagogical module of the intelligent tutoring system. This is a system for teaching lessons of the Python programming language based on concepts and student knowledge. The learning material is composed of lessons. Every lesson teaches some concepts and may require some previous concepts to be known by the student. In [14] we give a definition of lessons, concepts, student knowledge and how they are related to each other. The student starts learning the course material. The system gives the student a lesson that teaches some concepts. Depending on the student's ability to learn, he/she may learn these concepts or not. If the student does not learn all the concepts given by the current lesson, the system cannot give him/her a new lesson. So, the system should make sure that the student has absorbed all the material given by the current lesson before giving the next one. We propose the use of reinforcement learning to train the pedagogical module that, based on student knowledge and the concepts taught by each lesson, decides what lesson to give him/her. The system will start by giving the first lesson, and then, following the student's progress, will give every other lesson until the end of the course. To model this as a reinforcement learning problem, we need to define the set of states, actions and rewards. In [15] we have given a definition of those elements that creates a framework for doing the training using a reinforcement learning approach. One problem that arises when dealing with reinforcement learning is that the training requires a relatively large number of iterations and data. This cannot be achieved using real students, because the process would be very long. In [15] we have proposed the use of a simulated student that can be used during the training process. The student has some ability to learn, which is given in the form of a learning probability, and this defines his/her ability to learn every concept that is taught by the lessons of the course.

4. Experimental results

We have done the training in a simulated environment by simulating the behavior of the student. For every episode the student starts with knowing random concepts, and the system tries to learn what is the next lesson to give. We have used the DQN algorithm as given by [13], using memory replay and a target network. Figure 1 gives the architecture of the target and train networks.

Figure 1: The architecture of the neural network
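For orientation only, the sketch below shows the general shape of a DQN-style update with a replay memory and a target network in the spirit of [13]; the state size, number of actions, layer sizes and hyper parameters are placeholders and do not correspond to the actual values of Figures 1 and 2:

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 20, 10, 0.99   # placeholder sizes, not the paper's values

def make_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

train_net, target_net = make_net(), make_net()
target_net.load_state_dict(train_net.state_dict())        # target starts as a copy
optimizer = torch.optim.Adam(train_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                              # stores (state, action, reward, next_state, done)

def dqn_update(batch_size=32):
    """One gradient step on a random minibatch sampled from the replay memory."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    states, actions, rewards, next_states, dones = map(
        lambda x: torch.tensor(x, dtype=torch.float32), zip(*batch))
    q = train_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                  # bootstrap from the target network
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + GAMMA * q_next * (1 - dones)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the full algorithm of [13], transitions collected while interacting with the environment (here, the simulated student) are appended to the replay memory, and the target network weights are periodically replaced with a copy of the training network weights.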
The hyper parameters used during the training are given in Figure 2.

Figure 2: The hyper parameters used during training/testing

The training is done using different exploration strategies for the same number of episodes. For each of the strategies we give the total reward received for every episode during the training process in figures 3, 4, 5 and 6.

Figure 3: Reward per episode for random strategy
Figure 4: Reward per episode for greedy strategy
Figure 5: Reward per episode for epsilon-greedy strategy
Figure 6: Reward per episode for Boltzmann strategy

4.1. Testing

After we performed the training, we tested the performance of each of the learned models by using them in simulations, for 100 episodes with a student that knows random concepts and has the same learning probability as the one used during the training process. For each of the tests, we show the total reward received and the length of each test episode in figures 7 to 14.
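A test run of this kind can be sketched as follows, assuming the hypothetical simulated-student environment `env` and the trained network `q_net` from the sketch above; the curves in figures 7 to 14 come from the actual system, not from this snippet:

```python
import torch

def evaluate(env, q_net, episodes=100, max_steps=100):
    """Greedily roll out the learned network and record reward and length per episode."""
    rewards, lengths = [], []
    for _ in range(episodes):
        state, total, steps, done = env.reset(), 0.0, 0, False
        while not done and steps < max_steps:
            with torch.no_grad():
                q = q_net(torch.tensor(state, dtype=torch.float32))
            state, reward, done = env.step(int(q.argmax()))   # always pick the best action
            total += reward
            steps += 1
        rewards.append(total)
        lengths.append(steps)
    return rewards, lengths
```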
Figure 7: Reward per episode in testing random strategy
Figure 8: Episode length in testing random strategy
Figure 9: Reward per episode in testing greedy strategy
Figure 10: Episode length in testing greedy strategy
Figure 11: Reward per episode in testing epsilon-greedy strategy
Figure 12: Episode length in testing epsilon-greedy strategy
Figure 13: Reward per episode in testing Boltzmann strategy
Figure 14: Episode length in testing Boltzmann strategy

5. Conclusion

In this work we have compared the performance of different exploration strategies used in training an intelligent tutoring system using reinforcement learning. We took into consideration 4 strategies: random, greedy, epsilon-greedy and Boltzmann (soft-max). For each of the strategies used, we have considered the reward gained for every episode during the training and testing, to evaluate which one performed better. We saw that during the training phase, random and greedy strategies performed worse.
The reward was negative for every episode, which means that they chose the worst action most of the time. For the random policy this means that it always explores and never exploits the knowledge. For the greedy policy this means that it always tries to exploit its knowledge, but it never explores for new actions that may be more profitable. On the other hand, the epsilon-greedy and Boltzmann strategies performed best during the training phase, with the Boltzmann strategy getting slightly higher rewards. These strategies use a combination of exploration and exploitation, which makes them perform better.
During the testing phase we see that the greedy policy performs worse than every other policy. This shows that the system has not learned anything during the training phase. Random and epsilon-greedy policies performed well during the testing phase with almost the same reward gained. Even though the random policy performed poorly during the training phase, it did quite well during testing, meaning that the high level of exploration learned some good actions. The Boltzmann policy was the best during the testing phase, getting the highest reward values. This shows that this policy learned better which are the best actions to take. Also, comparing the episode lengths during the testing phase, the Boltzmann strategy has the shortest episodes. This shows that it finishes each episode without reaching the episode length limit, meaning that it finishes the episode faster because it takes the right actions.

6. References

[1] Brusilovsky, P. & Peylo, C. (2003). Adaptive and Intelligent Web-based Educational Systems. International Journal of Artificial Intelligence in Education (IJAIED), 13, pp. 159-172. {hal-00197315}
[2] Iglesias, A., Martinez, P., & Fernandez, F. (2003). An Experience Applying Reinforcement Learning in a Web-Based Adaptive and Intelligent Educational System. Informatics in Education, 2(2), 223–240. https://doi.org/10.15388/infedu.2003.17
[3] Wenger, E. (1987). Artificial Intelligence and Tutoring Systems. Morgan Kaufmann.
[4] Burns, H. L. & Capps, C. G. (1988). Foundations of intelligent tutoring systems: an introduction. In Foundations of Intelligent Tutoring Systems (eds M. C. Polson & J. J. Richardson). Lawrence Erlbaum, London, pp. 1–19.
[5] Malpani, A., Ravindran, B., & Murthy, H. (2011). Personalized Intelligent Tutoring System using Reinforcement Learning. In Florida Artificial Intelligence Research Society Conference. Retrieved from https://aaai.org/ocs/index.php/FLAIRS/FLAIRS11/paper/view/2597/3105
[6] Martin, K. N., & Arroyo, I. (2004). AgentX: Using Reinforcement Learning to Improve the Effectiveness of Intelligent Tutoring Systems. Intelligent Tutoring Systems, 564–572. https://doi.org/10.1007/978-3-540-30139-4_53
[7] Nasir, M., Fellus, L., & Pitti, A. (2018). SPEAKY Project: Adaptive Tutoring System based on Reinforcement Learning for Driving Exercizes and Analysis in ASD Children. ICDL-EpiRob Workshop on "Understanding Developmental Disorders: From Computational Models to Assistive Technologies", Tokyo, Japan. ⟨hal-01976660⟩
[8] Sarma, B. H. S., & Ravindran, B. (2007). Intelligent Tutoring Systems using Reinforcement Learning to teach Autistic Students. Home Informatics and Telematics: ICT for The Next Billion, 241, 65–78. https://doi.org/10.1007/978-0-387-73697-6_5
[9] Shawky, D., & Badawi, A. (2018). A Reinforcement Learning-Based Adaptive Learning System. The International Conference on Advanced Machine Learning Technologies and Applications (AMLTA2018), 221–231. https://doi.org/10.1007/978-3-319-74690-6_22
[10] Wang, F. (2018). Reinforcement Learning in a POMDP Based Intelligent Tutoring System for Optimizing Teaching Strategies. International Journal of Information and Education Technology, 8(8), 553–558. https://doi.org/10.18178/ijiet.2018.8.8.1098
[11] Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT Press.
[12] Barto, A. G., Bradtke, S. J., & Singh, S. P. (1991). Real-time learning and control using asynchronous dynamic programming. University of Massachusetts at Amherst, Department of Computer and Information Science.
[13] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236
[14] Koroveshi, J., & Ktona, A. (2020). Modelling an Intelligent Tutoring System Using Reinforcement Learning. Knowledge International Journal, 43(3), 483–487. Retrieved from https://ikm.mk/ojs/index.php/KIJ/article/view/4745
[15] Koroveshi, J., & Ktona, A. (2021). Training an Intelligent Tutoring System Using Reinforcement Learning. International Journal of Computer Science and Information Technology, 19(3), 10–18. http://doi.org/10.5281/zenodo.4661455