A comparison of exploration strategies used in reinforcement
learning for building an intelligent tutoring system
Jezuina Koroveshi a, Ana Ktona b
a
    University of Tirana, Faculty of Natural Sciences, Tirana, Albania
b
    University of Tirana, Faculty of Natural Sciences, Tirana, Albania


                  Abstract
                  Reinforcement learning is a form of machine learning in which an intelligent agent learns to
                  make decisions by interacting with an environment. The agent may have no prior knowledge
                  of the environment and discovers it through interaction. For every action that the agent takes,
                  the environment returns a reward signal that measures how good or bad that action was. In
                  this way, the agent learns which actions are more favorable to take in every state of the
                  environment. There are different approaches to solving a reinforcement learning problem, but
                  one difficulty that arises during this process is the tradeoff between exploration and
                  exploitation. In this work we study different exploration strategies and compare their effect
                  on the performance of an intelligent tutoring system that is modeled as a reinforcement
                  learning problem. An intelligent tutoring system is a system that supports the process of
                  teaching and learning by adapting to student needs and behaving differently for each student.
                  We train this system using reinforcement learning with different exploration strategies and
                  compare the training and testing performance to find which strategy works best.

                  Keywords
                  Reinforcement learning, exploration strategies, intelligent tutoring system


1. Introduction

    Intelligent educational systems are systems that apply techniques from the field of Artificial Intelligence to provide better support for the users of the system [1]. Web-based Adaptive and Intelligent Educational Systems provide intelligence and student adaptability, inheriting properties from Intelligent Tutoring Systems (ITS) and Adaptive Hypermedia Systems (AHS) [2]. [3] defines an Intelligent Tutoring System (ITS) as a computer-aided instructional system with models of instructional content that specify what to teach, and teaching strategies that specify how to teach.
    Traditional tutoring systems use a one-to-many way of presenting the learning materials to the students. In this approach every student is given the same materials to learn, regardless of his/her needs and preferences. These systems are not well suited for all students, because students may come from different backgrounds, may have different learning styles, and do not absorb the lessons at the same pace. An intelligent tutoring system customizes the learning experience that the student perceives by taking into consideration factors such as pre-existing knowledge, learning style and student progress. According to [4], an intelligent tutoring system usually has the following modules: the student module, which manages all the information related to the student during the learning process; the domain module, which contains all the information related to the knowledge to teach, such as topics, tasks, the relations between them and their difficulty; the pedagogical module, also called the tutor module, which decides what, how and when to teach the learning materials; and the graphical user interface module, which facilitates the communication between the system and the student.
Different techniques from artificial intelligence can be applied in order to make these systems more "intelligent", but our study is focused on the use of reinforcement learning (RL).
    Reinforcement learning is a form of machine learning that is based on learning from experience. The learner is exposed to some environment, about which it may or may not have information, starts making decisions and gets feedback telling how good or bad each decision was. Based on the feedback from the environment, the learner learns which decisions are more favorable to take. This class of machine learning has been used in modeling and building intelligent tutoring systems, such as in the works of [5], [6], [7], [8], [9], [10].
    The remainder of this paper is organized as follows: in section 2 we give an overview of reinforcement learning, in section 3 we describe the model that we have used to build an intelligent tutoring system, in section 4 we give the experimental results of training the model using different exploration strategies, and in section 5 we give the conclusions of our work.

2. Reinforcement learning

    Reinforcement learning is a form of machine learning in which the learner learns some sequence of actions by interacting with the environment. The learner is in a state of the environment, takes some action that moves it from that state to another, and after each action the environment gives a reward signal. This reward signal is used to learn which are the best states to be in, and therefore which action to take in order to reach those states. A reinforcement learning problem can be modeled as a Markov Decision Process (MDP). An MDP is a stochastic process that satisfies the Markov property. In a finite MDP, the sets of states, actions and rewards have a finite number of elements. Formally, a finite MDP can be defined as a tuple M = (S, A, P, R, γ), where:

    • S is the set of states: S = (s1, s2, …, sn).
    • A is the set of actions: A = (a1, a2, …, an).
    • γ ∈ [0, 1] is the discount factor and is used to control the weight of future rewards in comparison to immediate rewards.
    • P defines the probability of a transition from s to s' when taking action a in state s: P^a_{ss'} = Pr{s_{t+1} = s' | s_t = s, a_t = a}.
    • R defines the reward function for each of the transitions, the reward we get if we take action a in state s and end up in state s': R^a_{ss'} = E{r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'}.

The goal of the agent is to maximize the total reward it receives. The agent should maximize the total cumulative reward it receives in the long run, not just the immediate reward [11]. The expected discounted return is defined as follows by [11]:

    G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}

The sequence of states that ends in a terminal state is called an episode. The general process of RL may be described as follows:

    1. At each time step t, the agent is in a state s(t).
    2. The agent chooses one of the possible actions in this state, a(t), and applies that action.
    3. After applying the action, the agent transitions to a new state s(t+1) and gets a numerical reward r(t) from the environment.
    4. If the new state is not terminal, the agent repeats from step 2; otherwise the episode is finished.
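To make this interaction loop concrete, the following minimal Python sketch runs one episode following steps 1-4 and accumulates the discounted return G_t. The environment object with reset() and step(action) methods and the policy function are illustrative assumptions (a Gym-style interface), not the implementation used in this work.

import random

def run_episode(env, policy, gamma=0.99, max_steps=100):
    """Run one episode following steps 1-4 and return the discounted return G_t."""
    state = env.reset()                     # step 1: the agent starts in some state s(t)
    episode_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)              # step 2: choose one of the possible actions a(t)
        next_state, reward, done = env.step(action)   # step 3: transition and reward r(t)
        episode_return += discount * reward            # accumulate gamma^k * R_{t+k+1}
        discount *= gamma
        state = next_state
        if done:                            # step 4: stop when a terminal state is reached
            break
    return episode_return

# Example usage with a purely random policy over a hypothetical action list:
# total = run_episode(env, policy=lambda s: random.choice(env.actions))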
2.1. Exploration and exploitation dilemma

    One challenge of reinforcement learning is the tradeoff between exploration and exploitation [11]. As given by [11]: "To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future". There are different strategies that can be used to handle this problem:
    1. Random policy: during the training process the agent always chooses random actions. This means that it always explores and does not exploit what it has already learned.
    2. Greedy policy: during the training process the agent always chooses the action that gives the best reward. In this way, it is always exploiting the knowledge that it has gained and uses it to choose the action that gives the best reward.
    3. Epsilon-greedy: this method balances the tradeoff between exploration and exploitation. With probability ε it chooses a random action, and with probability 1 − ε it chooses the best action. The epsilon value decreases with time, reducing exploration and increasing exploitation in order to make use of the knowledge gained.
    4. Boltzmann (soft-max) exploration: one problem of the epsilon-greedy method is that the exploration action is selected uniformly at random from the set of actions. This means that it is equally likely to choose the worst-appearing action and the second-best-appearing action. Boltzmann exploration uses the Boltzmann distribution [12] to assign a probability P_t(a) to each action:

        P_t(a) = e^{Q_t(a)/T} / Σ_{b=1}^{n} e^{Q_t(b)/T}

        where Q_t(a) is the current estimate of the value of action a and T is a temperature parameter. When T = 0 the agent does not explore at all, and when T → ∞ the agent selects actions randomly.
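As an illustration only, the four selection rules can be sketched in Python as follows. Here q_values is assumed to be an array of estimated action values for the current state, and the parameter names epsilon and temperature are ours; this is a minimal sketch, not the exact implementation used in the experiments.

import numpy as np

def select_action(q_values, strategy="epsilon_greedy", epsilon=0.1, temperature=1.0):
    """Pick an action index from the estimated action values using one of the four strategies."""
    q_values = np.asarray(q_values, dtype=float)
    n = len(q_values)
    if strategy == "random":              # 1. always explore
        return np.random.randint(n)
    if strategy == "greedy":              # 2. always exploit the current estimates
        return int(np.argmax(q_values))
    if strategy == "epsilon_greedy":      # 3. explore with probability epsilon, exploit otherwise
        if np.random.rand() < epsilon:
            return np.random.randint(n)
        return int(np.argmax(q_values))
    if strategy == "boltzmann":           # 4. P_t(a) proportional to exp(Q_t(a) / T)
        prefs = q_values / temperature
        prefs -= prefs.max()              # shift by the maximum for numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return int(np.random.choice(n, p=probs))
    raise ValueError("unknown strategy: " + strategy)

In practice, epsilon (and, similarly, the temperature) would be decayed over the training episodes, as described above.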
3. Proposed model

    The model that we propose focuses on the pedagogical module of the intelligent tutoring system. This is a system for teaching lessons of the Python programming language based on concepts and student knowledge. The learning material is composed of lessons. Every lesson teaches some concepts and may require some previous concepts to be known by the student. In [14] we give a definition of lessons, concepts, student knowledge and how they are related to each other. The student starts learning the course material. The system gives the student a lesson that teaches some concepts. Depending on the student's ability to learn, he/she may learn these concepts or not. If the student does not learn all the concepts given by the current lesson, the system cannot give him/her a new lesson. So, the system should make sure that the student has absorbed all the material given by the current lesson before giving the next one. We propose the use of reinforcement learning to train the pedagogical module, which, based on the student's knowledge and the concepts that are taught by each lesson, decides what lesson to give him/her. The system will start by giving the first lesson, and then, following the student's progress, will give every other lesson until the end of the course. To model this as a reinforcement learning problem, we need to define the set of states, actions and rewards. In [15] we have given a definition of those elements that creates a framework for training using a reinforcement learning approach. One problem that arises when dealing with reinforcement learning is that a relatively large number of iterations and a large amount of data are required for training. This cannot be achieved using real students, because the process would be very long. In [15] we have proposed the use of a simulated student during the training process. The student has some ability to learn, which is given in the form of a learning probability, and this defines his/her ability to learn every concept that is taught by the lessons of the course.
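For illustration, one possible way to code such a simulated student is sketched below, assuming a single fixed learning probability per concept; the class name, attributes and prerequisite handling are assumptions of this sketch rather than the exact formulation given in [14] and [15].

import random

class SimulatedStudent:
    """A simple simulated student: each concept taught by a lesson is learned
    independently with a fixed learning probability, provided that the
    prerequisite concepts are already known."""

    def __init__(self, num_concepts, learning_probability=0.7, known=None):
        self.learning_probability = learning_probability
        self.num_concepts = num_concepts
        # The student may start an episode already knowing some random concepts.
        self.known = set(known) if known is not None else set()

    def take_lesson(self, lesson_concepts, prerequisites=()):
        """Expose the student to a lesson and return the set of newly learned concepts."""
        if not set(prerequisites).issubset(self.known):
            return set()                   # prerequisites not mastered: nothing is learned
        newly_learned = set()
        for concept in lesson_concepts:
            if concept not in self.known and random.random() < self.learning_probability:
                self.known.add(concept)
                newly_learned.add(concept)
        return newly_learned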
4. Experimental results

    We have done the training in a simulated environment by simulating the behavior of the student. For every episode the student starts knowing random concepts, and the system tries to learn what is the next lesson to give. We have used the DQN algorithm as given by [13], using memory replay and a target network. Figure 1 gives the architecture of the target and train networks.

Figure 1: The architecture of the neural network

The hyperparameters used during the training are given in Figure 2.

Figure 2: The hyperparameters used during training/testing
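A compressed sketch of a DQN training loop with experience replay and a periodically updated target network, in the spirit of [13], is given below. The environment interface, network size, fixed epsilon-greedy exploration and all hyperparameter values are illustrative placeholders; they do not reproduce the architecture of Figure 1 or the values of Figure 2.

import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

def build_network(state_dim, num_actions, hidden=64):
    # A small fully connected Q-network; the actual architecture used is the one shown in Figure 1.
    return nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, num_actions))

def train_dqn(env, state_dim, num_actions, episodes=500, gamma=0.99,
              epsilon=0.1, batch_size=32, target_update_every=100, lr=1e-3):
    """Train a Q-network with experience replay and a periodically refreshed target network."""
    q_net = build_network(state_dim, num_actions)
    target_net = build_network(state_dim, num_actions)
    target_net.load_state_dict(q_net.state_dict())        # target network starts as a copy
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    replay = deque(maxlen=10_000)                          # experience replay memory
    steps = 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection (any strategy from Section 2.1 could be used here).
            if random.random() < epsilon:
                action = random.randrange(num_actions)
            else:
                with torch.no_grad():
                    action = int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())
            next_state, reward, done = env.step(action)
            replay.append((state, action, reward, next_state, float(done)))
            state = next_state
            steps += 1
            if len(replay) >= batch_size:
                # Sample a minibatch from the replay memory and take one gradient step.
                s, a, r, s2, d = map(np.array, zip(*random.sample(replay, batch_size)))
                s = torch.as_tensor(s, dtype=torch.float32)
                s2 = torch.as_tensor(s2, dtype=torch.float32)
                with torch.no_grad():                      # bootstrap targets from the target network
                    targets = torch.as_tensor(r, dtype=torch.float32) + gamma * \
                        target_net(s2).max(dim=1).values * (1 - torch.as_tensor(d, dtype=torch.float32))
                preds = q_net(s).gather(1, torch.as_tensor(a, dtype=torch.int64).unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(preds, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            if steps % target_update_every == 0:           # periodically refresh the target network
                target_net.load_state_dict(q_net.state_dict())
    return q_net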

The training is done using different exploration strategies for the same number of episodes. For each of the strategies we give the total reward received for every episode during the training process in Figures 3, 4, 5 and 6.

Figure 3: Reward per episode for random strategy

Figure 4: Reward per episode for greedy strategy

Figure 5: Reward per episode for epsilon-greedy strategy

Figure 6: Reward per episode for Boltzmann strategy

4.1. Testing

    After we performed the training, we tested the performance of each of the learned models by using them in simulations, for 100 episodes, with a student that knows random concepts and has the same learning probability as the one used during the training process. For each of the tests, we show the total reward received and the length of each test episode in Figures 7 to 14.
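The testing procedure can be sketched as a simple evaluation loop. The 100-episode count comes from the description above; the environment interface and the greedy use of the learned network are assumptions made for this sketch.

import torch

def evaluate(env, q_net, episodes=100, max_steps=200):
    """Run the learned policy greedily and record the reward and length of each episode."""
    rewards, lengths = [], []
    for _ in range(episodes):
        state, done, total, steps = env.reset(), False, 0.0, 0
        while not done and steps < max_steps:
            with torch.no_grad():
                q = q_net(torch.as_tensor(state, dtype=torch.float32))
            state, reward, done = env.step(int(q.argmax()))   # always pick the best estimated action
            total += reward
            steps += 1
        rewards.append(total)
        lengths.append(steps)
    return rewards, lengths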


Figure 7: Reward per episode in testing random strategy
Figure 8: Episode length in testing random strategy

Figure 9: Reward per episode in testing greedy strategy

Figure 10: Episode length in testing greedy strategy

Figure 11: Reward per episode in testing epsilon-greedy strategy

Figure 12: Episode length in testing epsilon-greedy strategy

Figure 13: Reward per episode in testing Boltzmann strategy

Figure 14: Episode length in testing Boltzmann strategy

5. Conclusion

    In this work we have compared the performance of different exploration strategies used in training an intelligent tutoring system using reinforcement learning. We took into consideration four strategies: random, greedy, epsilon-greedy and Boltzmann (soft-max). For each of the strategies used, we have considered the reward gained for every episode during training and testing, to evaluate which one performed better. We saw that during the training phase, the random and greedy strategies performed worse.
The reward was negative for every episode, which means that they chose the worst action most of the time. For the random policy this means that it always explores and never exploits the knowledge. For the greedy policy this means that it always tries to exploit its knowledge, but it never explores new actions that may be more profitable. On the other hand, the epsilon-greedy and Boltzmann strategies performed best during the training phase, with the Boltzmann strategy getting slightly higher rewards. These strategies use a combination of exploration and exploitation, which makes them perform better.
    During the testing phase we see that the greedy policy performs worse than every other policy. This shows that the system has not learned anything useful during the training phase with this policy. The random and epsilon-greedy policies performed well during the testing phase, with almost the same reward gained. Even though the random policy performed poorly during the training phase, it did quite well during testing, meaning that the high level of exploration allowed it to learn some good actions. The Boltzmann policy was the best during the testing phase, getting the highest reward values. This shows that this policy learned better which are the best actions to take. Also, comparing the episode lengths during the testing phase, the Boltzmann strategy has the shortest episodes. This shows that it finishes each episode without reaching the episode length limit, meaning that it finishes the episode faster because it takes the right actions.

6. References

[1] Brusilovsky, P. & Peylo, C. (2003). Adaptive and Intelligent Web-based Educational Systems. International Journal of Artificial Intelligence in Education (IJAIED), 13, pp. 159-172. ⟨hal-00197315⟩
[2] Iglesias, A., Martinez, P., & Fernandez, F. (2003). An Experience Applying Reinforcement Learning in a Web-Based Adaptive and Intelligent Educational System. Informatics in Education, 2(2), 223–240. https://doi.org/10.15388/infedu.2003.17
[3] Wenger, E. (1987). Artificial Intelligence and Tutoring Systems. Morgan Kaufmann.
[4] Burns, H. L. & Capps, C. G. (1988). Foundations of intelligent tutoring systems: an introduction. In Foundations of Intelligent Tutoring Systems (eds M. C. Polson & J. J. Richardson). Lawrence Erlbaum, London, pp. 1–19.
[5] Malpani, A., Ravindran, B., & Murthy, H. (2011). Personalized Intelligent Tutoring System using Reinforcement Learning. In Florida Artificial Intelligence Research Society Conference. Retrieved from https://aaai.org/ocs/index.php/FLAIRS/FLAIRS11/paper/view/2597/3105
[6] Martin, K. N., & Arroyo, I. (2004). AgentX: Using Reinforcement Learning to Improve the Effectiveness of Intelligent Tutoring Systems. Intelligent Tutoring Systems, 564–572. https://doi.org/10.1007/978-3-540-30139-4_53
[7] Nasir, M., Fellus, L., & Pitti, A. (2018). SPEAKY Project: Adaptive Tutoring System based on Reinforcement Learning for Driving Exercizes and Analysis in ASD Children. ICDL-EpiRob Workshop on "Understanding Developmental Disorders: From Computational Models to Assistive Technologies", Tokyo, Japan. ⟨hal-01976660⟩
[8] Sarma, B. H. S., & Ravindran, B. (2007). Intelligent Tutoring Systems using Reinforcement Learning to teach Autistic Students. Home Informatics and Telematics: ICT for The Next Billion, 241, 65–78. https://doi.org/10.1007/978-0-387-73697-6_5
[9] Shawky, D., & Badawi, A. (2018). A Reinforcement Learning-Based Adaptive Learning System. The International Conference on Advanced Machine Learning Technologies and Applications (AMLTA2018), 221–231. https://doi.org/10.1007/978-3-319-74690-6_22
[10] Wang, F. (2018). Reinforcement Learning in a POMDP Based Intelligent Tutoring System for Optimizing Teaching Strategies. International Journal of Information and Education Technology, 8(8), 553–558. https://doi.org/10.18178/ijiet.2018.8.8.1098
[11] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd edition). MIT Press.
[12] Barto, A. G., Bradtke, S. J., and Singh, S. P. (1991). Real-time learning and control using asynchronous dynamic programming. University of Massachusetts at Amherst, Department of Computer and Information Science.
[13] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236
[14] Koroveshi, J., & Ktona, A. (2020). Modelling an Intelligent Tutoring System Using Reinforcement Learning. Knowledge International Journal, 43(3), 483–487. Retrieved from https://ikm.mk/ojs/index.php/KIJ/article/view/4745
[15] Koroveshi, J., & Ktona, A. (2021). Training an Intelligent Tutoring System Using Reinforcement Learning. International Journal of Computer Science and Information Technology, 19(3), 10–18. http://doi.org/10.5281/zenodo.4661455