      Learning Graspability of Unknown
       Objects via Intrinsic Motivation

              Ercin Temel1 , Beata J. Grzyb2 , and Sanem Sariel1
                 1
                   Artificial Intelligence and Robotics Laboratory
                       Computer Engineering Department,
                  Istanbul Technical University, Istanbul, Turkey
                        {ercintemel,sariel}@itu.edu.tr
                    2
                      Centre for Robotics and Neural Systems,
                 Plymouth University, Plymouth, United Kingdom
                           beata.grzyb@plymouth.ac.uk



      Abstract. Interacting with unknown objects, and learning and produc-
      ing effective grasping procedures in particular, are challenging problems
      for robots. This paper proposes an intrinsically motivated reinforcement
      learning mechanism for learning to grasp unknown objects. The mech-
      anism uses frustration to determine when grasping of an object is not
      possible. The critical threshold of frustration is dynamically regulated
      by the impulsiveness of the robot. Here, artificial emotions regulate the
      learning rate according to the current task and the performance of the
      robot. The proposed mechanism is tested in a real-world scenario where
      the robot, using the grasp pairs generated in simulation, has to learn
      which objects are graspable. The results show that the robot equipped
      with frustration and impulsiveness learns faster than a robot with standard
      action selection strategies, providing some evidence that the use of arti-
      ficial emotions can shorten the learning time.

      Keywords: Reinforcement Learning, Intrinsic motivation, Grasping un-
      known objects, Frustration, Impulsiveness, Visual scene representation,
      Vision-based grasping


1   Introduction
Robots need effective grasp procedures to interact with and manipulate unknown
objects. In unstructured environments, challenges arise mainly due to uncertainties
in sensing and control, and the lack of prior knowledge and models of objects.
Effective learning methods are essential to deal with these challenges. One classic
approach here is to use reinforcement learning (RL), where an agent actively
interacts with an environment and learns from the consequences of its actions
rather than from being explicitly taught. An agent selects its actions on the basis
of its past experiences (exploitation) and also by making new choices (exploration).
The goal of the agent is to maximize the global reward, and therefore it needs to
rely on actions that led to high rewards in the past. However, if the agent is
too greedy and neglects exploration, it might never find the optimal strategy for
the task. Hence, to find the best way to perform an action, the agent needs to
strike a balance between exploiting its current knowledge and exploring to discover
new knowledge that might lead to better performance in the future.
     We propose a competence-based approach to reinforcement learning in which
exploration and exploitation are balanced while learning to grasp novel objects. In
our approach, the dynamics of balancing exploration against exploitation is
tightly related to the level of frustration. Failures in achieving a new goal may
significantly increase the robot's level of frustration and push it towards
searching for new solutions in order to achieve that goal. However, a prolonged
state of frustration, when no solution can be found, will lead to a state of learned
helplessness, and the goal will be marked as unachievable in the current state
(i.e., the object is not graspable). Simply speaking, an optimal level of frustration
favours more explorative behaviour, whereas a low or high level of frustration
favours more exploitative behaviour. Additionally, we dynamically change the
robot's impulsiveness, which influences how fast the robot gets frustrated and,
indirectly, how much time it devotes to learning a particular task.
     To demonstrate the advantages of our approach, we compare it with three
other action selection methods: the ε-greedy algorithm, the softmax function with
a constant temperature parameter, and the softmax function with a variable
temperature depending on the agent's overall frustration level. The results show
that the robot equipped with frustration and impulsiveness learns faster than a
robot with standard action selection strategies, providing some evidence that the
use of artificial emotions can shorten the learning time.
     The rest of the paper is organized as follows. We first present related work in
the area. Then, we give the details of the learning system including visual pro-
cessing of objects, the RL framework and the proposed action selection strategies.
In the next section, we present the experimental results and then conclude the
paper.


2   Related Work

Our main focus is on learning the graspability of objects. Analytical methods have
previously been proposed for grasping objects [3], [8], [4]. These methods use
contact point locations on the object and the gripper, and estimate friction
coefficients from tactile sensors to compute the required forces [15]. With these
data, grasp stability values or promising grasp positions can be determined.
Another approach to grasping is learning by exploration. In a recent work [6],
grasp successes are associated with 3D object models, which allows the algorithm
to memorize successful grasp configurations for an object. According to that work,
grasping unknown objects is a challenging problem whose difficulty varies with
system complexity; this complexity depends on the chosen sensors, prior knowledge
about the environment, and the scene configuration. In [11], 2D contours are used
to approximate the center of mass of objects for grasping.




    In our work, we use a reinforcement learning (RL) framework for learning and
incorporate competence-based intrinsic motivation to guide the search. The
complexity of reinforcement learning is high in terms of the number of state-
action pairs and the computations needed to determine utility values [14].
Approximate policy iteration methods based on sampling can be used to alleviate
this problem [7]. Imitation learning prior to reinforcement learning [12] is one of
the methods for decreasing the complexity of RL [5]. Furthermore, it is also used
to let robots learn the crucial movement parameters needed to accomplish a task.
    In our work, we use a competence-based approach to intrinsic motivation for
balancing exploration in RL, in which the frustration level of the robot is taken
into account. We further extend this approach by adopting an adaptive frustration
level that depends on the task. Intrinsic motivation has been investigated in
earlier works. Lenat [13] proposes a system built around ”interestingness”, and
Schmidhuber introduces the concept of curiosity for reinforcement learning [19].
Uchibe and Doya [22] also consider intrinsic motivation as a learning objective.
Differing from curiosity and reward functions, Wong [24] points out that an ideal
level of frustration is beneficial for exploration and faster learning. In addition,
Baranes and Oudeyer [1] propose competence-based intrinsic motivation for
learning. The main difference in our work is that impulsiveness [20] is
incorporated into the frustration rate in order to change the learning rate
dynamically based on the task in a real-world robotic environment.


3     Learning to Grasp Unknown Objects

We propose an intrinsically motivated reinforcement learning system for robots
to learn the graspability of unknown objects. The system includes two main phases:
the determination of grasp points on objects, and their experimental evaluation in
the real world (Fig. 1). The first phase includes the methods required to determine
candidate grasp point pairs in simulation. Note that a robot arm with a two-
fingered end effector is selected as the target platform; for this reason, grasp
points are determined as point pairs. In the second phase of the system, the
grasp points determined in the first phase are tested in the real world
through reinforcement learning. The following subsections explain the details of
these processes.


3.1   Visual Representation of Objects

In our system, objects are detected in the scene using an ASUS Xtion Pro
Live RGB-D camera mounted on a linear platform, which interprets the scene for
tabletop manipulation scenarios carried out by a robotic arm. We use a scene
interpretation system that can both recognize known objects and detect unknown
objects in the scene [9]. For unknown object detection, the Organized Point Cloud
Segmentation with Connected Components algorithm [21] from PCL [16] is used.
This algorithm finds and marks connected pixels coming from the RGB-D camera




[Figure 1: the 3D point cloud from the camera is transferred to the simulation
environment (3D edge detection, possible grasp point determination, candidate
grasp pair selection); the candidates are then transferred to the real world for
frustration-based decisions about graspability and experimentation via
reinforcement learning.]

   Fig. 1. Overview of the intrinsically motivated reinforcement learning system.




and finds the outlier 3D edges with the RANdom SAmple Consensus (RANSAC)
algorithm [17]. In this way, the object's center of mass and its edges are detected
and used by the grasp point detection algorithm that finds candidate grasp point
pairs for a two-fingered robotic hand.
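
As an illustration of the segmentation idea only (not PCL's actual implementation),
the minimal Python sketch below groups 4-connected pixels of an organized depth
image whose depth values are close; the depth array and the max_step threshold are
assumptions introduced for the example:

    import numpy as np
    from collections import deque

    def segment_organized_depth(depth, max_step=0.01):
        """Group neighbouring pixels of an organized depth image into segments.

        Illustrative variant of organized connected-component segmentation: two
        4-connected pixels belong to the same segment when their depth values
        differ by less than max_step (metres). Pixels with zero depth are skipped.
        """
        h, w = depth.shape
        labels = -np.ones((h, w), dtype=int)
        next_label = 0
        for seed in zip(*np.nonzero(depth > 0)):
            if labels[seed] != -1:
                continue
            labels[seed] = next_label
            queue = deque([seed])
            while queue:                                   # breadth-first flood fill
                r, c = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < h and 0 <= nc < w and labels[nr, nc] == -1
                            and depth[nr, nc] > 0
                            and abs(depth[nr, nc] - depth[r, c]) < max_step):
                        labels[nr, nc] = next_label
                        queue.append((nr, nc))
            next_label += 1
        return labels                                      # per-pixel segment labels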



3.2    Detection of Candidate Grasp Points in the Simulator


Objects are represented by their center of mass (μ) and 3D edges (H). Candidate
grasp point pairs (ρ = [p1, p2]) are then determined as in Algorithm 1. The
algorithm first determines the reference points: the center of mass and the top and
bottom center points are chosen as references. Based on these points, cross-section
points that are coplanar with each reference point and parallel to the table surface
are determined. Next, the algorithm finds the point closest to the reference point
on the same plane and draws a line through the reference point and this closest
point. It then determines the point opposite the closest one on the same line. This
procedure continues until all points have been tested. The algorithm produces the
candidate grasp pairs (two grasp points with x, y, z values) and the orientation of
each pair with respect to the origin (0,0) of the 2D (x, y) plane. These grasp points
are then tested in the simulator to retain only the feasible ones.
   In Fig. 2, the edges and sample grasp points for six different objects along
with the number of grasp points are presented.




Algorithm 1 Grasp Point Detection (μ, H)
  Input: Object Center Of Mass μ, Edge Point Cloud H
  Output: Grasp Pairs P
  Detect maxZ, minZ and C as reference points ref.
  for each reference point do
    cPoints = findPointsOnTheSamePlane()
    mPoint = findClosestPointToReferencePoint(cPoints)
    slope = findSlope(mPoint, ref)
    for each p ∈ cPoints do
       Pslope = findSlope(mPoint, p)
       if onTheSameLine(Pslope, slope) then
          P ← P ∪ { p, mPoint }
       end if
    end for
  end for
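
A minimal numpy sketch of the idea behind Algorithm 1 is given below; the
tolerances z_tol and slope_tol, as well as the representation of points as
(x, y, z) rows, are assumptions introduced for illustration:

    import numpy as np

    def detect_grasp_pairs(mu, edges, z_tol=0.005, slope_tol=0.05):
        """Candidate grasp pairs for a two-fingered gripper (sketch of Algorithm 1).

        mu:    object centre of mass, shape (3,)
        edges: 3D edge points of the object, shape (N, 3)
        Reference points are the centre of mass and the top/bottom centre points.
        """
        top = edges[np.argmax(edges[:, 2])]
        bottom = edges[np.argmin(edges[:, 2])]
        pairs = []
        for ref in (np.asarray(mu), top, bottom):
            # edge points roughly coplanar with the reference, parallel to the table
            coplanar = edges[np.abs(edges[:, 2] - ref[2]) < z_tol]
            if len(coplanar) < 2:
                continue
            # closest coplanar point to the reference point
            m = coplanar[np.argmin(np.linalg.norm(coplanar[:, :2] - ref[:2], axis=1))]
            slope = np.arctan2(m[1] - ref[1], m[0] - ref[0])   # line through ref and m
            for p in coplanar:
                if np.allclose(p, m):
                    continue
                p_slope = np.arctan2(p[1] - m[1], p[0] - m[0])
                # keep p if it lies (approximately) on the same line, opposite m
                if abs(np.sin(p_slope - slope)) < slope_tol:
                    pairs.append((p.copy(), m.copy()))
        return pairs                                           # list of (p1, p2) pairs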



[Figure 2: for each of the six objects, the detected 3D edges and the corresponding
simulation representation with candidate grasp points; the numbers of detected
grasp points for the six objects are 84, 48, 56, 46, 92, and 116, respectively.]

Fig. 2. Candidate grasp points on unknown objects are determined through a sequence
of processes. Sample points for six objects are illustrated. The first step is 3D edge
detection from 3D point cloud data. The second step is the determination of candidate
grasp point pairs, samples of which are marked with red points on the 3D edges
extracted from the point clouds of the objects. The number of feasible grasp points
for each object is presented.


3.3      Learning When to Give Up Grasping

In the system, the output of the simulation environment is fed to the robotic arm
for real-world experimentation. Intrinsic motivation based on the frustration level,
together with the newly proposed impulsiveness method, is evaluated to speed up
the learning process so that the robot quickly gives up on objects that are not
graspable.




    The main task of the robot is to learn which objects are graspable. We use
a Reinforcement Learning (RL) framework with the Q-learning [23] algorithm and
the softmax action selection [2] strategy. The state space here consists of all grasp
point pairs generated during the simulation phase. A general state S is defined as:

                                S = [μ, ρ, θ, ω, Ov]                            (1)

    where μ is the center of mass of the object, ρ is the selected pair of grasp
points ρ = [p1, p2], θ is the grasp orientation, ω is the approach direction of the
gripper, and Ov is the 3D translation vector of the object during the grasp trial.
A collision between the robotic arm and the object may occur when there is a
trajectory error, which results in a non-zero vector.
    Actions can be represented as follows,

                                  A = [||Rv||, ω]                              (2)

    where ||Rv|| is the slide amount along the x axis and ω represents the approach
vector to the object of interest.
    In our framework, the robot receives the reward value of 10 (Rmax) when the
grasp is successful and 0.1 (Rmin) when the grasp is unsuccessful [10]. The
Q-values are updated according to Eq. 3.

            Q′(s, a) = Q(s, a) + α · [R + γ · max_a′ Q(s′, a′) − Q(s, a)]        (3)
    where Q′(s, a) is the updated Q-value for the state-action pair (s, a), Q(s, a) is
the current Q-value, α is the learning rate, R is the immediate reward after perform-
ing action a in state s, γ is the discount factor, and max_a′ Q(s′, a′) is the maximum
estimate of the optimal future value.
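
    As a small illustration of the update in Eq. 3, the sketch below assumes the
Q-values are stored in a dense numpy array indexed by discrete state and action
ids; the values of α and γ are assumptions, since the paper does not report them:

    import numpy as np

    ALPHA, GAMMA = 0.5, 0.9        # learning rate and discount factor (assumed values)
    R_MAX, R_MIN = 10.0, 0.1       # rewards for a successful / unsuccessful grasp

    def q_update(Q, s, a, success, s_next):
        """One Q-learning step (Eq. 3) after trying grasp action a in state s."""
        r = R_MAX if success else R_MIN
        td_target = r + GAMMA * np.max(Q[s_next])      # R + γ · max_a' Q(s', a')
        Q[s, a] += ALPHA * (td_target - Q[s, a])       # Q'(s,a) = Q(s,a) + α · [...]
        return Q[s, a]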
    We investigate four action selection strategies. The first (and the simplest)
one is the ε-greedy action selection method (M1). Most of the time, this method
selects the action with the highest estimated action value, but once in a while
(with a small probability ε) it selects an action at random, uniformly and
independently of the action-value estimates.
    The second one is the softmax action selection method (M2) with a constant
temperature value (Eq. 4) [2]:

                       P(a)_t = e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}          (4)

    where P(a)_t is the probability of selecting an action a at time step t,
Q_t(a) is the value function for an action a, and τ is the positive parameter called
the temperature that controls the stochasticity of a decision. A high value of the
temperature will cause the actions to be almost equiprobable and a low value
will cause a greater difference in selection probability for actions that differ in
their value estimates.
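
    A minimal sketch of the softmax rule in Eq. 4 is shown below; subtracting the
maximum preference before exponentiating is a standard numerical-stability trick
and is not part of the original formula:

    import numpy as np

    def softmax_action(q_values, tau):
        """Sample an action index according to the softmax rule of Eq. 4."""
        prefs = np.asarray(q_values, dtype=float) / tau
        prefs -= prefs.max()                       # stability only; P(a) is unchanged
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return np.random.choice(len(probs), p=probs)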
    The third strategy (M3) also uses the softmax action selection rule. In this
approach, however, the τ parameter is flexible and changes dynamically in re-
lation to the robot's level of frustration and sense of control [10]. An optimal
level of frustration favours more explorative behavior, whereas a low or high level
of frustration leads to more exploitative behavior. For the purpose of our sim-
ulations, frustration was represented as a simple leaky integrator:

                                df/dt = −L · f + A0                              (5)
    where f is the current level of frustration, A0 is the outcome of the action
(success or failure), and L is the fixed rate of the 'leak'.
    In Eq. 5 the leak rate (L) was fixed and kept at the value 1 for all simula-
tions [10]. Higher values of L cause the frustration rate to increase more slowly
than smaller values of L. This means that a robot with a high value of L
spends more time on exploration and possibly learns faster. Hence, we propose
the fourth method (M4), which builds on this method and changes the value of L
dynamically using an expected utility motivation formula [20]:

                        L = (expectancy · value) / (Z + Γ · (T − t))             (6)
    where expectancy represents the probability of obtaining the highest estimated
action value (as in the greedy action selection method), value refers to the ex-
pected action reward (here value = Rmax), Z is a constant derived from the case
when rewards are immediate, Γ indicates the agent's sensitivity to delay (impul-
siveness), and (T − t) refers to the delay of the reward in terms of "time of
reward" minus "time now".
    Impulsiveness is our main focus for coupling the frustration rate with
competence-based motivation. According to the "frustration - impulse - temper"
triad, a person with high impulsiveness is considered "short tempered", meaning
that they get frustrated quickly, which in turn changes the frustration level that
drives learning behavior. In our proposal, different values of impulsiveness
directly affect the leak rate L in the frustration formula, so the agent's
frustration rate also depends on its impulsiveness.
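
    A sketch of how Eqs. 5 and 6 can be combined is given below. It assumes a
discrete Euler update with one step per grasp trial, a negative sign on the leak
term (as in a standard leaky integrator), and an outcome coding in which a failure
contributes positively to frustration; these details are our assumptions:

    def leak_rate(expectancy, value, impulsiveness, delay, Z=1.0):
        """Dynamic leak L of Eq. 6; Z and the units of delay = (T - t) are assumed."""
        return (expectancy * value) / (Z + impulsiveness * delay)

    def update_frustration(f, outcome, L, dt=1.0):
        """One Euler step of the leaky integrator of Eq. 5: df/dt = -L*f + A0."""
        # outcome A0: e.g. 1.0 for a failed grasp attempt, 0.0 for a successful one
        return f + dt * (-L * f + outcome)

    With this coupling, a highly impulsive agent obtains a smaller L, so its
frustration accumulates more quickly and the critical threshold is reached sooner.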
    Apart from learning how to grasp an object, the robot also needs to learn
whether the target object is graspable at all. The learning for a selected grasp pair
ρ and action a finishes when the overall frustration level becomes equal to or
greater than a certain threshold value. This value is determined based on a
tolerance formula:
                              Tolerance = e^(−||Ov|| · φ)                        (7)
    where ||Ov|| denotes the translation of the object on the table caused by the
collision with the end effector, and φ is the number of trials since the beginning
of learning.
    Additionally, the online learning process may also end when the following
criterion has been met:

                            FrustrationLimit = e^(1/√n)                          (8)

   where n refers to the number of grasp pairs.
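
    The two stopping criteria can be sketched as below; the negative exponent in
Eq. 7 follows our reading that the tolerance shrinks as the object is displaced and
the trial count grows:

    import math

    def tolerance(object_translation, trial_count):
        """Frustration tolerance for the current grasp pair (Eq. 7)."""
        return math.exp(-object_translation * trial_count)

    def frustration_limit(n_grasp_pairs):
        """Overall frustration limit that ends the online learning process (Eq. 8)."""
        return math.exp(1.0 / math.sqrt(n_grasp_pairs))

    # For the block and its 84 candidate grasp pairs, frustration_limit(84) ≈ 1.115,
    # the limit value quoted in Section 4.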




3.4   Impulsiveness and Learning Rate

The main focus of the presented work is investigating the effect that impulsive-
ness has on the frustration level and on learning. The learning rate and the speed
of decision making are important issues in human-robot interaction [18]. For
example, when a robot plays a quick game with a human, it has to learn quickly.
However, when the robot is alone, it can spend relatively more time on explor-
ing different states. By changing the impulsiveness, the robot may dynamically
control its level of frustration and therefore the time devoted to learning a par-
ticular task. Hence, the robot can behave differently in different environments
and for different tasks.


4     Experimental Results

As mentioned before, the candidate grasping points are first determined in sim-
ulation and then transferred to a robotic arm for real-world experimentation.
The V-REP simulator is used, and the Cyton-Veta 7-DOF robot arm by Robai
(shown in Fig. 3) is used as the experimental platform. The reach of the arm is
about 45 cm. In the experiments, we used three objects of different size and shape
(i.e., a small cubic plastic block, a plastic bowling pin and a spherical plastic
ball). We compare the performance of four




Fig. 3. Illustrative examples of grasping three different objects: (a) a successful
lengthwise grasp on the cubic block, which is relatively easy to grasp; (b) a successful
transverse grasp on the plastic bowling pin, which can be grasped from the top but
not for all grasp points; (c) a failed lengthwise grasp on the plastic ball, which cannot
be grasped as it is solid and too large.


different action selection methods discussed in the previous section. A high value
of impulsiveness results in a faster increase in the frustration level (in other words,
in a "short-tempered" agent). For comparison, we use here two different values
of impulsiveness: a low value of 0.01 and a high value of 100. The results
of our experiments support our hypothesis. An agent with low impul-
siveness spends more time on exploration, testing more grasp pair possibilities
than an agent with a higher value of impulsiveness. For demonstration purposes,
we chose three different objects that vary in their graspability properties: a cube
that is relatively easy to grasp, a plastic bowling pin that is easily graspable
but liable to topple over, and finally, a ball that is not graspable at all.
We compare the decision and learning rate of the robot that uses our proposed
strategy (M4) with the one based only on frustration (M3). Fig. 4 shows the robot's
level of frustration in each learning epoch while the robot was learning how to
grasp the block. The 84 possible grasp pairs generated in simulation were used in
the real-world scenario. Since the robot can easily grasp the cube, the frustration
level stays low and the learning process terminates before it reaches the limit
value of 1.115 (computed according to Eq. 8). In the case of the pin (see Fig. 5), the simu-




 Fig. 4. Frustration Rate Changes For Block Grasping with Methods M3 and M4 .



lation generated 116 possible grasp pair candidates that were subsequently used
by the robotic arm. Since the pin is quite light, the arm pulls it down for some
grasp pairs. When the pin falls down, the frustration threshold is decreased for
the related grasp pairs according to Eq. 7. Hence, the robot learns that these
grasp pairs should be eliminated from the set and immediately proceeds to test
another grasp pair. While for some grasp pairs grasping the pin was possible,
the robot was not able to grasp the ball with any of the grasp pairs. The ball
was made of a hard plastic material and quite light, so every attempt to grasp it
resulted in the ball rolling over on the scene (Fig. 3(c)). After each trial, the
robot's tolerance for frustration decreased rapidly, causing the robot to switch
to another grasp pair. With each failure, the overall frustration level was rising
and quickly exceeded the tolerance threshold (which at the same time was being
decreased). Although 92 grasp pairs were transferred to the real-world scenario,
the robot learned after only a few steps that the object is not graspable.




                                 Page 106 of 171
  Fig. 5. Frustration Rate Changes For Pin Grasping with Methods M3 and M4 .




  Fig. 6. Frustration Rate Changes For Ball Grasping with Methods M3 and M4 .




    Fig. 7. Trial Counts for the Three Selected Objects and Each Action Selection Method.



Fig. 7 shows the comparison of the results for all four strategies of action selection.
The frustration-based action selection methods require a lower number of trials
to learn the graspability of the objects compared to the standard softmax action
selection with a fixed temperature parameter and ε-greedy action selection. The
agent with a higher value of impulsiveness performs slightly better than the agent
with a low value.


5   Conclusion
We have presented an intrinsically motivated reinforcement learning system
for learning the graspability of novel objects. Intrinsic motivation is provided by
frustration-based action selection methods during learning, and tolerance values
are determined based on the impulsiveness of the robot. Our claim is that impul-
siveness can be adjusted based on the task that the robot is executing. We have
analyzed this mechanism on a robotic arm learning the graspability of different-
shaped objects. Our results reveal that the intrinsic motivation helps the robot
learn faster. Furthermore, the decision on graspability is made earlier by taking
impulsiveness into account. Our future work includes extending the experiment
set and investigating the impulsiveness parameters in detail for different domains
with varying time constraints.


Acknowledgment
This research is funded by a grant from the Scientific and Technological Research
Council of Turkey (TUBITAK), Grant No. 111E-286. TUBITAK’s support is
gratefully acknowledged. We thank Burak Topal for his contribution to the robotic
arm movement and Mehmet Biberci for his effort on the vision algorithms.


References
 1. Baranes, A., Oudeyer, P.Y.: Maturationally-constrained competence-based intrin-
    sically motivated learning. In: Development and Learning (ICDL), 2010 IEEE 9th
    International Conference on. pp. 197–203. IEEE (2010)
 2. Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction. MIT Press (1998)
 3. Bicchi, A.: On the closure properties of robotic grasping. The International Journal
    of Robotics Research 14(4), 319–334 (1995)
 4. Buss, M., Hashimoto, H., Moore, J.B.: Dextrous hand grasping force optimization.
    Robotics and Automation, IEEE Transactions on 12(3), 406–418 (1996)
 5. Chebotar, Y., Kroemer, O., Peters, J.: Learning robot tactile sensing for object
    manipulation
 6. Detry, R., Baseski, E., Popovic, M., Touati, Y., Kruger, N., Kroemer, O., Peters,
    J., Piater, J.: Learning object-specific grasp affordance densities. In: Development
    and Learning, 2009. ICDL 2009. IEEE 8th International Conference on. pp. 1–7.
    IEEE (2009)
 7. Dimitrakakis, C., Lagoudakis, M.G.: Rollout sampling approximate policy itera-
    tion. Machine Learning 72(3), 157–171 (2008)
 8. Ding, D., Liu, Y.H., Wang, S.: The synthesis of 3-d form-closure grasps. Robotica
    18(01), 51–58 (2000)




 9. Ersen, M., Ozturk, M.D., Biberci, M., Sariel, S., Yalcin, H.: Scene interpreta-
    tion for lifelong robot learning. In: The 9th International Workshop on Cogni-
    tive Robotics (CogRob 2014) held in conjunction with ECAI-2014. Prague, Czech
    Republic (2014)
10. Grzyb, B., Boedecker, J., Asada, M., Del Pobil, A.P., Smith, L.B.: Between frus-
    tration and elation: Sense of control regulates the intrinsic motivation for motor
    learning. In: Lifelong Learning (2011)
11. Huebner, K., Ruthotto, S., Kragic, D.: Minimum volume bounding box decompo-
    sition for shape approximation in robot grasping. In: Robotics and Automation,
    2008. ICRA 2008. IEEE International Conference on. pp. 1628–1633. IEEE (2008)
12. Kober, J., Peters, J.: Learning motor primitives for robotics. In: Robotics and
    Automation, 2009. ICRA’09. IEEE International Conference on. pp. 2112–2118.
    IEEE (2009)
13. Lenat, D.B.: AM: An artificial intelligence approach to discovery in mathematics
    as heuristic search. Tech. rep., DTIC Document (1976)
14. Peters, J.: Machine learning of motor skills for robotics (2007)
15. Platt, R.: Learning grasp strategies composed of contact relative motions. In: Hu-
    manoid Robots, 2007 7th IEEE-RAS International Conference on. pp. 49–56. IEEE
    (2007)
16. Rusu, R.B., Cousins, S.: 3D is here: Point Cloud Library (PCL). In: IEEE Inter-
    national Conference on Robotics and Automation (ICRA). Shanghai, China (May
    9-13 2011)
17. Rusu, R.B., Cousins, S.: 3D is here: Point Cloud Library (PCL). In: Robotics and
    Automation (ICRA), 2011 IEEE International Conference on. pp. 1–4. IEEE (2011)
18. Sauser, E.L., Billard, A.G.: Biologically inspired multimodal integration: Interfer-
    ences in a human-robot interaction game. In: Intelligent Robots and Systems, 2006
    IEEE/RSJ International Conference on. pp. 5619–5624. IEEE (2006)
19. Schmidhuber, J.: A possibility for implementing curiosity and boredom in
    model-building neural controllers (1991)
20. Steel, P., König, C.J.: Integrating theories of motivation. Academy of Management
    Review 31(4), 889–913 (2006)
21. Trevor, A., Gedikli, S., Rusu, R., Christensen, H.: Efficient organized point cloud
    segmentation with connected components. Semantic Perception Mapping and Ex-
    ploration (SPME) (2013)
22. Uchibe, E., Doya, K.: Finding intrinsic rewards by embodied evolution and con-
    strained reinforcement learning. Neural Networks 21(10), 1447–1455 (2008)
23. Watkins, C.J., Dayan, P.: Q-learning. Machine learning 8(3-4), 279–292 (1992)
24. Wong, P.T.: Frustration, exploration, and learning. Canadian Psychological Re-
    view/Psychologie canadienne 20(3), 133 (1979)



