<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Techniques for Service Robotics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Romisaa Ali</string-name>
          <email>romisaa.ali@polito.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer and Control Engineering (DAUIN), Politecnico di Torino</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Doctoral project supervised by Marcello Chiaberge (Department of Electronics and Telecommunications, DET)</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Politecnico di Torino</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this thesis, we propose the development of an optimal control system for autonomous robots. Our design aims to efficiently guide the robot, determining the best possible route to its destination. We leverage the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm to direct the robot. By utilizing a precise navigation system, we can ascertain the robot's position in real time and manage its movements by adjusting its components. This algorithm, which integrates principles from deep learning and reinforcement learning, offers superior optimization capabilities for robot navigation and control. Notably, our approach facilitates navigation optimization without relying on a pre-existing map and ensures collision avoidance throughout the journey.</p>
      </abstract>
      <kwd-group>
        <kwd>Navigation system</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Reinforcement learning</kwd>
        <kwd>Optimization capabilities</kwd>
        <kwd>Collision avoidance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Reinforcement learning (RL) has become a key method for learning control policies in
service robotics, where robots must navigate and adapt effectively to dynamic environments.
As service robots take on more tasks, their navigation must be flexible and adaptive [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. A central challenge in service robotics is producing navigation
plans that can cope with the unpredictability of everyday settings, and the choice of RL method
can substantially change a robot’s performance and adaptability. In this paper, we discuss two
significant algorithms: TRPO (Trust Region Policy Optimization) and PPO (Proximal Policy
Optimization) [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Both algorithms utilize the Trust Region Method (TRM) for optimization
[
        <xref ref-type="bibr" rid="ref7">7, 8</xref>
        ], which efficiently refines policies within a designated region, using the
Kullback-Leibler divergence as a tool for gauging differences between policies. Among the latest
advancements in the field [9, 10, 11, 12, 13], two algorithms have stood out: Twin Delayed Deep
Deterministic Policy Gradient (TD3) and Soft Actor-Critic (SAC) [14, 15, 16]. This research
examines how well policy gradient methods [17], especially TD3 and SAC, work in planning
paths for service robots. We use a freely available online platform and dedicated control
environments for testing, evaluating how robust the learned policies are, whether they transfer
to different situations, and how well the robots move in new places. The paper is organized as
follows: Section 1 establishes the significance of the TRPO, PPO, TD3, and SAC algorithms.
Section 2 offers a firsthand account of the author’s PhD journey, detailing three experiments
that explore and compare these algorithms. Lastly, Section 3 outlines a forward-looking plan,
pinpointing challenges and questions aimed at optimizing deep reinforcement learning models.
      </p>
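      <p>The trust-region machinery above gauges policy change with the Kullback-Leibler divergence. As an illustrative sketch (not code from this work), the closed-form KL divergence between two diagonal-Gaussian action policies, the common case in continuous control, can be computed as follows:</p>

```python
import numpy as np

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(p || q) between diagonal-Gaussian policies.

    mu_*, sigma_*: per-action-dimension means and standard deviations.
    Trust-region methods keep this quantity small when moving from the
    old policy p to the candidate policy q.
    """
    mu_p, sigma_p = np.asarray(mu_p, float), np.asarray(sigma_p, float)
    mu_q, sigma_q = np.asarray(mu_q, float), np.asarray(sigma_q, float)
    return float(np.sum(
        np.log(sigma_q / sigma_p)
        + (sigma_p**2 + (mu_p - mu_q)**2) / (2.0 * sigma_q**2)
        - 0.5
    ))

print(gaussian_kl([0.0], [1.0], [0.0], [1.0]))  # identical policies -> 0.0
print(gaussian_kl([0.0], [1.0], [1.0], [1.0]))  # shifted mean -> 0.5
```

      <p>TRPO constrains this divergence below a fixed threshold at each policy update, whereas PPO replaces the hard constraint with a clipped surrogate objective.</p>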
    </sec>
    <sec id="sec-3">
      <title>2. PhD Research Journey: From Deep Reinforcement Learning Review to Comparative Experiments</title>
      <p>At the beginning of my PhD research, I conducted an extensive literature review of recent
advancements in deep reinforcement learning, paying particular attention to state-of-the-art
(SOTA) policy gradient methodologies. This review also involved a thorough selection of
SOTA algorithms based on established criteria. My research then progressed to a series
of experiments: the initial experiment compared the performance of the Trust
Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) algorithms in the
context of robot control. Subsequently, in the second experiment I investigated the efficiency of
the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm in optimizing navigation
strategies. The final experiment compared TD3 with another recent SOTA method, the Soft
Actor-Critic (SAC), particularly in navigating unseen motion control environments[18].</p>
      <sec id="sec-4-1">
        <title>2.1. First Experiment</title>
        <p>In this experiment I aimed to compare the effectiveness of the TRPO and PPO algorithms in
controlling robots within two specific environments, ANT and Humanoid, with the goal of
directing the robot to move forward quickly. This experiment provides a basis for upcoming,
deeper explorations in this field; however, it presented several limitations.</p>
        <sec id="sec-4-1-1">
          <title>2.1.1. The limitations</title>
          <p>Reliability in Real-world Scenarios: The study encountered potential issues in achieving
consistent results, suggesting that the solutions might not be reliable when implemented in real
scenarios.</p>
          <p>Overfitting: There was a notable risk of the algorithms fitting too closely to the training
data, especially in the Humanoid environment, which may affect their performance in unseen
or varied situations.</p>
          <p>Transferability: The results were tailored for ANT and Humanoid robots, limiting their
broader application to different robotic designs or environments.</p>
          <p>Evaluation metrics: The primary metrics used for evaluation were average returns and
training time. However, important aspects such as algorithm robustness, safety concerns, and
potential scalability were not evaluated.</p>
          <p>Absence of ROS Integration: The study did not utilize the ROS (Robot Operating System)
framework, a standard in many robotic applications. This omission could pose challenges when
trying to integrate or deploy the solutions on platforms that rely on ROS[19].</p>
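          <p>For context on the evaluation-metrics limitation above: average return is conventionally computed by rolling out the trained policy for several episodes and averaging the summed rewards. A minimal sketch, assuming a Gym-style environment interface (the reset/step names are illustrative, not this study’s code):</p>

```python
def average_return(env, policy, episodes=10):
    """Roll out `policy` in `env` and average the per-episode total reward.

    Assumes a Gym-style interface: env.reset() -> obs and
    env.step(action) -> (obs, reward, done, info); `policy` maps obs -> action.
    """
    totals = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)
```

          <p>Robustness, safety, and scalability would each need their own metrics on top of this scalar score, which is precisely the gap noted above.</p>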
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>2.2. Second Experiment</title>
        <p>The goal of this experiment is to assess how well the TD3 algorithm optimizes navigation
policies in three varied environments: static, dynamic-wall, and dynamic-box. We assess its
adaptability, effectiveness in handling diverse challenges, and ability to generalize across
different environments. Expanding upon the insights from our first experiment, this experiment
addresses several of its limitations; specifically, the model was trained within
the ROS framework.</p>
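        <p>For readers unfamiliar with TD3’s mechanics, two of its signature components, clipped double-Q targets and target-policy smoothing, can be sketched without any framework. This is a hedged, scalar illustration of the update targets, not the experiment’s actual training code:</p>

```python
import random

def td3_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    """Clipped double-Q target: bootstrap from the smaller of the two
    target-critic estimates, curbing the overestimation bias TD3 addresses."""
    if done:
        return reward
    return reward + gamma * min(q1_next, q2_next)

def smoothed_action(target_action, noise_std=0.2, noise_clip=0.5,
                    low=-1.0, high=1.0, rng=random.Random(0)):
    """Target-policy smoothing: add clipped Gaussian noise to the target
    action so the critic learns over a small neighborhood, not a point."""
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    return max(low, min(high, target_action + noise))

print(td3_target(1.0, q1_next=10.0, q2_next=8.0))  # 1.0 + 0.99 * 8.0 ≈ 8.92
```

        <p>TD3’s third trick, delaying actor and target-network updates relative to critic updates, amounts to updating the policy only every few critic steps.</p>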
        <sec id="sec-4-2-1">
          <title>2.2.1. The limitations</title>
          <p>Limited Training Scenarios: If the range of training environments is too narrow, the learned
policies may face difficulties when introduced to new or different settings.</p>
          <p>Overfitting: When an algorithm is trained on a limited number of environments or datasets,
it may become specialized, which can affect its performance in unfamiliar scenarios.</p>
          <p>Incomplete Training: The training process might require additional time to fully converge
and identify the optimal policy.</p>
          <p>Resource Limitations: Having enough computing power, including CPU and GPU, along
with memory and storage, is vital. A lack of these resources can affect both the learning and
the deployment of the algorithms.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>2.3. Third Experiment</title>
        <p>In our third experiment, our research efforts are conducted within the Robot Operating System
(ROS) framework. One of the major adjustments made in this experiment, compared to the
previous ones, is the expansion of the training environments. This strategic adjustment was
based on challenges identified in our previous experiments. By incorporating a larger number
of environments, we aim to enhance the adaptability of our model from training scenarios
to test environments. The primary objective of this experiment is to reevaluate the TD3
algorithm, which was a significant component of our second experiment. We are particularly
interested in comparing its performance with the SAC algorithm. SAC, known for its
high-entropy policy methodology, offers a different approach to robotic navigation optimization.
This comparison, in both training and testing stages, aims to understand the effectiveness and
robustness of these algorithms when applied to robotic navigation within the ROS framework.
In this third experiment, we worked to reduce some of the limitations identified in the
second experiment and simulated the robot’s start and end points. As shown in Figure 1, the
robot navigates from the starting point to the destination.</p>
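        <p>SAC’s high-entropy methodology mentioned above augments the Bellman target with an entropy bonus. A hedged scalar sketch of the soft target follows (the coefficients are common illustrative defaults, not the experiment’s settings):</p>

```python
def soft_q_target(reward, q_next, logp_next, alpha=0.2, gamma=0.99, done=False):
    """Soft Bellman target used by SAC: the next-state value is the critic
    estimate minus alpha * log pi(a'|s'), i.e. plus an entropy bonus that
    keeps the policy stochastic and exploratory."""
    if done:
        return reward
    return reward + gamma * (q_next - alpha * logp_next)

# A less probable next action (more negative log-probability) raises the
# target, rewarding exploration; TD3's target carries no such bonus.
print(soft_q_target(0.0, q_next=5.0, logp_next=-1.0))  # 0.99 * (5.0 + 0.2) ≈ 5.148
```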
        <sec id="sec-4-3-1">
          <title>2.3.1. The limitations</title>
          <p>Time to Convergence: The model requires an extended period to converge and ascertain
optimal policies.</p>
          <p>Computational Resources: Addressing the convergence time limitation necessitates the
deployment of heightened computational resources.</p>
          <p>Distribution Shift: A difference between the training and testing environments; the test
scenarios may have characteristics or distributions of states that the model has not encountered
during training.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3. PhD Final Year: A Comprehensive Plan and Challenges</title>
      <p>Deep reinforcement learning (DRL) is rapidly advancing, and there are key challenges we need
to address. In this section, we consider the important elements for improving the DRL model.
In light of the unresolved issues in my research, during my participation in the Doctoral
Consortium I aim to gather answers to the following key questions, with the guidance of my
mentors and feedback from attendees:</p>
      <sec id="sec-5-1">
        <title>Transferability with Minimal Distribution Shift</title>
        <p>How can we optimize the packages to simplify the transition from training to testing, aiming
to minimize distribution shift from training environments to testing, and ultimately to
deployment scenarios, especially when each of these stages possesses its own distinct
characteristics and complexities?</p>
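        <p>One hedged way to make distribution shift measurable is to compare empirical state-visitation histograms between training and testing runs; the binning and smoothing below are arbitrary illustrative choices, not part of this research:</p>

```python
import numpy as np

def state_shift_kl(train_states, test_states, bins=10, eps=1e-6):
    """Estimate KL(test || train) between 1-D state-visitation
    distributions via shared-bin histograms. Larger values flag a
    bigger train-to-test distribution shift."""
    lo = min(np.min(train_states), np.min(test_states))
    hi = max(np.max(train_states), np.max(test_states))
    p, _ = np.histogram(test_states, bins=bins, range=(lo, hi))
    q, _ = np.histogram(train_states, bins=bins, range=(lo, hi))
    p = (p + eps) / (p + eps).sum()  # smooth, then normalize to probabilities
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 1000)
assert state_shift_kl(train, rng.normal(3.0, 1.0, 1000)) > \
       state_shift_kl(train, rng.normal(0.0, 1.0, 1000))
```

        <p>Such a score could be tracked while curating training environments, shrinking the gap before deployment.</p>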
      </sec>
      <sec id="sec-5-2">
        <title>Adaptability with Minimal Distribution Shift</title>
        <p>How can we incorporate advanced techniques within the deep reinforcement learning model
that enable it to adapt in real time to environmental changes, ensuring effective navigation?</p>
        <p>Generalizability: What techniques can we employ in deep reinforcement learning models to
enhance navigation effectively in unseen environments?</p>
        <p>Efficient Training in DRL Models: How can I optimize deep reinforcement learning models
to significantly reduce both training time and the need for computational resources, while still
ensuring their ability to adapt to unfamiliar environments? Additionally, which strategies are
most effective in achieving this goal?</p>
      </sec>
      <sec id="sec-5-3">
        <title>Addressing Memory and Speed Challenges</title>
        <p>How can we design training scenarios that eliminate the need for detailed simulations,
thereby cutting down on memory usage and possibly accelerating the speed at which models
learn? This approach aims to overcome the high memory consumption and inefficiencies often
seen with traditional simulation methods.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. ACKNOWLEDGMENT</title>
      <p>Deep appreciation is conveyed to Professor Marcello Chiaberge for his invaluable supervision
and guidance throughout this research. Warm thanks go to the SCUDO Office at Politecnico
di Torino for their essential support and assistance. Recognition must also be given to REPLY
Concepts company, especially to Mr. Maurizio Griva and Mr. Simone Voto, for their invaluable
input and collaboration. Finally, sincere gratitude is extended to the Department of DAUIN at
Politecnico di Torino, as well as to the dedicated staff of PIC4SeR, for their cooperation and
support in the execution of this work. The contributions of Mr. Zifan Xu from the University
of Texas at Austin were particularly instrumental to the success of this work and are deeply
acknowledged.</p>
      <p>[8] K. Lange, MM Optimization Algorithms, Technical Report, SIAM, 2016. https://www.siam.org/Publications/Books/Call-for-Book-Proposals/MM-Optimization-Algorithms.
[9] Z. Xu, B. Liu, X. Xiao, A. Nair, P. Stone, Benchmarking reinforcement learning techniques for autonomous navigation, CoRR abs/2210.04839 (2022). https://doi.org/10.48550/arXiv.2210.04839.
[10] Z. Xu, B. Liu, X. Xiao, A. Nair, P. Stone, Benchmarking reinforcement learning techniques for autonomous navigation, in: IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, May 29 - June 2, 2023, IEEE, 2023, pp. 9224–9230. https://doi.org/10.1109/ICRA48891.2023.10160583.
[11] Y. He, Y. Kim, HEV energy management strategy based on TD3 with prioritized exploration and experience replay, in: American Control Conference, ACC 2023, San Diego, CA, USA, May 31 - June 2, 2023, IEEE, 2023, pp. 1753–1758. https://doi.org/10.23919/ACC55779.2023.10156220.
[12] J. Wu, Q. M. J. Wu, S. Chen, F. Pourpanah, D. Huang, A-TD3: an adaptive asynchronous twin delayed deep deterministic for continuous action spaces, IEEE Access 10 (2022) 128077–128089. https://doi.org/10.1109/ACCESS.2022.3226446.
[13] Y. Tan, Y. Lin, T. Liu, H. Min, PL-TD3: A dynamic path planning algorithm of mobile robot, in: IEEE International Conference on Systems, Man, and Cybernetics, SMC 2022, Prague, Czech Republic, October 9-12, 2022, IEEE, 2022, pp. 3040–3045. https://doi.org/10.1109/SMC53654.2022.9945119.
[14] K. Nakhleh, M. Raza, M. Tang, M. Andrews, R. Boney, I. Hadzic, J. Lee, A. Mohajeri, K. Palyutina, SACPlanner: Real-world collision avoidance with a soft actor critic local planner and polar state representations, in: IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, May 29 - June 2, 2023, IEEE, 2023, pp. 9464–9470. https://doi.org/10.1109/ICRA48891.2023.10161129.
[15] J. B. Martin, R. Chekroun, F. Moutarde, Learning from demonstrations with SACR2: soft actor-critic with reward relabeling, CoRR abs/2110.14464 (2021). https://arxiv.org/abs/2110.14464.
[16] L. Chavali, T. Gupta, P. Saxena, SAC-AP: soft actor critic based deep reinforcement learning for alert prioritization, in: IEEE Congress on Evolutionary Computation, CEC 2022, Padua, Italy, July 18-23, 2022, IEEE, 2022, pp. 1–8. https://doi.org/10.1109/CEC55065.2022.9870423.
[17] S. Fujimoto, H. van Hoof, D. Meger, Addressing function approximation error in actor-critic methods, in: Proceedings of the 35th International Conference on Machine Learning, ICML 2018, PMLR, 2018, pp. 1582–1591. http://proceedings.mlr.press/v80/fujimoto18a.html.
[18] B. Siciliano, O. Khatib, Robotics and the handbook, in: B. Siciliano, O. Khatib (Eds.), Springer Handbook of Robotics, Springer Handbooks, Springer, 2016, pp. 1–10. https://doi.org/10.1007/978-3-319-32552-1_1.
[19] K. Lange, MM Optimization Algorithms, Technical Report, SIAM, 2016. https://www.siam.org/Publications/Books/Call-for-Book-Proposals/MM-Optimization-Algorithms.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning - an introduction, Adaptive computation and machine learning</article-title>
          , MIT Press,
          <year>1998</year>
          . URL: https://www.worldcat.org/oclc/37293240.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sewak</surname>
          </string-name>
          ,
          <source>Deep Reinforcement Learning - Frontiers of Artificial Intelligence</source>
          , Springer,
          <year>2019</year>
          . https://doi.org/10.1007/978-981-13-8285-7.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sehgal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>La</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Louis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning using genetic algorithm for parameter optimization</article-title>
          , CoRR abs/1905.04100 (
          <year>2019</year>
          ). http://arxiv.org/abs/1905.04100.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Kveen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Abu-Dakka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. I.</given-names>
            <surname>Grøtli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Gravdahl</surname>
          </string-name>
          ,
          <article-title>Addressing sample efficiency and model-bias in model-based reinforcement learning</article-title>
          , in: 21st IEEE International Conference on Machine Learning and Applications, ICMLA
          <year>2022</year>
          , Nassau, Bahamas, December 12-14, 2022, IEEE, 2022, pp. 1–6. https://doi.org/10.1109/ICMLA55696.2022.00009.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Moritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>Trust region policy optimization</article-title>
          ,
          <source>CoRR abs/1502.05477</source>
          (
          <year>2015</year>
          ). http://arxiv.org/abs/1502.05477.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Klimov</surname>
          </string-name>
          ,
          <article-title>Proximal policy optimization algorithms</article-title>
          ,
          <source>CoRR abs/1707.06347</source>
          (
          <year>2017</year>
          ). http://arxiv.org/abs/1707.06347.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohapatra</surname>
          </string-name>
          ,
          <article-title>Trust region methods for deep reinforcement learning</article-title>
          , https://medium.com/analytics-vidhya/trust-region-methods-for-deep-reinforcement-learning-e7e2a8460284, 2023.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>