<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Path Planning of Architectural Robot Based on Active Inference Algorithm 1</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xun Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tongying Guo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Shenyang Jianzhu University</institution>
          ,
          <addr-line>Shenyang</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>86</fpage>
      <lpage>93</lpage>
      <abstract>
        <p>In order to solve the problem of slow convergence of traditional reinforcement learning algorithm, a quasi-reinforcement learning algorithm Active Inference algorithm is proposed to solve the path planning problem of construction transportation robot. Simplify the complexity of the construction site and construct a two-dimensional grid map. In the improved Frozen-lake environment, the same starting point and end point are set in the same building map, and the algorithm and the Q-Learning algorithm with different parameters are simulated respectively. The results show that in the case of a given step size, the three algorithms can converge to get the equivalent optimal solution, while the Active Inference algorithm converges faster and obtains the optimal solution early.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Active Inference algorithm</kwd>
        <kwd>reinforcement learning</kwd>
        <kwd>path planning</kwd>
        <kwd>architectural robot</kwd>
        <kwd>QLearning</kwd>
        <kwd>grid map</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Path planning [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is one of the research hotspots of mobile robot technology. When the map is
known or unknown, providing the optimal solution of the collision-free path for the robot is an
important method in the navigation technology of the mobile robot, and it is also the most important for
the mobile construction robot with autonomous operation function. Path planning methods can be
divided into two categories: global path planning and local path planning [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The path planning
algorithm proposed in this paper belongs to the former, that is, path planning under the condition of
known environmental information (map). Local path planning is that when all or part of the
environmental information is unknown, the robot carries sensors to collect environmental information and
carry out path planning in a dynamic environment. Path planning algorithms mainly include
traditional algorithms based on search and graph theory, heuristic algorithms, intelligent bionics algorithms,
reinforcement learning algorithms and deep learning algorithms, as well as the fusion and
improvement of the above algorithms[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Traditional reinforcement learning algorithm is a kind of artificial intelligence algorithm which
does not need prior knowledge. Under the premise of a given reward function, it is brought into the
environment for many iterations to optimize the strategy. The Active Inference algorithm proposed in
this paper is an artificial intelligence algorithm similar to reinforcement learning, which optimizes the
steps of artificially constructing reward function compared with traditional reinforcement learning
algorithm, so as to eliminate the potential influence of reward function with personal preference on the
results. By constructing a two-dimensional raster map to simulate the construction site and importing
the map into the Frozen-lake environment, the algorithm proposed in this paper is verified in the
improved Frozen-lake environment.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Environment Construction</title>
      <p>
        The raster method is used to establish two-dimensional raster map to construct environmental
information. When the two-dimensional plan of the building is known, using raster map for path
planning has the following advantages. First of all, the building plan can be easily converted into a raster
map by means of image processing, such as binarizing the pixels of the building plan to get a
blackand-white pixel map. Black and white pixels represent obstacles (walls, materials, etc.) and open
spaces (free movable space, Free Space). Grayscale image processing can also be carried out to
indicate the mobility of the space referred to by the pixel and whether it can be passed by the robot
according to the gray value; secondly, the grid map can represent the complex and unstructured
environment in a relatively simple way, which can effectively reduce the complexity of environmental
modeling, and it is more convenient to judge the relative position of obstacles and robots. Therefore,
the grid method has been widely used in path planning [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Taking a typical building construction site
as an example, as shown in figure 1, areas 1 to 5 on the left are different material storage sites, and
areas An and B are construction areas, build a grid map with 12 grids in length and width, white color
block as a free movable space area, black color block as a wall or obstacle, collision should be
avoided and a two-dimensional grid map is generated as shown in figure 2.
      </p>
      <p>A
B</p>
      <p>The mobile mode of the robot is defined as the step-by-step mode of adjacent grids, and the
movable distance is one grid at a time, as shown in figure 3. When the robot is located at O point, there are
eight directions from A to H. combined with the actual situation, only four directions are selected, that
is, A, B, C and D are shown in the picture.</p>
      <p>The raster map is represented in the program in the form of an array matrix. G (i ,j) = 0 indicates
that the grid area is a movable space region, that is, a white color block that can be passed by the robot,
and G (i ,j) = 1 means that the grid area is a wall obstacle, that is, a black color block that should
avoid collision.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Active Inference Algorithm</title>
      <p>
        Active Inference is a theory proposed by the theoretical neuroscientist Karl Friston to speculate,
understand and describe the perception, plan and behavior of the brain according to probability, which
provides a comprehensive perspective to explain the coordinated relationship among brain, cognition
and behavior [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The core of Active Inference is to transform behavioral action into perception based
on a single command (Imperative) that minimizes free energy. The theory is considered to be the
"first rule" for understanding the brain and behavior. It also explains how biology or artificial
intelligence (Artificial) carries on cognition and reasoning in unsteady (Non-stationary) environment [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        As an indispensable part of reinforcement learning, reward signal is not indispensable in Active
Inference algorithm. Reward is regarded as some kind of Observation result of our Preference, and even
when there is no reward at all, the behavior of Agent is learned through preference. Compared with
the Q-Learning algorithm in the same test, the Active Inference agent can infer the behavior in the
non-reward environment, and learn similar considerable preference (Prior Preferences) to the
algorithms that contain reward signals, such as Q-Learning, by putting zero preference above reward. In
the Active Inference algorithm, the interaction between the agent and the environment is determined
by the action sequence that minimizes the expected free energy, rather than the expectation of the
reward signal in reinforcement learning. In standard reinforcement learning, the reward function defines
the agent's subject goal and allows the agent to learn how to best act or get the optimal solution in the
environment [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Reward in reinforcement learning can be seen as a scalar signal, and any Goal or
Purpose is achieved by maximizing the sum of the cumulative expected values of the scalar signal [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The
reward result of a single study is output to the action and behavior of the study. The repeated learning
cycle for many times, gradually approaches the maximum value of the sum of reward expectations,
which is the final output of the reinforcement learning. The difference of reinforcement learning is
that the difference of scalar signals will not affect the final output of Active Inference algorithm, that
is, different reward functions in the same environment will not lead to differences in learning results.
In other words, any implicit reward related to results in reinforcement learning comes from the
considerable features observed by the agent, rather than the features obtained from the environment
through learning (reasoning), including the non-objective parts.
      </p>
      <p>The variational free energy formula is derived from a simple result generation model, which is the
starting point of the derivation process of Active Inference algorithm. For example, in figure 4, the
random variable s ∈ S is introduced to represent the specific hidden state of the environment, where
S is a finite set of all possible hidden states, and the random variable o ∈ O is introduced to represent
the specific results received by the agent. The O is a finite set of all possible results. The generation
model abstractly believes that there is a real hidden state s * ∈ S in the world. The o is generated by
the generation process, and the agent deduces s from the generation model. Hidden state is a
combination of agent-related features (such as location, etc.), and the result is information from the
environment (such as feedback, speed, reward, etc.). Through the reverse process of mapping the hidden state
to the result, the agent explains how the result is caused by the hidden state, that is, the inversion
Bayesian model for reasoning.</p>
      <p>Environment
Generative process: true state</p>
      <p>P*(s*,o) s*</p>
      <p>Agent
probability
o of outcome Generative model:</p>
      <p>P(o|s) P(s,o)=P(o|s)P(s)</p>
      <p>The generation model is defined as a partially observable Markov decision process, which contains
the joint probability P(o|s) consisting of the result o∈O and the state s∈S. The joint probability can
be decomposed into likelihood function P(o|s) and internal state P(s).</p>
      <p>P(o，s)=P(o|s)P(s)
P(o) =  P(o | s)P(s)</p>
      <p>s∈S</p>
      <p>In order for the agent to converge stably, we need to edge all the states that may lead to a given
result, decomposed by equation 2.
(2)</p>
      <p>
        Because the dimensions of hidden state and action sequence space may be very large, and the
calculation of the above formula may be extremely complex, we consider using variational method to
approximate P(o), which is more feasible and easier to deal with. Therefore, we introduce an
important concept, difference[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (Surprise or Surprisal, which can also be called Shannon information
quantity Shannon Information Content), which is used to measure the uncertainty of information. In
order to distinguish it from state S, we use C to represent difference. Define the probability that the
difference is the negative logarithm of outcome, such as formula 3. The P represents the probability
distribution.
      </p>
      <p>C(o) = − log P(o)
(3)</p>
      <p>
        The variational free energy F is defined as the upper limit of the difference, which is derived from
the Jensen inequality. In the literature of variational reasoning, it is usually called the (negative)
evidence lower bound(ELBO)[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>C(o) = − log  P(o,s)</p>
      <p>s∈S
≤ − Q(s) log
s∈S</p>
      <p>Q(s)
F =  Q(s) log
s∈S</p>
      <p>Q(s)</p>
      <p>P(o, s)
= DKL [Q(s) || P(s | o)] − log P(o)
(6)
(9)
(5)</p>
      <p>Q(s) is a variational distribution. The concrete expression of variational free energy F is obtained
by further derivation.
(7)</p>
      <p>The relationship between the difference C and the variational free energy F is shown below, and
the KL divergence is always greater than zero.</p>
      <p>C(o)=F − DKL [Q(s) || P(s | o)]
(8)</p>
      <p>Variational free energy provides us with a way to perceive the environment, that is, to determine
the environment state s from the output o, which is also the process of Inference to the environment.
But at present, there is still a lack of action in the above equation, so we introduce a new variable and
define the action strategy as π ∈ fl, where π is a finite set of policies that may be allowed, so that the
agent can transfer the state directly in the hidden state. On the condition that formula 6 and formula 7
bring in strategy π, we further derive the available formula as follows:</p>
      <p>F = DKL [Q(s | π) || P(s | o, π)] − log P(o)
F(τ, π) =  Q(sτ|π)Q(sτ−1|π) log</p>
      <p>sτπ P(oτ , sτ|sτ−1,π) (10)</p>
      <p>
        These strategies are the actions that the agent will take. In the process of interacting with the
environment (that is, the agent learns in the environment), the action sequence affects the result to get the
reasoning cognition of the environment, while the environment and the result affect the next action
sequence. Through repeated iterations with predictability, the Active Inference algorithm obtains an
action sequence composed of strategy Pi to minimize the free energy on the premise of keeping the
difference minimized to maintain the intelligent body steady state[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The learning result output under
this condition is the optimal result we expect.
      </p>
      <p>Q(sτ|π)
4</p>
    </sec>
    <sec id="sec-4">
      <title>Simulation and results</title>
      <p>Frozen-lake is an environment for training and testing agents launched by OpenAI in 2016. The
previous initial version of Frozen-lake, as shown in figure 5, contains a grid map of 3x3 and three
states, including a frozen lake (gray grid), a hole (grid with irregular blue shapes), and a destination
(grid with pentagram). The starting point is fixed to the upper left corner of the map, through the
frozen lake and avoid falling into the ice hole, reaching the destination is victory.</p>
      <p>The agent runs in an improved Frozen-lake environment, in which four states are defined: starting
point, ice, wall, and end point. Of all the states, only the "wall" is the unsafe state, corresponding to
the "hole" in the initial version. Each training (Episode) starts from the "starting point" , and "ice" is a
grid through which the agent can safely pass. When the agent moves to the grid where the "wall" is
located, the task fails and the training ends and returns to the initial position "starting point" to start
the next round of training. The agent starts from the "starting point" and moves safely to the "end
point" through the "ice". The task completes the end of the training and gets 100 points. This scoring
standard is not the reward function of the agent, but the score set to facilitate comparison with other
reinforcement learning algorithms. After the training, the Frozen-lake environment gives a score
according to the learning results of the agent, and the agent with faster convergence speed can get a
higher score. This paper chooses Q-Learning algorithm citation to compare with Active Inference
algorithm. In order to compare the effects of the two algorithms intuitively and efficiently, the
Frozenlake environment is improved and the raster map of figure 2 is optimized to create a 12-12 grid map
as shown in figure 6.</p>
      <p>In this environment, the generation model of Active Inference agent is defined as follows, as
shown in figure 7, there are four action states: upper, lower, left and right, and are not allowed to stay
in place, and each motion state will move the agent to a different location. The starting point is
defined at position 14, the end point is defined at position 131, the dark grid is a dangerous grid wall
(the task fails when the agent moves to the "wall" grid), and the light grid is safe grid ice; we limit the
maximum number of training movements to 50. The action state controls the agent to switch between
position states, for example, at the starting position 39, the agent can move to position 27 (up),
position 51 (down), position 38 (to the left) or position 40 (to the right). Position 27 and position 40 are
walls, so the agent is judged to be the end of the round after moving to the above two positions, and
the task fails. Both position 38 and position 51 are safe areas and can be moved again. The agent
infers and observes two types of results at each point in time: the current grid state and the score.
left
up
down
right</p>
      <p>This paper compares the performance of Q-Learning algorithm in the same Frozen-lake
environment. Perhaps the more advanced and complex reinforcement learning algorithm will perform better
under the same conditions. In order to compare with the Naive reinforcement learning algorithm
under the standard framework, the Q-Learning algorithm is chosen. Q-Learning algorithm uses ε-Greed
strategy, define greed = 0.1, learning rate A = 0.5, discount factor γ = 0.99. At the same time, in order
to compare the influence of artificially defined reward function on agent learning results, two different
reward functions are defined under this condition. The first is the reward function with cost, that is,
one point is deducted for each move; the other is cost-free, and no points are deducted for a single
move and moving to a dangerous area. The Q-Learning algorithm and Active Inference algorithm
with two different reward functions both reward 100 points when moving to the end point, and the
number of moves in a single round is limited to 50 steps, and is not allowed to stay in place.</p>
      <p>The above three algorithms are run in the improved Frozen-lake environment, and each agent
undergoes no more than 200 rounds of training. Within this framework, all three agents can begin to
converge within 120 rounds of training and successfully reach the end point. After training, all agents
can output several equivalent shortest paths.</p>
      <p>
        In all the results of the output, taking the three paths shown in figure 8 as an example, the three
paths all pass through 18 grids, purple and yellow paths all have six turns, while the blue path has
only three turns. Combined with the actual situation, the less the number of turns, the shorter the
working time of the robot, and the higher the transportation stability. At present, none of the three
algorithms can distinguish the advantages and disadvantages of the three paths, so they can only select the
shortest path from the number of steps, and the 18-grid path is the shortest path under the map.
Among them, the average score of Active Inference agent in training is 81, the average score of
QLearning agent with cost is 77, while the performance of non-cost Q-Learning agent is the worst, with
an average score of 72. The convergence speed of Active Inference agent is faster than that of
QLearning agent. Because Q-Learning algorithm will lead to serious dimension disaster in large map
environment[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and Active Inference algorithm can avoid dimension disaster very well, it can be
inferred that the latter will perform better when realizing path planning function in a larger range of
maps.
5
      </p>
    </sec>
    <sec id="sec-5">
      <title>Concluding remarks</title>
      <p>Active Inference algorithm is a relatively new algorithm that has been applied in practice in recent
years, and it is also the first time to apply it to the robot raster map path planning task. There are
similarities between the reinforcement learning algorithm based on Bayesian model and the Active
Inference algorithm, which are different from the reinforcement learning algorithm in some key points. In
reinforcement learning, the goal is defined by the reward function, that is, the scalar signal from the
environment. On the contrary, the target in the Active Inference algorithm is defined by the agent's a
priori preference for the result. In contrast, the Q-Learning algorithm seems to be more sensitive to
the shaping of the reward function, and the Q-Learning agents under different reward functions show
different performance, which also shows that the artificial reward function is subjective and directly
affects the output of the agent. It is not ruled out that through reasonable algorithm optimization or
setting a more reasonable reward function, or using a better algorithm, reinforcement learning agents
may perform similar to or better than Active Inference.</p>
      <p>Through simple experiments, we make a preliminary comparison between Active Inference
algorithm and naive reinforcement learning algorithm. The research content of this paper is relatively
limited, for example, the time consumption of turning has not been distinguished and only the length of
the path has been considered. Under the background of this application, we can explore the potential
of Active Inference algorithm more deeply. For example, more complex coding methods can be used
to give more meaning to the environment, path planning can be carried out in semantic maps with
more complete environmental content, or more complex tasks can be learned in non-stationary or
even unstructured environments. This is still a more prominent problem at present.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Ma Zhiliang M.
          <article-title>: Interpretation of the Development report of Informatization in China's Construction Industry (</article-title>
          <year>2021</year>
          )
          <article-title>Application and Development of Intelligent Construction, China's construction informationization</article-title>
          . Vol.
          <volume>24</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>15</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Qi</surname>
            <given-names>Jialin D.:</given-names>
          </string-name>
          <article-title>Research on obstacle Detection and obstacle avoidance Strategy of Building handling Robot, Xi'an University of Architectural Science and</article-title>
          <string-name>
            <surname>Technology. 2020</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Shen</surname>
            <given-names>Mingxin D.</given-names>
          </string-name>
          :
          <article-title>Research and implementation of Autonomous Positioning and Navigation system for Indoor Construction Robot</article-title>
          , University of Electronic Science and Technology.
          <source>2020</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Sun</given-names>
            <surname>Shangjie</surname>
          </string-name>
          , Jiang Shuhal, Cui Haohe J.: Cui Haohe.
          <article-title>Path Planning of Forest Fire fighting Robot based on Deep Learning, Forest engineering</article-title>
          . Vol.
          <volume>36</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>57</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Friston</surname>
            <given-names>K. J.</given-names>
          </string-name>
          :
          <article-title>Active inference: A process theory</article-title>
          , MIT PRESS.
          <year>2017</year>
          (
          <volume>1)</volume>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Friston</surname>
            <given-names>K. J.</given-names>
          </string-name>
          :
          <article-title>Deep temporal models and active inference</article-title>
          ,
          <source>Neuroscience and Biobehavioral Reviews</source>
          . Vol.
          <volume>90</volume>
          ,
          <year>2018</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Paper D M.</surname>
          </string-name>
          <article-title>: An Introduction to Reinforcement Learning. 2022</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Sajid</surname>
            <given-names>N J.</given-names>
          </string-name>
          :
          <source>Active Inference: Demystified and Compared, Neural Computation</source>
          . Vol.
          <volume>90</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>39</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Zénon</surname>
            <given-names>Alexandre J.:</given-names>
          </string-name>
          <article-title>An information-theoretic perspective on the costs of cognition</article-title>
          ,
          <source>Neuropsychologia</source>
          .
          <year>2018</year>
          , pp.
          <fpage>123</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>David</surname>
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Variational Inference: A Review for Statisticians</article-title>
          ,
          <source>Journal of the American Statistical Association</source>
          . Vol.
          <volume>518</volume>
          ,
          <year>2017</year>
          , pp.
          <fpage>112</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Rafal Bogacz J.:</surname>
          </string-name>
          <article-title>A tutorial on the free-energy framework for modelling perception and learning</article-title>
          ,
          <source>Journal of Mathematical Psychology</source>
          . Vol. B,
          <year>2017</year>
          , pp.
          <fpage>76</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Song</surname>
            <given-names>Qisong D.:</given-names>
          </string-name>
          <article-title>Research on Optimization of path Planning algorithm for Mobile Robot</article-title>
          , Guizhou University,
          <year>2021</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>