Intelligent Battery Management for Aquatic Drones Based on Task-Difficulty-Driven POMDPs

Alberto Castellini*, Jason J. Blum, Domenico D. Bloisi, and Alessandro Farinelli

Verona University, Department of Computer Science, Strada Le Grazie 15, 37134 Verona, Italy, {name.surname}@univr.it

Abstract. Autonomous drones typically have limited battery capacity, which presents a major challenge for persistent, long-term deployments. In the context of water quality monitoring, aquatic drones must navigate rivers and lakes to collect real-time data, and their battery consumption is heavily influenced by dynamic environmental factors such as water current and wind. Intelligent battery management is therefore a requirement for the success of these missions. We propose a formalization of the battery management problem in terms of Partially Observable Markov Decision Processes (POMDPs), in which the agent modulates the energy provided to its propellers. The drone can estimate the "difficulty" of traversing an interval by observing the environment. An important aspect of this formalization is the prediction of the difficulty of future intervals from current observations by exploiting a priori geometric information, a mechanism we call Task Difficulty Propagation (TDP). We investigate variations of this approach and analyze their performance.

1 Introduction

Autonomous aquatic drones are increasingly used for water monitoring. A key factor for the success of data acquisition campaigns is mission awareness, which comprises three main elements [2]: knowledge of mission objectives, internal self-situational awareness, and external self-situational awareness. Given the limited on-board energy and the requirements of long-term autonomy and reliability, the ability to forecast mission-specific energy consumption is necessary to complete missions reliably, smoothly, and efficiently. In this work we focus on battery management, which is an aspect of self-situational awareness.

Existing approaches for estimating battery consumption in autonomous systems can be divided into two main categories, namely decision models [3,8,2] and predictive models [4,11,9]. All these approaches deal with different aspects of a common problem: making optimal decisions for battery management in uncertain environments. In this paper, we propose a methodology based on Partially Observable Markov Decision Processes (POMDPs) [6], a decision model that combines the strengths of Hidden Markov Models with those of Markov Decision Processes by capturing both dynamics that depend on unobserved states and the effects of decisions over time.

We apply our method to the following case study: given i) a path made of four perpendicular segments (Figure 1.c), in which each segment has a specific degree of difficulty (i.e., a measure of the forward speed achieved for a given amount of energy consumption), and ii) an initial amount of battery energy, the agent has to decide, at each time instant, how much engine power to apply in order to minimize the time needed to reach the end of the path and to maximize the probability of completing the mission before battery depletion. To this end, POMDPs are extended with a novel simulation strategy called Task Difficulty Propagation (TDP), which is based on Markov Random Fields (MRFs) and allows us to represent the dependencies between segment difficulties and to propagate this information in order to improve the belief over the difficulties of the entire path.
First experiments show promising enhancements to mission performance. Future developments aim at using the method on the INTCATCH [5] drones (Fig. 1.a).

Fig. 1: a) INTCATCH system for autonomous monitoring of catchments; b) real path from a water monitoring campaign; c) shape and cosine similarity matrices for a square path; d) POMDP model for battery management in water monitoring drones.

2 Methods

A POMDP is a tuple (S, A, O, T, Z, R, γ), where S is a finite set of states, A is a finite set of actions, Z is a finite set of observations, T: S × A → Π(S) is the state-transition model, O: S × A → Π(Z) is the observation model, R: S × A → ℝ is the reward function and γ ∈ [0, 1) is the discount factor. The goal of an agent based on POMDPs is to maximize its expected total reward E[Σ_{t=0}^{∞} γ^t R(s_t, a_t)] by choosing the best action a_t in each state s_t at time t.

Figure 1.d shows the proposed POMDP for battery management as a graph, where states are represented by circles, actions by rectangles and conditional dependencies between states by arrows. The states are: i) battery voltage interval B ∈ {b0, b1, b2, b3}, where b0 represents an empty battery; ii) speed interval V ∈ {v1, v2, v3}; iii) distance to path end D ∈ {d0, d1, ..., d8}, where d8 is the starting interval of the path (i.e., the maximal distance) and d0 is the end point of the path (i.e., the goal); iv) time interval T ∈ {t1, t2, t3, t4, t5}, where t1 is the time interval in which the path starts; v) current segment S ∈ {s1, s2, s3, s4}; vi) difficulty of segments F1, F2, F3, F4 with Fj ∈ {L, M, H}, where L=low, M=medium, H=high, and difficulties may depend on water flow or other environmental factors; vii) difficulty of the current segment Fc ∈ {L, M, H}; viii) engine power P ∈ {L, M, H}, which corresponds to a uniform discretization of the combined effort of the drone's propellers. States B, V, D, T and S are observable, while states F1, ..., F4 and Fc are hidden, hence the model has mixed observability [10]. Rewards depend on engine power (i.e., -0.01 for low power, -0.02 for medium power and -0.05 for high power), time of arrival (i.e., +500 for arriving at time t5, +1000 for arriving at time t4, +1500 for arriving at time t3, +2000 for arriving at time t2, +2500 for arriving at time t1) and battery depletion before arrival (i.e., -5000). The discount factor is γ = 0.95.

The five hidden variables can assume three possible values each (i.e., L, M or H), for a total of 3^5 = 243 configurations of the hidden state (e.g., f = (L, M, H, L, L)). The belief b_t provides the probability over all possible configurations at time t. It is a vector (b_t^1, ..., b_t^243) such that b_t^j ≥ 0 and Σ_{j=1}^{243} b_t^j = 1. We set the initial belief b_0 to the uniform distribution (1/243, ..., 1/243).

In order to represent the dependencies between segment difficulties we define two elements: i) a cosine similarity matrix, which provides a measure of geometric similarity between pairs of segments in the path (see matrix K in Figure 1.c), and ii) a pairwise MRF to compute a joint probability function over configurations of segment difficulties (see the graph in Figure 1.c). The MRF is used to factorize the joint probability of segment difficulties as a product of potential functions over the maximal cliques of the graph.
The joint probability is then written as p(f|θ) = (1/Z(θ)) Π_{q∈Q} ψ_q(f_q|θ_q), where f is a vector of difficulties (e.g., (H, M, L, L)), θ is a parametrization of the MRF, Q is the set of (maximal) cliques of the graph, ψ_q(f_q|θ_q) is a potential function for clique q and Z(θ) is a normalization factor called the partition function. We use a simplified version of MRF, called pairwise MRF, in which the parametrization is restricted to the edges of the graph rather than to the maximal cliques, and we conveniently express the potentials as exponentials (Boltzmann distribution). In this way the product of potentials in p(f|θ) can be computed by summing the energies of the individual edges and exponentiating the result. Since our difficulty variables are discrete, we represent the energy functions as tables of (non-negative) numbers representing the relative "compatibility" between the different assignments of segment difficulties. In particular, given a pair of segments (s_i, s_j), i, j ∈ {1, 2, 3, 4}, having cosine similarity equal to 1, 0 or -1, the energy functions of the corresponding pair of difficulties (f_i, f_j), f_i, f_j ∈ {L, M, H}, are defined as E_1(v, w), E_0(v, w) and E_{-1}(v, w), where v = I(f_i), w = I(f_j) and I(·) is the index function. To compute the energy function of pairs of segments having generic cosine similarity k ∈ [-1, 1] we interpolate the parameters in E_1(v, w), E_0(v, w) and E_{-1}(v, w) by quadratic polynomials.

The joint probability computed by the MRF is finally used to update the beliefs generated step-by-step by the TDP simulator according to the formula b̄_t^j = α · b_t^j + (1 − α) · b_t^j · p(f^j), where t ∈ N is the time step, j ∈ {1, ..., 243} is the index of the difficulty configuration (i.e., hidden state), α ∈ [0, 1] is a weighting factor, b_t^j is the probability of the j-th difficulty configuration provided by the standard simulator, p(f^j) is the joint probability of the j-th difficulty configuration provided by the MRF, and b̄_t^j is the probability of the j-th difficulty configuration provided by the TDP simulator.

3 Results

We implemented the POMDP model in a POMDPX XML file [1] and computed a policy for this model using the pomdpsol function of the SARSOP solver [7] in the APPL Toolkit [1]. Then we simulated the model 100 times using the pomdpsim function of the APPL Toolkit, thus obtaining 100 time evolutions of the system. In all these simulations the true hidden state was manually fixed to f* = (H, H, L, L). We call these time evolutions standard (STD) simulations, since they were computed by the standard simulator and do not consider the path geometry to improve the belief over segment difficulties. Each simulation has 300 time steps and stores information about the evolution of the observable state variables, the belief over the hidden state, the actions and the rewards.

Subsequently, we modified the pomdpsim simulator in two ways: i) by forcing the belief to the true hidden state (i.e., the probability distribution has all the mass on the true state and zero probability for the other states); we call this simulator oracle (ORC) since it has complete information about the true hidden state; ii) by implementing the TDP approach presented above; we call this simulator TDP since it considers path geometry to improve the belief update over segment (i.e., task) difficulties.
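To make the TDP modification concrete, the following Python sketch puts together the three ingredients described in Section 2: the cosine similarity matrix K of a square path, the pairwise MRF with quadratically interpolated energy tables, and the reweighting of the belief by p(f). It is a minimal illustration under stated assumptions, not the implementation used in the experiments: the energy-table values, the fully connected edge set, the segment direction vectors, the weighting factor α and the final renormalization of the belief are placeholders chosen here for the example, and the sketch tracks only the four segment difficulties (3^4 = 81 configurations), leaving out Fc.

import itertools
import numpy as np

# Direction vectors of the four segments of a square path traversed in order
# (assumed orientation; only the pairwise angles matter).
directions = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
K = unit @ unit.T                       # K[i, j] = cosine similarity of s_i, s_j

# Difficulty levels L, M, H are indexed 0, 1, 2 (the index function I(.)).
# Illustrative "compatibility" tables for cosine similarity 1, 0 and -1:
# a larger value means a more compatible pair of difficulties.
E_plus1 = np.array([[2.0, 1.0, 0.0],    # parallel segments: similar difficulty
                    [1.0, 2.0, 1.0],
                    [0.0, 1.0, 2.0]])
E_zero = np.ones((3, 3))                # perpendicular segments: uninformative
E_minus1 = E_plus1[::-1, :]             # antiparallel segments: opposite difficulty

def energy(k, v, w):
    """E_k(v, w) obtained by quadratic interpolation through k = -1, 0, 1."""
    a = 0.5 * (E_plus1[v, w] + E_minus1[v, w]) - E_zero[v, w]
    b = 0.5 * (E_plus1[v, w] - E_minus1[v, w])
    c = E_zero[v, w]
    return a * k * k + b * k + c

# Joint probability p(f) over the 3^4 = 81 segment-difficulty configurations,
# with potentials exp(E) on every pair of segments (assumed edge set).
configs = list(itertools.product(range(3), repeat=4))
edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
log_pot = np.array([sum(energy(K[i, j], f[i], f[j]) for (i, j) in edges)
                    for f in configs])
p_f = np.exp(log_pot - log_pot.max())   # subtract the max for numerical stability
p_f /= p_f.sum()                        # division by the partition function Z

def tdp_update(belief, p_f, alpha=0.5):
    """TDP reweighting b̄^j = α·b^j + (1-α)·b^j·p(f^j); the renormalization at
    the end is an extra step added here to keep the belief a distribution."""
    updated = alpha * belief + (1.0 - alpha) * belief * p_f
    return updated / updated.sum()

belief = np.full(len(configs), 1.0 / len(configs))   # uniform initial belief
belief = tdp_update(belief, p_f, alpha=0.5)
print(configs[int(np.argmax(belief))])               # most compatible configuration

In the actual experiments this reweighting is applied at every step inside the modified pomdpsim simulator, with b_t provided by the standard belief update.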
We performed 100 simulations with the ORC and TDP simulators, respectively, and compared the results with those achieved by the STD simulator in terms of belief evolution, average number of steps to reach the goal, and average number of battery failures.

Figure 2.a shows the evolution of the beliefs in the three simulation cases: columns represent the different simulation strategies, namely STD, TDP and ORC, while rows represent segments 1, 2, 3 and 4 of the square path. Each chart shows the average belief (over 100 simulations) for a particular simulation strategy in a specific segment. A consequence of the improved belief update strategy in TDP is an improvement in performance in terms of (reduced) number of steps to reach the end of the path. This is observable in Figure 2.b, which shows the average number of steps required by each approach to reach the end of segments s1, s2, s3 and s4. We observe that using TDP (blue line) the agent is able to reach the end of the path with fewer steps than using STD (red line) and with a similar failure rate (i.e., the percentage of runs in which the battery is depleted before the end of the path). ORC has the best performance, with 96.4 steps on average to reach the end of the square and a failure rate of 24%. The second best performance is that of TDP, with 102.1 steps and a failure rate of 22%. Both approaches outperform STD, which needs 111.6 steps and has a failure rate of 22%.

A key extension of the proposed approach concerns scaling to more complex scenarios. To this end we are currently exploring the use of Partially Observable Monte Carlo Planning (POMCP) methods [12].

Acknowledgments

This work is partially funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 689341.

Fig. 2: a) Time evolution of the belief state. Red lines identify the true hidden state (H, H, L, L). Asterisks in the notation H∗∗∗ stand for any difficulty value. The beliefs of all simulators evolve towards the true hidden state, but ORC outperforms TDP, and TDP outperforms STD. b) Number of steps needed to reach the end of each segment. Averages are computed over 100 simulations. ORC (green) reaches the end first, then TDP and finally STD.

References

1. APPL Toolkit. http://bigbird.comp.nus.edu.sg/pmwiki/farm/appl/.
2. V. Berenz, F. Tanaka, and K. Suzuki. Autonomous battery management for mobile robots based on risk and gain assessment. Artificial Intelligence Review, 37(3):217–237, 2012.
3. F. Dressler and G. Fuchs. Energy-aware operation and task allocation of autonomous robots. In Proc. of the 5th Int. Workshop on Robot Motion and Control, pages 163–168, 2005.
4. A. Hamza and N. Ayanian. Forecasting battery state of charge for robot missions. In SAC, pages 249–255, 2017.
5. INTCATCH H2020 EU project website. http://www.intcatch.eu/.
6. L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artif. Intell., 101(1-2):99–134, 1998.
7. H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, 2008.
8. M. Lee, M. Tarokh, and M. Cross. Fuzzy logic decision making for multi-robot security systems. Artificial Intelligence Review, 34(2):177–194, 2010.
9. J. LeSage and R. Longoria. Characterization of load uncertainty in unstructured terrains and applications to battery remaining run-time prediction. J. Field Rob., 30(3):472–487, 2013.
10. S. C. W. Ong, S. W. Png, D.
Hsu, and W. S. Lee. Planning under uncertainty for robotic tasks with mixed observability. Int. J. of Robotics Research, 29(8):1053–1068, 2010.
11. A. Sadrpour, J. Jin, and A. G. Ulsoy. Mission energy prediction for unmanned ground vehicles. In ICRA, pages 2229–2234, 2012.
12. D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems 23, pages 2164–2172, 2010.