<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Free-Energy Advantage Functions for Policy Transfer to Noisy Environments with Safety Constraints</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Pierre</forename><surname>Haritz</surname></persName>
							<email>pierre.haritz@tu-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Chair of Artificial Intelligence</orgName>
								<orgName type="department" key="dep2">Faculty of Computer Science</orgName>
								<orgName type="institution">TU Dortmund University</orgName>
								<address>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Lamarr Institute for Machine Learning and Artificial Intelligence</orgName>
								<address>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Thomas</forename><surname>Liebig</surname></persName>
							<email>thomas.liebig@cs.tu-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Chair of Artificial Intelligence</orgName>
								<orgName type="department" key="dep2">Faculty of Computer Science</orgName>
								<orgName type="institution">TU Dortmund University</orgName>
								<address>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Lamarr Institute for Machine Learning and Artificial Intelligence</orgName>
								<address>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Free-Energy Advantage Functions for Policy Transfer to Noisy Environments with Safety Constraints</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">7556ADA5300CD260BF6C02F4A885E4D5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>reinforcement learning</term>
					<term>transfer learning</term>
					<term>safety</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Training acting agents to control complex live systems on the system itself is often infeasible, either due to high cost or the potential dangers involved. In this paper, we take a step towards identifying ways to evaluate the transferability of models for the class of constrained Reinforcement Learning problems. Furthermore, we present an approach based on free-energy advantage functions that improves adaptability, and in turn transferability, for constrained Reinforcement Learning problems, and we show that it increases the performance of a baseline algorithm, CPO, with regard to safety constraints in noisy environments.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>AI systems can have significant real-world impact, and if not designed and deployed with safety in mind, they can cause harm to individuals, organizations, or society as a whole. Ensuring safety is crucial to prevent accidents, unintended consequences, and malicious uses of AI. When trained models are deployed to large-scale industrial applications, unstable live systems can cause economic or other damage. Because of the high complexity, cost, and potential danger of training live systems from scratch, these models are usually trained on historical or simulation data, which may or may not accurately reflect the actual deployment environment. In some instances, knowledge of the environment dynamics is only partially available, and algorithms need to handle situations with a degree of uncertainty. Classically, in control environments, robustness can be achieved with Model Predictive Control approaches ( <ref type="bibr" target="#b0">[1]</ref>) when the plant dynamics are known.</p><p>Reinforcement Learning (RL) is a machine learning paradigm that comprises a variety of algorithmic approaches for sequential decision-making. Recently, RL has become a promising way to solve sequential decision-making tasks in marketing, gaming, and control domains such as robotics and autonomous driving, where the safety and trustworthiness of the agent are important factors.</p><p>We argue that in real-world applications requiring safety guarantees, RL methods that transfer well can better satisfy given safety thresholds.</p><p>Transfer learning is an established concept in areas such as image classification and natural language processing ([2]), with the goal of reducing the training time of Machine Learning models and improving their performance. 
In this paper, we first give an overview of how Transfer Learning is interpreted in Reinforcement Learning and discuss the benefits of transferability in constrained Reinforcement Learning. Our contributions can be stated as follows:</p><p>• We propose criteria to evaluate policy transfer in constrained RL.</p><p>• We present a method, based on free-energy advantage functions, for improving safety performance after transferring pre-trained policies to a noisy environment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background and Related Work</head><p>In this section, we will introduce the mathematical framework for the problem setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Reinforcement Learning</head><p>Reinforcement Learning problems can typically be modeled as a Markov Decision Process (MDP) 𝑀 = (𝑆, 𝐴, 𝑇 , 𝛾 , 𝑅) with a state space 𝑆, an action space 𝐴, a transition probability function 𝑇 ∶ 𝑆 × 𝐴 × 𝑆 → [0, 1], a discount factor 𝛾 ∈ [0, 1] and a reward function 𝑅 ∶ 𝑆 × 𝐴 → ℝ.</p><p>To extend this to safety-critical problems, one possibility is to introduce a constraint cost function 𝐶 ∶ 𝑆 × 𝐴 → ℝ, analogous to the reward function, and a safety threshold 𝑐 ∈ ℝ. We define a Constrained Markov Decision Process (from now on referred to as CMDP) 𝑀 𝐶 = (𝑆, 𝐴, 𝑇 , 𝛾 , 𝑅, 𝐶, 𝑐). The expected discounted constraint cost of a policy 𝜋 ∶ 𝑆 → 𝐴 with 𝜋 ∈ Π, for the set of all policies Π and a trajectory 𝜏 = (𝑠 0 , 𝑎 0 , 𝑠 1 , 𝑎 1 , … ), is</p><formula xml:id="formula_0">𝐽 𝐶 (𝜋) = 𝔼 𝜏 ∼𝜋 [∑ ∞ 𝑡=0 𝛾 𝑡 𝐶(𝑠 𝑡 , 𝑎 𝑡 )].</formula><p>Let Π 𝐶 = {𝜋 ∈ Π ∶ 𝐽 𝐶 (𝜋) ≤ 𝑐} be the set of policies that satisfy the constraint 𝑐. The optimal policy is then 𝜋 * = arg max 𝜋∈Π 𝐶 𝐽 (𝜋).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In real-life applications of Reinforcement Learning, environment dynamics, especially state transitions, can be unknown. We therefore introduce a generalization of the MDP model by assuming transition probabilities 𝑇 ⋆ 𝑠,𝑎 ∈ Δ 𝑆 for finite states and actions and probability simplex Δ 𝑆 ⊂ ℝ 𝑆 + . A common way to learn the objective under unknown transition probabilities is to maximize a lower bound on the return.</p></div>
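The discounted constraint cost 𝐽_C and the feasibility check defining Π_C can be estimated from sampled trajectories; the following Monte-Carlo sketch for a single trajectory is our own illustration (the function names are not from the paper):

```python
def discounted_constraint_cost(costs, gamma):
    """Single-trajectory estimate of J_C(pi): the discounted sum of
    constraint costs C(s_t, a_t) along a sampled trajectory tau."""
    return sum((gamma ** t) * c for t, c in enumerate(costs))


def is_feasible(costs, gamma, c_threshold):
    """A policy belongs to Pi_C if its (estimated) discounted constraint
    cost stays at or below the safety threshold c."""
    return discounted_constraint_cost(costs, gamma) <= c_threshold
```

In practice 𝐽_C would be averaged over many rollouts; a single trajectory only gives an unbiased sample of the expectation.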
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Transfer Learning in the Reinforcement Learning Context</head><p>In a mathematical sense, given a source domain 𝑀 S and a target domain 𝑀 T , Transfer Learning (TL) learns an optimal policy 𝜋 * for 𝑀 T by incorporating both external information ℐ S from the source and internal information ℐ T gathered from 𝑀 T . The optimal policy can be written as</p><formula xml:id="formula_1">𝜋 * = arg max 𝜋 𝔼 𝑥∼𝜇,𝑎∼𝜋 [𝑄 𝜋 𝑀 (𝑥, 𝑎)]</formula><p>for an initial state distribution 𝜇. Taylor and Stone <ref type="bibr" target="#b2">[3]</ref> highlight the benefits of using transfer methods in RL tasks and categorize the measurements as follows:</p><p>• Performance improvement of the initial policy when transferring an agent from a source task to a target task. • Performance improvement of the final learned policy of an agent on a target task through transfer. • The total cumulative reward gained by a transfer strategy compared to a non-transfer strategy. • The ratio of the total reward accumulated by the transfer learner to the total reward accumulated by the non-transfer learner. • The reduction in learning time needed by the agent to achieve a pre-specified performance level via knowledge transfer.</p><p>The literature ( <ref type="bibr" target="#b3">[4]</ref>) mentions a variety of TL approaches that fall under this category: In Imitation Learning, the agent is trained to mimic the policy of a source agent, called the expert. This allows training without access to feedback from the environment. A framework for Imitation Learning in partially observable settings based on the Free-Energy Principle has been proposed in <ref type="bibr" target="#b4">[5]</ref>. In cases where the reward signal is available, Learning from Demonstrations (LfD) is a possible way of training an agent. 
The way agents combine their knowledge (inter-agent or intra-agent) in Cooperative Multi-Agent RL can also be described as a form of TL.</p><p>In TL, domains can be described by MDPs, and any of their components can differ between the source and target domain. Consider state spaces 𝑆 S and 𝑆 T . Depending on the problem, any of the relations 𝑆 S ⊂ 𝑆 T , 𝑆 S ≡ 𝑆 T or 𝑆 S ⊃ 𝑆 T might hold. The same relations apply analogously to the action spaces 𝐴 S and 𝐴 T . Since both state and action spaces can differ, reward functions can also be defined differently for the two domains. Ultimately, trajectories can differ for problems where a goal can be reached in different ways (e.g., path-finding tasks).</p><p>This extends to safety-critical applications. Differing state spaces can result from failed sensors, and differing action spaces from hard constraints imposed by the system. Additionally, reward functions might yield different values where sensors supply noisy data. In the case of CMDPs, for similar reasons, differences can appear in both the constraint cost function and the safety threshold.</p><p>Regarding which kinds of knowledge are transferable, we can distinguish multiple forms. The transfer of trajectories is the main subject of LfD. Furthermore, the transfer of model dynamics is possible when an approximation can be learned by offline algorithms trained on historical data before being transferred to an online system. Offline RL algorithms usually mitigate the gap between real and estimated values by adding a pessimism factor to the learned values ( <ref type="bibr" target="#b5">[6]</ref>) or learned dynamics models ( <ref type="bibr" target="#b6">[7]</ref>).</p><p>The transfer of policies has been discussed by <ref type="bibr" target="#b7">[8]</ref>. They propose extending the exploration-exploitation choice with the option to reuse an older policy and consequently test the transfer performance. 
Reward Shaping (as presented in <ref type="bibr" target="#b8">[9]</ref>) speeds up the RL training process by guiding exploration through a potential-based transformation of the reward function.</p><p>Transfer by starting from prior distributions has been explored by <ref type="bibr" target="#b9">[10]</ref>. Instead of finding trajectories that maximize expected rewards, inference formulations start from a prior distribution over trajectories, condition on the desired outcome, such as reaching a goal state, and then estimate the posterior distribution over trajectories consistent with this outcome. Since imitation learning provides a teacher policy to learn from, the teacher policy can be interpreted as a prior policy distribution.</p></div>
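The measurement categories above can be turned into simple numeric metrics over per-iteration reward curves. A minimal sketch, with our own illustrative helper and the assumption that both learners log equal-length reward curves:

```python
import numpy as np


def transfer_metrics(r_transfer, r_scratch, perf_threshold):
    """Compare reward curves of a transfer learner and a non-transfer
    learner along the measurement categories of Taylor and Stone."""
    r_t = np.asarray(r_transfer, float)
    r_s = np.asarray(r_scratch, float)

    def time_to_threshold(r):
        # First iteration at which the pre-specified performance level is met.
        hits = np.nonzero(r >= perf_threshold)[0]
        return int(hits[0]) if hits.size else None

    t_t, t_s = time_to_threshold(r_t), time_to_threshold(r_s)
    return {
        "jumpstart": r_t[0] - r_s[0],          # initial-policy improvement
        "asymptotic_gain": r_t[-1] - r_s[-1],  # final-policy improvement
        "total_reward_gain": r_t.sum() - r_s.sum(),
        "total_reward_ratio": r_t.sum() / r_s.sum(),
        "learning_time_reduction": None if None in (t_t, t_s) else t_s - t_t,
    }
```

Each dictionary entry corresponds to one bullet in the list above; `learning_time_reduction` is `None` when either learner never reaches the threshold.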
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Using Free-Energy Priors to improve Robustness after Policy Transfer</head><p>In real-world applications, such as robotics, it can be hard to separate signal from noise, especially in the early stages after deploying a learned strategy. We consider a scenario where there is a cost to receiving state data from an actor, e.g., sensor data from a robot's joints. Since we are considering the case of a Sim-to-Real transfer, we assume the existence of priors learned from simulation interactions. In this section, we propose the use of an advantage function over the simulation priors, based on the free-energy principle, to improve the agent's robustness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Free-Energy Functions</head><p>Free-energy functions are fundamental concepts in thermodynamics and statistical mechanics that describe the energy available to do work in a system while accounting for both its internal energy and its entropy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Quantifying the Cost of Control</head><p>Rubin et al. <ref type="bibr" target="#b10">[11]</ref> borrow the term to define free-energy functions in the RL context, deriving optimal policies and exploring the tradeoff between value and control information. The idea is that optimal policies reflect a balance between maximizing expected rewards (value) and minimizing the information cost that comes with control.</p><p>With the help of information theory, we can quantify the expected cost of executing a policy 𝜋 in state 𝑠 ∈ 𝑆 as Δ𝐼 (𝑠) = ∑ 𝑎 𝜋 T 𝑠 (𝑎) log(𝜋 T 𝑠 (𝑎)/𝜋 S 𝑠 (𝑎)) with Δ𝐼 (𝑠 𝑇 ) = 0 for a terminal state 𝑠 𝑇 . With this, we measure the relative entropy between the source policy 𝜋 S and the target policy 𝜋 T . The source policy is used by the agent in the absence of information from its new, noisy environment. For any state 𝑠, Δ𝐼 (𝑠) describes the minimal expected number of extra bits required to describe an action sampled from 𝑎 ∼ 𝜋 T when coding with respect to 𝜋 S . In our case, it serves as a measure of the cost of control. Similar to the value function 𝑉 𝜋 (𝑠 0 ), we can define the total control information involved in executing policy 𝜋 starting from the initial state 𝑠 0 :</p><formula xml:id="formula_2">𝐼 𝜋 (𝑠 0 ) = lim 𝑇 →∞ 𝔼[ 𝑇 −1 ∑ 𝑡=0 Δ𝐼 (𝑠 𝑡 )] = lim 𝑇 →∞ 𝔼[ 𝑇 −1 ∑ 𝑡=0 log 𝜋 T 𝑠 𝑡 (𝑎 𝑡 ) 𝜋 S 𝑠 𝑡 (𝑎 𝑡 ) ] = lim 𝑇 →∞ 𝔼[ log 𝑃𝑟(𝑎 0 , 𝑎 1 , … , 𝑎 𝑇 −1 |𝑠 0 , 𝜋 T ) 𝑃𝑟(𝑎 0 , 𝑎 1 , … , 𝑎 𝑇 −1 |𝑠 0 , 𝜋 S ) ]<label>(1)</label></formula><p>Here, the optimal target policy 𝜋 * ,T should minimize the control information cost while maximizing the reward and respecting environmental constraints.</p></div>
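The per-state cost of control Δ𝐼(s) is simply the KL divergence between the target and source action distributions at s. A sketch for discrete action spaces (the helper name is ours):

```python
import numpy as np


def control_information(pi_target_s, pi_source_s):
    """Delta-I(s): relative entropy (in nats) between the target policy
    pi^T_s and the source/prior policy pi^S_s at a single state s."""
    pt = np.asarray(pi_target_s, float)
    ps = np.asarray(pi_source_s, float)
    mask = pt > 0.0                 # convention: 0 * log(0/q) = 0
    return float(np.sum(pt[mask] * np.log(pt[mask] / ps[mask])))
```

When the transferred agent simply replays the source policy (𝜋 T = 𝜋 S), the cost is zero; it grows as the target policy deviates from the prior.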
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Optimization with constrained Policies</head><p>In safety-critical domains, RL optimization problems are typically subject to constraints. For a distance measure 𝑑 ∶ Π × Π → ℝ and step size 𝛿, trust-region policy optimization algorithms ensure that the new policy lies within a so-called trust region of the previous one: 𝜋 𝑡+1 = arg max </p><p>where</p><formula xml:id="formula_4">𝐷 𝐾 𝐿 (𝜋||𝜋 𝑘 ) = 𝔼 𝑠∼𝑑 𝜋 𝑘 [𝐷 𝐾 𝐿 (𝜋‖𝜋 𝑘 )[𝑠]]</formula><p>and 𝛿 &gt; 0 is the step size.</p><p>The advantage function measures the expected gain in return from taking action 𝑎 in state 𝑠 and is given by:</p><formula xml:id="formula_5">𝐴 𝜋 (𝑠, 𝑎) = 𝑄 𝜋 (𝑠, 𝑎) − 𝑉 𝜋 (𝑠) = 𝔼 𝜏 ∼𝜋 [𝑅(𝜏 )|𝑠 0 = 𝑠, 𝑎 0 = 𝑎] − 𝔼 𝜏 ∼𝜋 [𝑅(𝜏 )|𝑠 0 = 𝑠]<label>(3)</label></formula><p>The trust region is then defined by the set {𝜋 𝜃 ∈ Π 𝜃 ∶ 𝐷 𝐾 𝐿 (𝜋‖𝜋 𝑘 ) ≤ 𝛿}.</p></div>
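The advantage in Eq. (3) can be estimated directly from sampled returns; a minimal Monte-Carlo sketch (illustrative only, not the estimator used by CPO implementations, which typically rely on learned value functions):

```python
import numpy as np


def mc_advantage(returns_from_sa, returns_from_s):
    """A(s, a) = Q(s, a) - V(s), estimated as the mean return of
    trajectories starting with (s, a) minus the mean return of
    trajectories starting from s under the same policy."""
    return float(np.mean(returns_from_sa) - np.mean(returns_from_s))
```

A positive value indicates that committing to action 𝑎 in state 𝑠 is better than acting according to the policy's average behavior.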
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CPO solves the CMDP problem approximately by calculating the update</head><formula xml:id="formula_6">𝜋 𝑘+1 = arg max 𝜋∈Π 𝜃 𝔼 𝑠∼𝑑 𝜋 𝑘 ,𝑎∼𝜋 [𝐴 𝜋 𝑘 (𝑠, 𝑎)] s.t. 𝐽 𝐶 𝑖 (𝜋 𝑘 ) + 𝔼 𝑠∼𝑑 𝜋 𝑘 ,𝑎∼𝜋 [ 𝐴 𝜋 𝑘 𝐶 𝑖 (𝑠, 𝑎) 1 − 𝛾 ] ≤ 𝑐 𝑖 𝐷 𝐾 𝐿 (𝜋‖𝜋 𝑘 ) ≤ 𝛿<label>(4)</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Using Free-Energy Functions to improve Transferability</head><p>We aim to use a free-energy function to derive optimal policies while balancing the tradeoff between value and information during exploration. Early works ( <ref type="bibr" target="#b13">[14]</ref>) propose using advantage functions in noisy environments to mitigate undesired approximation effects by reducing the action gap ( <ref type="bibr" target="#b14">[15]</ref>). We assume a stochastic prior policy 𝜋 S (𝑎|𝑠) from the source task. Fox et al. ( <ref type="bibr" target="#b15">[16]</ref>) propose measuring the information cost of a policy 𝜋 T (𝑎|𝑠) with</p><formula xml:id="formula_7">𝑔 𝜋 T (𝑠, 𝑎) = log 𝜋 T (𝑎|𝑠) 𝜋 S (𝑎|𝑠) .</formula><p>The expected information cost of the target policy 𝜋 T can be written as 𝔼[𝑔 𝜋 T (𝑠 𝑡 , 𝑎 𝑡 )] = 𝐷 𝐾 𝐿 (𝜋 T 𝑠 ‖𝜋 S 𝑠 ). Considering the dynamics induced by the transition probabilities 𝑇 (𝑠 𝑡+1 |𝑠 𝑡 , 𝑎 𝑡 ) of the underlying MDP, we can now consider the total discounted expected information cost of the target policy:</p><formula xml:id="formula_8">𝐼 𝜋 T (𝑠) = ∞ ∑ 𝑡=0 𝛾 𝑡 𝐷 𝐾 𝐿 (𝜋 T 𝑠 𝑡 ‖𝜋 S 𝑠 𝑡 ).<label>(5)</label></formula><p>We define</p><formula xml:id="formula_9">𝐹 𝜋 T (𝑠) = 𝑉 𝜋 T (𝑠) + 1 𝛽 𝐼 𝜋 T (𝑠)<label>(6)</label></formula><p>as a 𝛽-weighted free-energy function, with 𝛽 controlling the tradeoff between value and information. 
From this we get a state-action free-energy function</p><formula xml:id="formula_10">𝐺 𝜋 T (𝑠, 𝑎) = 𝔼 𝜃 [𝑅|𝑠, 𝑎] + 𝛾 𝔼 𝑇 [𝐹 𝜋 T (𝑠 ′ )|𝑠, 𝑎].<label>(7)</label></formula><p>Now, we define the free-energy advantage function as:</p><formula xml:id="formula_11">𝐵 𝜋 T (𝑠, 𝑎) = 𝐺 𝜋 T (𝑠, 𝑎) − 𝑉 𝜋 T (𝑠) = 𝔼 𝜏 ∼𝜋 T [𝐶(𝜏 ) + (𝛾 /𝛽) 𝑔 𝜋 T (𝑠 𝑡+1 , 𝑎 𝑡+1 )|𝑠 0 = 𝑠, 𝑎 0 = 𝑎] − 𝔼 𝜏 ∼𝜋 T [𝐶(𝜏 )|𝑠 0 = 𝑠]<label>(8)</label></formula><p>Here, 𝐶(𝜏 ) denotes the cumulative sum of constraint costs along the trajectory 𝜏.</p><p>Finally, we can calculate the free-energy advantage transfer policy update:</p><formula xml:id="formula_12">𝜋 𝑘+1 = arg max 𝜋∈Π 𝜃 𝔼 𝑠∼𝑑 𝜋 𝑘 ,𝑎∼𝜋 [𝐵 𝜋 𝑘 (𝑠, 𝑎)] s.t. 𝐽 𝐶 𝑖 (𝜋 𝑘 ) + 𝔼 𝑠∼𝑑 𝜋 𝑘 ,𝑎∼𝜋 [ 𝐵 𝜋 𝑘 𝐶 𝑖 (𝑠, 𝑎) 1 − 𝛾 ] ≤ 𝑐 𝑖 𝐷 𝐾 𝐿 (𝜋‖𝜋 𝑘 ) ≤ 𝛿<label>(9)</label></formula></div>
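To make the construction concrete, the free-energy advantage of Eq. (8) augments a sampled constraint advantage with the β-weighted information cost g of the next action. A sketch under the assumption that both quantities have already been estimated from Monte-Carlo rollouts (the function names are our own):

```python
import math


def information_cost(pi_t_prob, pi_s_prob):
    """g(s, a) = log( pi^T(a|s) / pi^S(a|s) ) for a single sampled action,
    given its probability under the target and source policies."""
    return math.log(pi_t_prob / pi_s_prob)


def free_energy_advantage(constraint_adv, g_next, gamma, beta):
    """B(s, a): the constraint advantage plus the discounted, beta-weighted
    information cost of the next sampled action (cf. Eq. 8)."""
    return constraint_adv + (gamma / beta) * g_next
```

Larger β discounts the information term and recovers the ordinary constraint advantage; smaller β penalizes deviation from the simulation prior more strongly.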
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>In this section, we will present the evaluation framework, metrics and results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Experiments</head><p>In this section, we first evaluate the performance of the Constrained Policy Optimization (CPO) algorithm <ref type="bibr" target="#b16">[17]</ref> for constrained RL problems. CPO yields better performance on constrained tasks than methods such as Trust Region Policy Optimization or Primal-Dual Optimization ( <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b17">18]</ref>). We conduct the experiments on an exemplary robot learning task, the HalfCheetah environment within the MuJoCo<ref type="foot" target="#foot_0">1</ref> physics engine embedded in OpenAI Gym<ref type="foot" target="#foot_1">2</ref> . The HalfCheetah is a two-dimensional simulated robot with six controllable joints, as depicted in figure <ref type="figure" target="#fig_0">1</ref>. We use a continuous action space 𝐴 = [−1, 1] 6 , where each entry of the action vector represents the torque [Nm] applied to the respective motorized joint. The constraint is placed on an angle beyond which the HalfCheetah is considered to have fallen over and unable to recover to a standing position without external help.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Evaluating Transferability for Safety-Critical Applications</head><p>For safety-critical applications at any scale, the most direct improvement from TL would generally be starting from accurate prior distributions, because we can expect a reduced exploratory period. While this is expected to reduce training time, prevention of constraint violations is not necessarily guaranteed. Reliable algorithms should also make it possible to train an agent in simulation and then transfer the model to a safety-critical application in the real world without violating the constraints imposed by the task. We therefore extend the list with the following measurements:</p><p>• The ratio of the total constraint cost accumulated by the transfer learner to that accumulated by the non-transfer learner, or between different transfer learners. • The number of constraint violations above a specified threshold committed by the transfer learner compared to the non-transfer learner (or between multiple transfer learners).</p><p>Note that we hypothesize that measuring the robustness gained by simultaneously learning system dynamics ( <ref type="bibr" target="#b18">[19]</ref>) could be a valid metric, which we intend to examine in the future.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Evaluation</head><p>We hence compare the CPO algorithm with and without free-energy advantage policy transfer (FEAT) in noisy environments with a noise factor 𝑈 𝑗 ∼ 𝒩 (1, 𝜎 ) for every state variable index 𝑗 ∈ {1, … , |𝑠|}, evaluating the post-transfer performance according to the previously proposed criteria. In all experiments, we first pre-train an agent with an implementation of the CPO algorithm in a simulated environment without noise for 2500 iterations. After the final iteration, the agent is able to control the HalfCheetah at a satisfactory level.</p></div>
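The noisy target environment can be sketched as a thin observation transform that multiplies each state variable by 𝑈_j ∼ 𝒩(1, σ); the helper below is our own illustration of the setup, not the code used in the experiments:

```python
import numpy as np


def noisy_observation(state, sigma, rng):
    """Multiply each state variable s_j by an independent noise factor
    U_j ~ N(1, sigma), mimicking the noisy target environment."""
    state = np.asarray(state, float)
    return state * rng.normal(loc=1.0, scale=sigma, size=state.shape)
```

Wrapping the environment's `step` output with this transform turns the clean simulation into the noisy target domain used for evaluation.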
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1.">Comparison of ratios of total constraint costs</head><p>Figure <ref type="figure" target="#fig_1">2</ref> shows the mean constraint costs over a post-transfer training process of 𝑇 = 1000 iterations. Our approach, CPO+FEAT (orange), manages to stay below the curve of the baseline approach, CPO (green). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2.">Comparison of the sum of constraint violations</head><p>For the criterion of constraint violations, we define a constraint threshold 𝑐. As above, we train the agents for a total of 𝑇 = 1000 iterations. In a noisy environment with 𝜎 = 0.1, we evaluate both agents with a strict safety threshold of 𝑐 = 0.02; this value of 𝑐 means that the HalfCheetah is not allowed to show any signs of falling over. While CPO without FEAT violates the threshold 7.2% of the time, CPO with FEAT does so only 3.5% of the time.</p><p>For 𝜎 = 0.2, we chose a higher threshold of 𝑐 = 0.15 (the agent is allowed to appear unstable, but not to fall over). CPO without FEAT violates the threshold in 86.7% of iterations, while CPO with FEAT is significantly lower at 32.3%. Unfortunately, both algorithms still lack the robustness needed to guarantee safety in environments with higher noise levels.</p></div>
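The violation percentages reported above correspond to a simple rate over training iterations; for clarity, an illustrative helper:

```python
def violation_rate(iteration_costs, c):
    """Fraction of training iterations whose constraint cost exceeds
    the safety threshold c."""
    return sum(cost > c for cost in iteration_costs) / len(iteration_costs)
```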
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion and Future Work</head><p>In this paper, we highlighted how Transfer Learning can be interpreted in the context of constrained Reinforcement Learning and proposed a way to evaluate transferability. The experiments indicate that our approach improves the transferability of policies for constrained problems in the specific case of the Constrained Policy Optimization algorithm.</p><p>In the future, we aim to investigate how this approach applies to similar policy-based RL algorithms and to extend it to a more general setting. Furthermore, to reflect real-world problems more accurately, we plan to add further restrictions to the actor's perception of the environment, such as partial observability.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: A rendering of the MuJoCo HalfCheetah environment in its initial state. Its controllable joints are highlighted in red.</figDesc><graphic coords="7,121.46,84.19,349.88,292.13" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: A comparison of mean constraint costs over 𝑇 = 1000 iterations between CPO (green) and CPO with FEAT (orange) in a noisy environment with 𝜎 = 0.1.</figDesc><graphic coords="8,117.00,199.67,358.80,152.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>𝐽 𝐶 (𝜋) ≤ 𝑐 and 𝑑(𝜋, 𝜋 𝑡 ) ≤ 𝛿. Here, Π 𝜃 ⊂ Π denotes a 𝜃-parameterized policy subset that filters for relevant parameters. Trust region algorithms for reinforcement learning (<ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>, such as CPO, have policy updates of the form𝜋 𝑘+1 = arg max 𝜋∈Π 𝜃 𝔼 𝑠∼𝑑 𝜋 𝑘 ,𝑎∼𝜋 [𝐴 𝜋 𝑘 (𝑠, 𝑎)],s.t. 𝐷 𝐾 𝐿 (𝜋‖𝜋 𝑘 ) ≤ 𝛿</figDesc><table><row><cell>𝜋∈Π 𝜃</cell><cell>𝐽 (𝜋) s.t.</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/openai/mujoco-py</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://github.com/openai/gym</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research has been funded by the Federal Ministry of Education and Research of Germany and the state of North-Rhine Westphalia as part of the Lamarr-Institute for Machine Learning and Artificial Intelligence.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Robust constrained model predictive control using linear matrix inequalities</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">V</forename><surname>Kothare</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Balakrishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Morari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Automatica</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="1361" to="1379" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A comprehensive survey on transfer learning</title>
		<author>
			<persName><forename type="first">F</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Xi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the IEEE</title>
		<imprint>
			<biblScope unit="volume">109</biblScope>
			<biblScope unit="page" from="43" to="76" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Transfer learning for reinforcement learning domains: A survey</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Stone</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2009.07888</idno>
		<title level="m">Transfer learning in deep reinforcement learning: A survey</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Ogishima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Karino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kuniyoshi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.11811</idno>
		<title level="m">Reinforced imitation learning by free energy principle</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Off-policy deep reinforcement learning without exploration</title>
		<author>
			<persName><forename type="first">S</forename><surname>Fujimoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Meger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Precup</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="2052" to="2062" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Morel: Model-based offline reinforcement learning</title>
		<author>
			<persName><forename type="first">R</forename><surname>Kidambi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rajeswaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Netrapalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Joachims</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="21810" to="21823" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Probabilistic policy reuse for inter-task transfer learning</title>
		<author>
			<persName><forename type="first">F</forename><surname>Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>García</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Veloso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Robotics and Autonomous Systems</title>
		<imprint>
			<biblScope unit="volume">58</biblScope>
			<biblScope unit="page" from="866" to="871" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Policy transfer using reward shaping</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brys</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Harutyunyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nowé</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AAMAS</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="181" to="188" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Abdolmaleki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Springenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tassa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Munos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Heess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Riedmiller</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1806.06920</idno>
		<title level="m">Maximum a posteriori policy optimisation</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Trading value and information in MDPs</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rubin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Shamir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tishby</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Decision Making with Imperfect Decision Makers</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="57" to="74" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Trust region policy optimization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Schulman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abbeel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jordan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Moritz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1889" to="1897" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Schulman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Moritz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jordan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abbeel</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1506.02438</idno>
		<title level="m">High-dimensional continuous control using generalized advantage estimation</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Reinforcement learning in continuous time: Advantage updating</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C</forename><surname>Baird</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN&apos;94)</title>
				<meeting>1994 IEEE International Conference on Neural Networks (ICNN&apos;94)</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="1994">1994</date>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="2448" to="2453" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Increasing the action gap: New operators for reinforcement learning</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Bellemare</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ostrovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Guez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Thomas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Munos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">30</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Fox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pakman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tishby</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1512.08562</idno>
		<title level="m">Taming the noise in reinforcement learning via soft updates</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Constrained policy optimization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Achiam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Held</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tamar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abbeel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="22" to="31" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Risk-constrained reinforcement learning with percentile risk criteria</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ghavamzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Janson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pavone</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="6070" to="6120" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Mixed strategies for robust optimization of unknown objectives</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">G</forename><surname>Sessa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bogunovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kamgarpour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Krause</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Artificial Intelligence and Statistics</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2970" to="2980" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
