Adversarial Attacks on Deep Algorithmic Trading Policies

Nancirose Piazza*1, Yaser Faghan†2, Vahid Behzadan‡1, and Ali Fathi§3

1 Secure and Assured Intelligent Learning (SAIL) Lab, University of New Haven, USA
2 Instituto Superior de Economia e Gestão and CEMAPRE, Universidade de Lisboa, Portugal
3 Enterprise Model Risk Management Group, Royal Bank of Canada (RBC)

* npiaz1@newhaven.edu   † yaser.kord@yahoo.com   ‡ vbehzadan@newhaven.edu   § ali.fathi@rbc.com

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Proceedings of the Conference on Applied Machine Learning for Information Security, 2021.

Abstract

Deep Reinforcement Learning (DRL) has become an appealing solution to algorithmic trading such as high-frequency trading of stocks and cryptocurrencies. However, DRL policies are known to be susceptible to adversarial attacks. It follows that algorithmic trading DRL agents may also be compromised by such adversarial techniques, leading to policy manipulation. In this paper, we develop a threat model for deep trading policies and propose two active attack techniques for manipulating the performance of such policies at test-time. Additionally, we explore the exploitation of a passive attack based on adversarial policy imitation. Furthermore, we demonstrate the effectiveness of the proposed attacks against benchmark and real-world DQN trading agents.

1 Introduction

The pursuit of intelligent agents for automated financial trading is a challenge that has captured the interest of researchers and analysts for decades (Cartea et al. [2015]). The process of trading is well depicted as an online decision-making problem involving two critical steps: summarizing the market condition and executing optimal actions. For many years, algorithmic trading suffered from various problems, ranging from difficulties in representing complex market conditions to real-time approaches to optimal decision-making in the trading environment. With recent advances in Machine Learning (ML), particularly in deep learning and Deep Reinforcement Learning (DRL), such challenges are dramatically alleviated via numerous novel proposals and architectures that enable end-to-end approaches to algorithmic trading (Pricope [2021]). In this context, end-to-end refers to the direct mapping of high-dimensional raw market and environment observations to optimal decisions in real-time. As data-driven agents, many such algorithms rely on sources of data that are either externally or collectively maintained, examples of which include market indicators (Cartea et al. [2015]) and social indicators (e.g., sentiment analysis of Twitter feeds by Kaur [2017]).

While the growing interest in the adoption of DRL techniques for algorithmic trading is justified by their impressive success in other domains, the threat of adversarial manipulation in such systems is yet to be explored.
Recent developments in the domain of adversarial machine learning have brought attention to the security challenges arising from the vulnerability of machine learning models to adversarial attacks (Papernot et al. [2018]). Instances of such attacks include adversarial examples, such as those produced by the Fast Gradient Sign Method of Goodfellow et al. [2014], which are strategically induced perturbations of the input vectors that are not easily detectable by human observers. Adversarial attacks can impact all deep learning and classical machine learning models, including DRL agents, as investigated by Behzadan and Munir [2018]. Recent work by Behzadan and Munir [2017a, 2018] and Behzadan [2019] establishes that DRL algorithms are vulnerable to adversarial actions at both the training and inference phases of their deployment. This discovery has been further verified in settings such as video games (Huang et al. [2017]), robotics (Clark et al. [2018]), autonomous navigation (Behzadan and Munir [2019]), and cybersecurity (Han et al. [2018]). Yet, the extent, severity, and dynamics of such vulnerabilities in DRL trading agents are yet to be addressed.

Adversarial perturbations of DRL trading policies are also significant from the financial Model Risk Management (MRM) point of view (Reserve [2011], Office of the Superintendent of Financial Institutions [OSFI], Morini [2011]), since the existence of such vulnerabilities can be traced back to the algorithmic underpinnings of these systems. However, principal differences between traditional financial models and algorithmic trading systems pose additional challenges for quantifying the resulting model risk. For instance, the number of model components involved in an algorithmic trading system can be large, and hence the fusion of otherwise individually negligible residual model risks may result in significant system errors. Furthermore, there is the adaptive nature of DRL-based algorithms, where the model components are re-calibrated (e.g., through retraining) on a low-latency schedule. It should also be noted that, unlike other areas of quantitative modelling in finance (such as asset pricing or credit risk), benchmarking of model components in algorithmic systems is difficult due to competition considerations, as there may be restrictions on conducting open-box validation of proprietary models within a firm.

In this paper, we investigate test-time adversarial attacks against DRL trading agents. The main contributions are:

• We present a threat model for DRL trading policies, identifying susceptible attack surfaces and practical attack vectors at test-time.

• We establish the vulnerability of current DRL trading policies to adversarial manipulation via active test-time attacks.

• We explore Imitation Learning for adversarial purposes after acquisition of expert demonstrations, both perfect and imperfect, from a passive test-time attack for policy imitation.

• We investigate the transferability of our perturbation attacks from the imitated agents to the target agent.

• We demonstrate the efficacy of the proposed attack vectors in manipulating DRL trading agents.

The remainder of the paper is as follows: Section 2 presents an overview of reinforcement learning and a review of the security issues in electronic trading platforms. Section 3 proposes a threat model for trading DRL agents, outlining the various attack surfaces and vectors that can be exploited by an adversary.
Section 4 provides the details of our experimental setup for investigating the proposed attack mechanisms, the results of which are presented in Sections 5 and 6. The paper concludes in Section 7 with a summary of our findings, as well as discussions of future directions of research on the security of deep trading policies.

2 Background

2.1 Reinforcement Learning, Value Iteration & Deep Q-Learning

Reinforcement learning (RL) is concerned with agents that interact with an environment and exploit their experiences to optimize a sequential decision-making policy. RL can be formally modeled as learning to control a Markov Decision Process (MDP) M = (S, A, R, P), where S is the set of reachable states in the process, A is the set of available actions, R is the mapping of transitions to the immediate reward, and P represents the transition probabilities (i.e., state dynamics), which are initially unknown to RL agents. At any given time-step t, the agent is at a state s_t ∈ S, chooses an action a_t ∈ A, transitions from s_t to a state s_{t+1} according to the transition probability P(s_{t+1} | s_t, a_t), and receives a reward r_{t+1} = R(s_t, a_t, s_{t+1}). The solution to an MDP problem is a policy π(s), which is a mapping from states to actions. The goal of RL is to learn a policy that maximizes the expected discounted return E[R_t], where R_t = Σ_{k=0}^{N} γ^k r_{t+k}, with r_t denoting the instantaneous reward received at time t and γ ∈ [0, 1] a discount factor. The value of a state s_t is defined as the expected discounted return from s_t following a policy π, that is, V^π(s_t) = E[R_t | s_t, π]. The state-action value (Q-value) Q^π(s_t, a_t) = E[R_t | s_t, a_t, π] is the value of state s_t after applying action a_t and following the policy π thereafter.

The solution approaches to RL include value iteration algorithms that optimize a value function (e.g., V(.) or Q(., .)) to extract the optimal policy from it. As an instance of value iteration algorithms, Q-Learning aims to maximize the action-value function Q through the iterative formulation of Eq. (1):

    Q(s, a) = R(s, a) + γ max_{a'} Q(s', a')    (1)

where s' is the state that emerges as a result of action a, and a' is a possible action in state s'. The optimal Q-value given a policy π is defined as Q*(s, a) = max_π Q^π(s, a), and the optimal policy is given by π*(s) = argmax_a Q(s, a).

The Q-learning method estimates the optimal action policies by using the Bellman formulation to iteratively reduce the TD-error, given by Q_{i+1}(s, a) − E[r + γ max_a Q_i], in the iterative update of a value iteration technique. Practical implementations of Q-learning are commonly based on function approximation of the parametrized Q-function, Q(s, a; θ) ≈ Q*(s, a). A common technique for approximating the parametrized non-linear Q-function is via neural network models whose weights correspond to the parameter vector θ. Such neural networks, commonly referred to as Q-networks, are trained such that at every iteration i, the following loss function is minimized:

    L_i(θ_i) = E_{s,a∼ρ(·)}[(y_i − Q(s, a; θ_i))^2]    (2)

where y_i = E[r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a], and ρ(s, a) is a probability distribution over states s and actions a. This optimization problem is typically solved using computationally efficient techniques such as Stochastic Gradient Descent (SGD).
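To make Eq. (2) concrete, the following is a minimal sketch, assuming PyTorch and hypothetical q_net / target_net modules, of how the TD target y_i and the squared-error loss might be computed for a batch of transitions. It is illustrative only and is not the training code of the agents studied in this paper.

```python
# Minimal, illustrative sketch of the Q-network loss in Eq. (2).
# Assumes PyTorch; `q_net` and `target_net` are hypothetical torch.nn.Module
# Q-networks mapping observations to one Q-value per discrete action.
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD-error loss over a batch of transitions."""
    # obs/next_obs: float tensors; actions: int64 action indices;
    # rewards: float tensor; done: bool/float tensor for terminal flags.
    obs, actions, rewards, next_obs, done = batch

    # Q(s, a; theta_i) for the actions actually taken.
    q_values = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}),
    # computed with the (frozen) target network.
    with torch.no_grad():
        next_q_max = target_net(next_obs).max(dim=1).values
        targets = rewards + gamma * (1.0 - done.float()) * next_q_max

    return F.mse_loss(q_values, targets)
```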
A Deep Q-Network (DQN), introduced by Mnih et al. [2015], is a training algorithm and implementation of Q-value estimation by a neural network function approximator. Techniques such as experience replay and a target network are used in a DQN to stabilize the training process and maintain the i.i.d. (Independent and Identically Distributed) property of the data. Mnih et al. [2015] demonstrate the application of this Q-network technique to end-to-end learning of Q-values in playing Atari games based on observations of pixel values in the game environment.

2.2 State of Security in Algorithmic Trading

In recent years, electronic trading platforms have made access to global capital markets easier for the public, resulting in a lower barrier to entry and an influx of traffic across these platforms. The growing interest in such trading platforms and technologies is, however, accompanied by increasing risks of cyber attacks. While the literature on the cybersecurity issues of current trading platforms is scarce, a few industry-sponsored studies report concerning issues in deployed trading platforms. One such study on the exposure of security flaws in trading technologies by Hernandez [2018] evaluates various popular desktop, mobile, and web trading service platforms against a standard list of security checks, and reports that these trading technologies are in general far more susceptible to cyber attacks than previously-reviewed personal banking applications from 2013 and 2015. The security checks covered features such as 2-Factor Authentication (2FA), encrypted communications, privacy mode, anti-reverse engineering, and hard-coded secrets. This study reports that 64% of the reviewed trading applications rely on unencrypted communication channels for authentication and trading data. Also, the author finds that many trading applications employ poor session management and SSL certificate validation, thereby enabling Man-in-The-Middle (MITM) attacks. Furthermore, this report points out the wide-scale susceptibility of such platforms to remote Denial of Service (DoS) attacks, which may render the applications useless. Building on the findings of this study, our paper investigates attacks that leverage the aforementioned vulnerabilities to manipulate deep algorithmic trading policies.

3 Threat Model of DRL Trading Agents

Adversarial attacks against DRL policies aim to compromise one or more aspects of the Confidentiality, Integrity, and Availability (CIA) triad in the targeted agents (Behzadan and Munir [2018]). More specifically, the Confidentiality of a DRL agent refers to the need for confidentiality of the agent's parameters, such as its policy or reward function. The Integrity of a DRL agent relies on the policy behaving as intended by the user. Availability refers to the agent's capability to execute its task when needed. At a high level, the threat landscape of DRL agents can be captured in terms of the Attack Surface and Attack Model of the agent (Behzadan [2019]), as outlined below.

3.1 Attack Surface and Vectors

Adversarial attacks may target all components of a DRL agent, including the environment, the agent's observation channel, reward channel, actuators, and training components (e.g., experience storage and selection), as identified by Behzadan [2019]. Figure 1 illustrates the components of a DRL trading agent at test-time.
In the context of algorithmic trading, the observation of the environment is gathered from various sources such as market indicators, social media indicators, and exchanges; we refer to these sources as input channels. This data is preprocessed and feature-engineered to create the agent's observation of the state. These states are part of the observation returned by the environment to the agent along with the reward. Through the observation channel, an adversary may intercept the observation and exchange it for a perturbed observation, otherwise known as a Man-In-The-Middle (MITM) attack. An adversary may also impose a delay on the observation channel through a Denial of Service (DoS) attack. Ding and Dong [2020] show that slight perturbations of the observed state impact DRL agent performance. The reward channel is often tied to internal securities such as bank accounts or portfolios, and hence is less susceptible to external adversarial manipulation. However, any external component reachable by the agent can be compromised implicitly.

3.2 Attack Model

The capabilities of an adversary are defined by two factors: the actions available to the adversary and the information available about the target. This section presents a classification of attacks and adversaries at the inference phase based on the aforementioned factors.

Figure 1: Attack Surface and Vectors of a DRL Trading Agent at Test-Time (input channels such as exchange data, market indicators, and social media indicators feed the agent's state; the observation channel is exposed to MITM perturbations and DoS-induced delay, while the reward channel, tied to external securities such as a bank account, is more difficult to access).

According to the available information, attacks are classified as whitebox or blackbox. Whitebox refers to the case where the adversary has sufficient knowledge of the target's parameters to directly craft an effective perturbation, and blackbox refers to the converse scenario.

Perturbations of observations affect both test-time and training-time. While this paper focuses on test-time attacks, it is noteworthy that during training, additional error is bootstrapped, potentially impacting the learned policy. Behzadan and Munir [2017b] show that training-time attacks, under certain conditions and with sufficiently high perturbation rates, result in the agent's inability to recover its performance upon test-time evaluation under non-adversarial conditions.

3.2.1 Test-Time Attacks

Test-time or inference-time attacks may be active or passive. Active attacks require adversarial intervention to manipulate the DRL policy. Instances of such attacks include adversarial examples (Goodfellow et al. [2014], Carlini and Wagner [2017], Su et al. [2019]) and delay induction in observations. Passive attacks gather information about the target agent by observing the target's behavior in various states. With sufficient observations of state-action pairs, the adversary can reconstruct the targeted policy and compromise the Confidentiality of targeted, proprietary agents (Behzadan and Hsu [2019]).

Active attacks can be classified as targeted or non-targeted. Successful non-targeted attacks aim to have the policy select any action other than the one prescribed by the policy, by modifying (i.e., perturbing) the true observation.
Targeted attacks craft perturbations such that the target selects a particular sub-optimal action a'. In the category of passive attacks, Imitation Learning and Inverse Reinforcement Learning are avenues an adversary may exploit to either attack the target agent or steal components of the agent, such as its policy. As demonstrated by Behzadan and Hsu [2019], adversaries can gather additional information through policy imitation, thereby enabling whitebox attacks against blackbox targets.

3.2.2 Training-Time Attacks

Training-time attacks, also referred to as data poisoning attacks, impair an agent's capability to learn optimally. In such attacks, the adversary manipulates the training data via injecting false samples, mislabeled samples, or overrepresented samples to manipulate the distribution of the training data (Goldblum et al. [2021]). Though typically studied in supervised and unsupervised learning tasks, data poisoning can also apply to DRL, as demonstrated by Behzadan and Munir [2017b].

4 Experimental Setup

We demonstrate the proposed attacks on two trading agents based on DQN policies of varying complexity: one we refer to as the basic DQN, which uses a simple OpenAI Gym[1] environment to emulate trading, and the other is based on an open-source framework called TensorTrade[2], which leverages a more realistic OpenAI Gym environment mimicking real-world trading settings. Our basic DQN represents less complex agents, while TensorTrade's DQN demonstrates the real-world impact of such attacks on agents that have external components tied to them, such as a portfolio. In fact, TensorTrade is currently used and deployed for actual DRL-based trading on online cryptocurrency and stock exchanges.

There are general choices for the components of the MDP M. The state space may contain a subset of four common prices: open, high, low, and close. Technical indicators, i.e., other measurements traders use to assess a stock, may also be included in the state space. The duration of a timestep can be any interval, e.g., milliseconds, minutes, or hours; each interval is called a bar. The action space may include buy/sell/hold quantities, which can be continuous or discrete. Environments implement a commission fee upon changing position (buy/sell). The reward function can be profit/loss or a more detailed metric such as the Sharpe ratio. Training is usually performed on historical data.

4.1 Basic Trading Environment

In the basic trading environment, the historical data is sourced from Yandex N.V. (YNDX) (yan) for the period 2015-2016. The dataset is comprised of samples at a one-minute temporal resolution, and the dynamics of the price during each minute are captured by four values: open price, high price, low price, and close price. Our agent can only hold, sell, or buy a single stock. Table 1 details the specifications of the Basic Stock Environment, and Table 2 contains the hyperparameters of the DQN agent trained in this environment.
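As a concrete illustration of how such a windowed observation might be assembled, the sketch below builds the relative-price features listed in Table 1. Computing relative prices with respect to each bar's open price is an assumption made for illustration, not a specification taken from the environment's source.

```python
# Illustrative sketch only: builds a windowed observation of relative
# high/low/close prices from OHLC bars. The definition of "relative"
# (here, relative to each bar's open price) is an assumption.
import numpy as np

def make_observation(bars, have_position, open_position_profit, window=10):
    """bars: array of shape (T, 4) with columns [open, high, low, close]."""
    recent = bars[-window:]
    opens = recent[:, 0]
    rhp = (recent[:, 1] - opens) / opens   # relative high price
    rlp = (recent[:, 2] - opens) / opens   # relative low price
    rcp = (recent[:, 3] - opens) / opens   # relative close price
    features = np.stack([rhp, rlp, rcp], axis=1).flatten()
    # Append the bought-share indicator and the current position's profit/loss.
    return np.concatenate([features, [float(have_position), open_position_profit]])
```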
4.2 TensorTrade Environment

The TensorTrade environment (TT) can implement a portfolio that holds wallets of various coins or currencies. The data used for this setup is included with TT as a demonstration of training. This dataset is dated from the start of 2020 and contains the open, high, low, close, and volume prices at hourly intervals. The dataset features also include technical indicators such as the Relative Strength Index (RSI) and Moving Average Convergence Divergence (MACD), as well as the log-return log(C_t) − log(C_{t−1}), where C_t is the closing price at timestep t. Our portfolio starts with 10,000 USD and 10 BTC. We use the risk-adjusted reward scheme and manage-risk action scheme provided by TT. The risk-adjusted reward scheme uses the Sharpe Ratio, defined by the equation below:

    S_a = E[R_a − R_b] / σ_a

where R_a is the asset return, R_b is the risk-free return, and σ_a is the standard deviation of the asset's excess return.

The manage-risk action scheme scales the action space depending on the provided arguments, such as trade size, stop, and take. The default trade size is 10, which implies a list of 10 uniformly spaced trade sizes. For instance, a trade size of 3 implies that 33.3%, 66.6%, or 99.9% of the balance can be traded. Take is a list of possible take-profit percentages for an order, and stop is a list of possible stop-loss percentages for an order. The action space is the resulting product of take, stop, trade size, and the action type, which is buy or sell. There is one additional action: wait/hold. In our case, we have an action space of size 181. This information, as well as the training hyperparameters, is summarized in Table 1 and Table 2, respectively. There are other, simpler reward (e.g., SimpleProfit) and action (e.g., Buy/Sell/Hold, BSH) schemes available with TT.

Table 1: Specifications of the Basic Stock Environment & TensorTrade's Environment

Basic DQN
– Observation Space: past 10 bars consisting of RHP, RLP, RCP; a [0 or 1] bought-share indicator; and the profit or loss from the current position (RHP: Relative High Price, RLP: Relative Low Price, RCP: Relative Close Price).
– Action Space: buy a share; wait; close the position (sell).
– Reward: no position: [100 × (SP − BP) / BP]% − C%; position: −C% (SP: Sold Price, BP: Bought Price, C: Commission).
– Termination: episode length > 250.

TensorTrade's DQN
– Observation Space: past 20 tuples of log(C_t) − log(C_{t−1}) (C_t is the closing price at timestep t), MACD (fast=10, slow=50, signal=5), and RSI (period=20).
– Action Space: Managed Risk Scheme; the product of (stop, take, trade size, [buy, sell]) gives 180 actions, plus a wait/hold action (indexed at 0).
– Reward: Risk-Adjusted Scheme (Sharpe Ratio).
– Termination: timestep > 250.

Table 2: Training Hyperparameters

Basic DQN: No. Timesteps 10^5; γ 0.99; Learning Rate 10^-4; Replay Buffer Size 10^5; First Learning Step 1000; Target Network Update Freq. 1000; Exploration: Parameter-Space Noise (PSN); Exploration Fraction 0.1; Final Exploration Prob. 0.02; Max. Total Reward 250.

TensorTrade's DQN: No. Timesteps 250; Episodes 100; Epochs 80; γ 0.9999; Learning Rate 10^-5; Replay Buffer Size 10^3; Target Network Update Freq. 10^3; Exploration: ϵ-greedy; Optimistic Initialization ϵ 0.9; Minimum ϵ 0.05; Decay ϵ every N steps: 200.

[1] OpenAI Gym, (2016), GitHub repository, https://github.com/openai/gym
[2] TensorTrade, (2019), GitHub repository, https://github.com/tensortrade-org/tensortrade
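The following is a rough sketch, under stated assumptions, of the two schemes described above; it is not TensorTrade's actual implementation. The Sharpe-ratio function follows the equation for S_a, and the stop/take lists are hypothetical values chosen only so that the managed-risk product matches the 180 + 1 = 181 actions reported for our setup.

```python
# Illustrative sketch, not TensorTrade's implementation.
import numpy as np

def sharpe_ratio(asset_returns, risk_free_returns):
    """S_a = E[R_a - R_b] / sigma_a, with sigma_a the std of the excess return."""
    excess = np.asarray(asset_returns) - np.asarray(risk_free_returns)
    return excess.mean() / excess.std()

# Managed-risk action space size: product of stop, take, trade-size, and
# order-side choices, plus one wait/hold action. The stop/take lists below
# are hypothetical; with 3 stops, 3 takes, 10 trade sizes, and 2 sides,
# the total is 3 * 3 * 10 * 2 + 1 = 181, matching the action count above.
stops = [0.02, 0.04, 0.06]
takes = [0.01, 0.02, 0.03]
trade_sizes = [(i + 1) / 10 for i in range(10)]   # 10%, 20%, ..., 100% of balance
sides = ["buy", "sell"]
n_actions = len(stops) * len(takes) * len(trade_sizes) * len(sides) + 1
print(n_actions)  # 181
```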
5 Active Test-Time Attacks

In this section, we investigate the impact of adversarial attacks on deep trading agents at test-time. To preserve the realism of our study, we limit the scope of our investigation to attacks that satisfy the following constraints:

(1) Attacks are limited to manipulating the observation channel of the target.

(2) Attacks are limited to perturbations that are not immediately detected by common human or automated anomaly detection mechanisms.

We implement two different types of attacks, namely non-targeted delay attacks and non-targeted/targeted adversarial perturbation attacks. This study considers whitebox attacks only. However, as demonstrated in Behzadan and Hsu [2019], it is also feasible to reverse-engineer blackbox policies via imitation learning, thereby converting blackbox attacks to whitebox ones.

5.1 Non-Targeted Delay Attacks

We evaluate non-targeted attacks on the observation channel that delay the single, most recent tuple of the feature window history. The observation delay is of 1 timestep, whereby the tuple of values seen at timestep t − 1 is received at timestep t. This is both practical and representative of minimal interference. Because there is no adversarial preference for when to apply the delay, this attack is non-targeted. A targeted delay attack would instead implement an intended timing; however, we do not pursue this here. Results are presented in Figure 2. This type of non-targeted attack should be of concern to traders because it carries little computational expense and because the resulting anomalies are masked by time-series locality.

Figure 2: Observational Delay (δ is the number of delayed timesteps): (a) Basic DQN, (b) TensorTrade DQN; each panel compares no delay against δ = 1 with P ∈ {0.1, 0.5, 0.9, 1.0}.

5.2 Non-Targeted Perturbation Attacks

To investigate the effectiveness of adversarial example attacks on DRL policies, we implemented the Fast Gradient Sign Method (FGSM) (Goodfellow et al. [2014]) and Carlini & Wagner (C&W) (Carlini and Wagner [2017]) adversarial example attacks using the L2 loss for both DQNs. Table 4 reports failure counts and other notable counts for the basic DQN and TT's DQN. In this experiment, we perturb the single, most recent tuple of values in the observation space for all FGSM and C&W attacks. Our definition of a failure for non-targeted attacks is the failure to change the action or action type. Post-constraints are applied such that adversarial samples must fall within a realistic distribution of the true data. Some modifications were made to the C&W implementation for TT due to non-normalized data.

Representative samples of a perturbed tuple of values from successful attacks are presented in Table 3. See Figure 3 for the Basic DQN return performance comparison under attack vs. not under attack. Additionally, we provide the total reward difference and net-worth difference between the TT target agent and the TT target agent under attack in Figure 5 and Figure 7, respectively. Through these results, we establish that the test-time performance of the target policy, in terms of its total reward, is negatively impacted by our attacks. We have also shown that the agent's net-worth is impacted, but this is not necessarily reflected by the total reward.
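As an illustration of how such a non-targeted FGSM perturbation, restricted to the most recent tuple of the observation window, might be crafted against a Q-network, the following is a minimal sketch assuming PyTorch and a hypothetical q_net. The objective used here (reducing the Q-value of the greedy action) is one common choice and not necessarily the exact loss used in our experiments.

```python
# Minimal, illustrative FGSM sketch (not the exact attack code used here).
# Assumes PyTorch; `q_net` is a hypothetical Q-network taking an observation
# window of shape (window, n_features) and returning one Q-value per action.
import torch

def fgsm_untargeted(q_net, obs, epsilon=0.01):
    """Perturb only the most recent tuple so the greedy action loses value."""
    obs = obs.clone().detach().requires_grad_(True)
    q_values = q_net(obs.unsqueeze(0)).squeeze(0)
    greedy_action = q_values.argmax()

    # Non-targeted objective: reduce the Q-value of the currently greedy action.
    loss = -q_values[greedy_action]
    loss.backward()

    perturbation = epsilon * obs.grad.sign()
    # Restrict the attack to the single most recent tuple of the window.
    mask = torch.zeros_like(obs)
    mask[-1, :] = 1.0
    return (obs + mask * perturbation).detach()
```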
5.3 Targeted Perturbation Attacks

Targeted attacks aim to manipulate a policy into taking an adversarial action a'_t instead of the action a_t at a timestep t. We evaluated targeted FGSM and targeted C&W attacks using the L2 loss for both DQNs, with the minimal-Q-value action as the selected adversarial action. However, it is worth remembering that the function approximator may regress values that are no longer faithful representations of Q-values, implying that the action with the minimal regressed Q-value may not be the best choice for the adversary. Again, we leave the exact implementation details to the full paper.

See Table 4 for failure counts, attempts, and partial successes (PS). We define a partial success as an attack that results in an action a_m whose action type matches the action type of the adversarial action a'_t. Our adversarial tuples are simple, but they emphasize that adversarial attacks crafted with more expensive parameters, such as a low learning rate, a high number of iterations, and high confidence, can produce adversarial samples that are more convincing to humans. Performance under attack for the Basic DQN can be found in Figure 4, TT's total reward difference in Figure 6, and TT's net-worth difference in Figure 8. We thus establish the impact of targeted attacks on TT's DQN on test-time performance, as well as their significant impact on the agent's net-worth.

Table 3: Successful Non-Targeted (NT) FGSM & C&W Attacks against the Basic DQN

NT FGSM
timestep   original observation                    perturbed observation       a   a'
894        0.0, -0.00354677, -0.00354677           0.0000, -0.0045, -0.0025    1   0
3973       0.0, -0.00048828, -0.00048828           0.0000, -0.0006, -0.0004    1   0
9599       0.00294118, -0.0004902, 0.00294118      0.0027, -0.0002, 0.0027     0   1
16323      0.00435098, 0.0, 0.00290065             0.0041, 0.0000, 0.0032      2   0
23283      0.00074322, -0.00371609, 0.00074322     0.0001, -0.0044, 0.0001     0   1

NT C&W
timestep   original observation                    perturbed observation       a   a'
1602       0.00203314, 0.0, 0.00203314             0.0003, 0.0000, 0.0003      0   1
4735       0.00707071, 0.0, 0.00707071             0.0002, 0.0000, 0.0002      0   1
5346       0.0032695, -0.00140121, 0.0032695       0.0002, -0.0002, 0.0002     0   1
17424      0.0010985, -0.0010985, 0.0010985        0.0002, -0.0002, 0.0002     2   0
29779      0.00039904, -0.00079808, 0.00039904     0.0003, -0.0003, 0.0003     0   1

Table 4: Non-Targeted FGSM (NT FGSM) and C&W (NT C&W), and Targeted FGSM (T FGSM) and C&W (T C&W) attacks: Opportunities (Opt), Fails, and Partial Successes (PS) on TensorTrade's DQN (TT) and the Basic DQN

Non-Targeted
DQN     P      NT FGSM: Opt / Fail / N.C.N      NT C&W: Opt / Fail / N.C.N
TT      0.1    26 / 25 / 2                      17 / 17 / 0
TT      0.5    123 / 117 / 7                    114 / 110 / 3
TT      1.0    242 / 236 / 7                    246 / 240 / 3
Basic   0.01   286 / 6 / -                      329 / 163 / -
Basic   0.1    3349 / 176 / -                   3016 / 1751 / -
Basic   0.5    15818 / 3329 / -                 15979 / 9358 / -
Basic   1.0    31779 / 10778 / -                31779 / 18716 / -

Targeted
DQN     P      T FGSM: Opt / Fail / NT / PS     T C&W: Opt / Fail / NT / PS
TT      0.1    248 / 248 / 146 / 230            26 / 26 / 5 / 26
TT      0.5    123 / 123 / 65 / 122             131 / 131 / 25 / 127
TT      1.0    28 / 28 / 9 / 27                 249 / 249 / 70 / 243
Basic   0.01   337 / 6 / 4 / -                  327 / 294 / 89 / -
Basic   0.1    3148 / 191 / 98 / -              3135 / 2915 / 903 / -
Basic   0.5    15905 / 4666 / 1581 / -          15882 / 15291 / 4837 / -
Basic   1.0    31779 / 16000 / 5334 / -         31779 / 30779 / 9953 / -

6 Passive Test-Time Attacks and Policy Imitation

An adversary performs a passive test-time attack when they observe the target policy's rollout trajectories through some interception mechanism such as a MITM. An adversary may then use learning methods such as Imitation Learning (IL) to leverage these demonstrations for training, given an appropriate environment. We make two strong assumptions to investigate an adversarial approach to IL, which requires perfect adversarial information:

(1) Access to an identical MDP to the one that produced the target policy.

(2) Ability to observe complete trajectories.
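As a simple illustration of the demonstration-gathering step implied by these assumptions, the sketch below rolls out a hypothetical intercepted target policy in a Gym-style environment and records complete state-action trajectories; names such as env and target_policy are stand-ins rather than components of our actual setup.

```python
# Illustrative sketch: collecting complete expert trajectories from an
# observed (intercepted) target policy in a Gym-style environment.
# `env` and `target_policy` are hypothetical stand-ins.

def collect_demonstrations(env, target_policy, n_episodes=10):
    """Return a list of trajectories, each a list of (obs, action, reward, done)."""
    demonstrations = []
    for _ in range(n_episodes):
        trajectory = []
        obs = env.reset()
        done = False
        while not done:
            action = target_policy(obs)              # the observed expert action
            next_obs, reward, done, _ = env.step(action)
            trajectory.append((obs, action, reward, done))
            obs = next_obs
        demonstrations.append(trajectory)
    return demonstrations
```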
Figure 3: Non-Targeted Attacks on the Basic DQN: (a) Non-Targeted FGSM Attack, (b) Non-Targeted C&W Attack (return vs. timesteps for P ∈ {0.01, 0.1, 0.5, 1.0} against no perturbation).

Figure 4: Targeted Attacks on the Basic DQN: (a) Targeted FGSM Attack, (b) Targeted C&W Attack (return vs. timesteps for P ∈ {0.01, 0.1, 0.5, 1.0} against no perturbation).

We use Deep Q-Learning from Demonstration (DQfD) as our IL method. There are two adversarial objectives for imitated agents: the first is policy imitation, and the second is profitability relative to training cost. We define policy imitation for this paper as the training of a policy π' whose objective is to mimic a target policy's observable behavior. Policy imitation can yield additional adversarial knowledge, providing an adversary with a way to perform further whitebox attacks based on attack transferability. Policy imitation can possibly lead to performance similar to that of the target agent. Depending on the adversary's objective, policy imitation can be feasible. The second objective refers to having the imitated policy converge sooner to an optimal policy, which may imply a smaller adversarial budget than the target agent's training budget.

6.1 Imitation Learning & Deep Q-Learning from Demonstration

IL is a learning framework for imitating an expert policy through demonstrations. There are two objectives in IL: to imitate the behavior exhibited in the demonstrations, or to learn an underlying task from the demonstrations. Agents that follow the first objective are often modeled as naive supervised learners known as behavioral clones (BC). The second objective is also referred to as Apprenticeship Learning. Learning frameworks like Inverse RL and RL are often used for this objective, but there are other methods that use supervised learning architectures.

DQfD (Hester et al. [2017]) has a few components: a pretraining phase, a cost function, and prioritized demonstration sampling. We use the default parameters set by the authors. The pretraining phase is training prior to interaction with the environment. Sampling of expert demonstrations is prioritized. The cost function is the sum of four loss functions: a Temporal Difference (TD) 1-step double DQN loss J_DQ(Q), an n-step double DQN loss J_n(Q), a large-margin supervised loss J_E(Q), and an L2 regularization term J_L2(Q). The combined loss is as follows, where λ1, λ2, λ3 are scalars (Hester et al. [2017]):

    J(Q) = J_DQ(Q) + λ1 J_n(Q) + λ2 J_E(Q) + λ3 J_L2(Q)
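The sketch below illustrates how this combined loss could be assembled, including the large-margin supervised term described by Hester et al. [2017]; it assumes PyTorch, a hypothetical q_net, and placeholder λ weights rather than the values used in our experiments.

```python
# Illustrative sketch of the DQfD combined loss and its large-margin
# supervised term. Assumes PyTorch; `q_net` is a hypothetical Q-network and
# the lambda weights are placeholders, not the values used in our experiments.
import torch

def large_margin_loss(q_values, expert_actions, margin=0.8):
    """J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), with margin l."""
    margins = torch.full_like(q_values, margin)
    margins.scatter_(1, expert_actions.unsqueeze(1), 0.0)  # zero margin at a_E
    max_term = (q_values + margins).max(dim=1).values
    expert_q = q_values.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return (max_term - expert_q).mean()

def dqfd_loss(j_dq, j_n, q_values, expert_actions, q_net,
              lam1=1.0, lam2=1.0, lam3=1e-5):
    """J(Q) = J_DQ + lam1*J_n + lam2*J_E + lam3*J_L2 (weights are placeholders)."""
    j_e = large_margin_loss(q_values, expert_actions)
    j_l2 = sum((p ** 2).sum() for p in q_net.parameters())
    return j_dq + lam1 * j_n + lam2 * j_e + lam3 * j_l2
```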
Figure 5: Reward Differences between the Control Total Reward and Non-Targeted Attacks on TensorTrade's DQN Total Reward: (a) Non-Targeted FGSM Attack, (b) Non-Targeted C&W Attack (total reward difference (TRD) vs. timesteps for P ∈ {0.1, 0.5, 1.0}).

Figure 6: Reward Difference between the Control Total Reward and Targeted Attacks on TensorTrade's DQN Total Reward: (a) Targeted FGSM Attack, (b) Targeted C&W Attack.

Figure 7: Net-Worth Differences between the Control Total Net-Worth and Non-Targeted Attacks on TensorTrade's DQN Total Net-Worth: (a) Non-Targeted FGSM Attack, (b) Non-Targeted C&W Attack (total net-worth difference (TNWD) vs. timesteps for P ∈ {0.1, 0.5, 1.0}).

Figure 8: Net-Worth Difference between the Control Total Net-Worth and Targeted Attacks on TensorTrade's DQN Net-Worth: (a) Targeted FGSM Attack, (b) Targeted C&W Attack.

6.2 Perfect Demonstrations & Imperfect Demonstrations

Perfect demonstrations refer to observing complete trajectories that initiate from a start state s_0 and terminate at a state s_T. DQfD's pretraining phase uses perfect demonstrations, but we may also be interested in imperfect demonstrations.

Once again, we leave the exact implementation details to the full paper; however, we have tested various amounts of demonstrations over various quantities of timesteps. We included a behavioral clone as an alternative method to DQfD. We fixed the number of iterations in the pretraining phase and the quantity of episodes based on a parameter search. For policy behavior evaluation during both the randomized evaluation and the training evaluation, we include Table 5. As expected, the supervised learner (S.L.) is capable of imitating the training trajectory, but not necessarily the testing trajectories.
Additionally, differences in the underlying tasks, such as Q-value regression vs. maximum likelihood estimation, contribute to what can be expected from policy imitation. We provide Figure 9(b) as a net-worth comparison among agents.

Table 5: DQfD Agents' Policy Action Match (Exact) or Action Type (A. Type) Match with the Target Agent

                 Training Evaluation            Randomized Evaluation
Agent       Exact   A. Type   Length       Exact   A. Type   Length
250         19      207       250          207     1797      2490
300         27      162       300          146     1305      2490
350         26      61        350          233     251       2490
400         27      322       400          207     1799      2490
450         32      353       450          220     1772      2490
500         31      399       400          188     1819      2490
B.C.        345     349       350          341     1431      2490

Table 6: Successful (Succ.) Transferred FGSM Non-Targeted Attacks (P denotes pretraining-phase agents)

                 Randomized Evaluation
Agent       Succ. Attack   Succ. Transfer   No Attack Needed
250         7              1                10
300         166            44               253
350         15             5                17
400         2              1                10
450         10             2                11
500         88             24               75
250 (P)     422            166              427
300 (P)     195            70               245
350 (P)     169            57               216
400 (P)     938            500              702
450 (P)     230            103              291
500 (P)     228            98               260
B.C.        1545           979              544

Imperfect demonstrations refer to a non-continuous set of demonstrations. One option is to choose another IL method that does not require the perfect-demonstration assumption; however, we choose to estimate the missing demonstrations. The full paper presents further work on this and its limitations; nevertheless, competitive performance is possible, as seen in the boxplot of the average agents' net-worth in Figure 9(a).

6.3 Transferability to Target Agent

We present Table 6, which contains the counts of successful transferred non-targeted FGSM attacks from our imitated agents to the target agent. We consider an attack successful at a timestep t if the imitated agent changes to any action other than the optimal action a, and a transfer successful if showing the same observation to the target agent also results in a change of action. We note that all pretraining-phase policies were more susceptible to our FGSM attacks and had higher successful transfer rates to the target agent. For the purpose of attacking the target agent, it is plausible that the pretraining-phase agents and the B.C. agent are sufficient.

Figure 9: Imperfect Demonstration Net-Worth Comparison (left) & Perfect Demonstration Net-Worth Comparison (right): (a) average net-worth gain over 10 randomized starts for agents trained on percentages of the demonstrations (10%, 50%, 100%) alongside the target and supervised learning agents; (b) average net-worth gain over 10 randomized starts for agents trained on various numbers of timesteps (250-500) alongside the target and supervised learning agents.

7 Conclusion

We investigated the vulnerability of DRL trading agents to adversarial attacks at inference time. We identified the attack surface and vectors of algorithmic trading policies in a novel threat model, and proposed two active test-time attack techniques, namely non-targeted DoS-based delay induction and targeted/non-targeted MITM-based adversarial perturbation. We demonstrated the susceptibility of a benchmark DRL trading agent and of an agent based on TensorTrade, a popular open-source framework for algorithmic trading. Through perturbation of a single tuple of the history window, we showed that adversarial intervention can easily result in sub-optimal test-time performance.
The results demonstrate that our target agents are sensitive even to weak attacks such as FGSM, as well as to more powerful attacks like C&W, which produce adversarial samples capable of fooling humans. Furthermore, portfolios tied to the agent may be impacted in ways that are not directly reflected in the test-time performance metric, namely total reward. With TensorTrade's DQN, our attacks were shown to adversely affect the agent's net-worth. This finding may have significant repercussions for risk mitigation, as test-time performance measured through total reward may not alert human traders to the severity of the impact on external securities tied to the agent.

We then examined Imitation Learning methods for adversarial use through a passive test-time attack. We considered adversarial objectives such as policy imitation, self-gain, and knowledge gain for whitebox attacks. We showed that imitated agents can perform competitively with, or better than, their target for equivalent or lower computational expense. There are limitations when using supervised learners, but they may still be useful enough to an adversary for whitebox attack optimization. Using our imitated agents, we demonstrated the transferability of adversarial attacks to our target agent.

The reported findings establish the need for further research on various aspects of security in DRL trading agents, such as metrics and measurement techniques for benchmarking the resilience and robustness of trading policies against adversarial attacks. Furthermore, our results call for further studies on mitigation and defense techniques against adversarial manipulation. Such studies are likely to find current risk-aware DRL approaches of limited utility in this domain, as those techniques typically address accidental (i.e., non-adversarial) noise in the dynamics of the environment. Lastly, considering the significance of R&D efforts in developing and acquiring proprietary algorithmic trading policies, there remains a critical need to study the impact of the policy imitation attacks highlighted by Behzadan and Hsu [2019] on algorithmic trading.

References

Yandex N.V.: YNDX - stock price, live quote, historical chart. URL https://tradingeconomics.com/yndx:rm.

V. Behzadan. Security of deep reinforcement learning. PhD thesis, Kansas State University, 2019.

V. Behzadan and W. Hsu. Adversarial exploitation of policy imitation. arXiv preprint arXiv:1906.01121, 2019.

V. Behzadan and A. Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 262–275. Springer, 2017a.

V. Behzadan and A. Munir. Whatever does not kill deep reinforcement learning, makes it stronger. arXiv preprint arXiv:1712.09344, 2017b.

V. Behzadan and A. Munir. The faults in our pi stars: Security issues and open challenges in deep reinforcement learning. arXiv preprint arXiv:1810.10369, 2018.

V. Behzadan and A. Munir. Adversarial reinforcement learning framework for benchmarking collision avoidance mechanisms in autonomous vehicles. IEEE Intelligent Transportation Systems Magazine, 2019.

N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.

Á. Cartea, S. Jaimungal, and J. Penalva. Algorithmic and high-frequency trading. Cambridge University Press, 2015.
G. Clark, M. Doran, and W. Glisson. A malicious attack on the machine learning policy of a robotic system. In 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications / 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), pages 516–521. IEEE, 2018.

Z. Ding and H. Dong. Challenges of reinforcement learning. In Deep Reinforcement Learning, pages 249–272. Springer Singapore, 2020. doi: 10.1007/978-981-15-4095-0_7.