<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>TAagregnett</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Adversarial Attacks on Deep Algorithmic Trading Policies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nancirose Piazza</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yaser Faghan</string-name>
          <email>yaser.kord@yahoo.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vahid Behzadan</string-name>
          <email>vbehzadan@newhaven.edu</email>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ali Fathi</string-name>
          <email>ali.fathi@rbc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Enterprise Model Risk Management Group, Royal Bank of Canada</institution>
          ,
          <addr-line>RBC</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Enterprise Model Risk Management Group, Royal Bank of Canada</institution>
          ,
          <addr-line>RBC</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Instituto Superior de Economia e Gestão and CEMAPRE, Universidade de Lisboa</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Instituto Superior de Economia e Gestão and CEMAPRE, Universidade de Lisboa</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Secure and Assured Intelligent Learning (SAIL) Lab, University of New Haven</institution>
          ,
          <addr-line>New Haven</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Secure and Assured Intelligent Learning (SAIL) Lab, University of New Haven</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Secure and Assured Intelligent Learning (SAIL) Lab, University of New Haven</institution>
          ,
          <addr-line>New Haven</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>250</volume>
      <issue>300</issue>
      <abstract>
        <p>Deep Reinforcement Learning (DRL) has become an appealing solution to algorithmic trading, such as high-frequency trading of stocks and cryptocurrencies. However, DRL policies have been shown to be susceptible to adversarial attacks. It follows that algorithmic trading DRL agents may also be compromised by such adversarial techniques, leading to policy manipulation. In this paper, we develop a threat model for deep trading policies and propose two active attack techniques for manipulating the performance of such policies at test-time. Additionally, we explore the exploitation of a passive attack based on adversarial policy imitation. Furthermore, we demonstrate the effectiveness of the proposed attacks against benchmark and real-world DQN trading agents.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Proceedings of the Conference on Applied Machine Learning for Information Security, 2021</p>
      <p>… systems is yet to be explored. Recent developments in the domain of adversarial machine
learning have drawn attention to the vulnerability of machine learning models to adversarial
attacks, as discussed by Papernot et al. [2018]. Instances of such attacks include adversarial
examples, such as those produced by the Fast Gradient Sign Method of Goodfellow et al. [2014],
which are strategically induced perturbations of the input vectors that are not easily
detectable by human observers.</p>
      <p>
        Adversarial attacks can impact all deep learning and classical machine learning models,
including DRL agents, as investigated by Behzadan and Munir [2018]. Recent work by Behzadan
and Munir [2017a, 2018] and Behzadan [2019] establishes that DRL algorithms are
vulnerable to adversarial actions at both the training and inference phases of their deployment. This
finding has been further verified in settings such as video games (Huang et al. [2017]), robotics
        <xref ref-type="bibr" rid="ref9">(Clark et al. [2018])</xref>
        , autonomous navigation
        <xref ref-type="bibr" rid="ref1 ref2 ref6">(Behzadan and Munir [2019])</xref>
        , and cybersecurity
(Han et al. [2018]). However, the extent, severity, and dynamics of such vulnerabilities in DRL
trading agents have not yet been addressed.
      </p>
      <p>Adversarial perturbations of DRL trading policies are also significant from the financial
Model Risk Management (MRM) point of view (Reserve [2011], Office of the Superintendent of
Financial Institutions [OSFI], Morini [2011]), since the existence of such vulnerabilities can be
traced back to the algorithmic underpinnings of these systems. However, principal differences
between traditional financial models and algorithmic trading systems pose additional challenges
for quantifying the resulting model risk. For instance, the number of model components involved
in an algorithmic trading system can be large, and hence the fusion of otherwise individually
negligible residual model risks may result in significant system errors. Furthermore, DRL-based
algorithms are adaptive: their model components are re-calibrated (e.g.,
through retraining) on a low-latency schedule. It should also be noted that, unlike other
areas of quantitative modelling in finance (such as asset pricing or credit risk), benchmarking of
model components in algorithmic systems is difficult due to competition considerations, as there
may be restrictions on conducting open-box validation of proprietary models within a firm.</p>
      <p>In this paper, we investigate test-time adversarial attacks against DRL trading agents. The
main contributions are:</p>
      <list list-type="bullet">
        <list-item><p>We present a threat model for DRL trading policies, identifying susceptible attack surfaces
and practical attack vectors at test-time.</p></list-item>
        <list-item><p>We establish the vulnerability of current DRL trading policies to adversarial manipulation
for active test-time attacks.</p></list-item>
        <list-item><p>We explore Imitation Learning for adversarial purposes after acquisition of expert
demonstrations, both perfect and imperfect, from a passive test-time attack for policy imitation.</p></list-item>
        <list-item><p>We investigate the transferability of our perturbation attacks from the imitated agents to
the target agent.</p></list-item>
        <list-item><p>We demonstrate the efficacy of the proposed attack vectors in manipulating DRL trading
agents.</p></list-item>
      </list>
      <p>The remainder of the paper is organized as follows: Section 2 presents an overview of reinforcement
learning and a review of the security issues in electronic trading platforms. Section 3 proposes
a threat model for DRL trading agents, outlining various attack surfaces and vectors that
can be exploited by an adversary. Section 4 provides the details of our experimental setup for
investigating the proposed attack mechanisms, the results of which are presented in Sections 5
and 6. The paper concludes in Section 7 with a summary of our findings, as well as a discussion
of future directions of research on the security of deep trading policies.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <sec id="sec-2-1">
        <title>Reinforcement Learning, Value Iteration &amp; Deep Q-Learning</title>
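        <p>Reinforcement learning (RL) is concerned with agents that interact with an environment and
exploit their experiences to optimize a sequential decision-making policy. RL can be formally
modeled as learning to control a Markov Decision Process (MDP) <inline-formula><tex-math>M = (S, A, R, P)</tex-math></inline-formula>, where S
is the set of reachable states in the process, A is the set of available actions, R is the mapping
of transitions to the immediate reward, and P represents the transition probabilities (i.e., state
dynamics), which are initially unknown to RL agents. At any given time-step t, the agent is at
a state <inline-formula><tex-math>s_t \in S</tex-math></inline-formula>, chooses an action <inline-formula><tex-math>a_t \in A</tex-math></inline-formula>, transitions from <inline-formula><tex-math>s_t</tex-math></inline-formula> to a state <inline-formula><tex-math>s_{t+1}</tex-math></inline-formula> according to the transition
probability <inline-formula><tex-math>P(s_{t+1} \mid s_t, a_t)</tex-math></inline-formula>, and receives a reward <inline-formula><tex-math>r_{t+1} = R(s_t, a_t, s_{t+1})</tex-math></inline-formula>. The solution to an MDP
problem is a policy <inline-formula><tex-math>\pi(s)</tex-math></inline-formula> that is a mapping from states to actions. The goal of RL is to learn
a policy that maximizes the expected discounted return <inline-formula><tex-math>E[R_t]</tex-math></inline-formula>, where <inline-formula><tex-math>R_t = \sum_{k=0}^{N} \gamma^{k} r_{t+k+1}</tex-math></inline-formula>, with <inline-formula><tex-math>r_t</tex-math></inline-formula>
denoting the instantaneous reward received at time t and <inline-formula><tex-math>\gamma \in [0, 1]</tex-math></inline-formula> a discount factor. The
value of a state <inline-formula><tex-math>s_t</tex-math></inline-formula> is defined as the expected discounted return from <inline-formula><tex-math>s_t</tex-math></inline-formula> following a policy <inline-formula><tex-math>\pi</tex-math></inline-formula>,
that is, <inline-formula><tex-math>V^{\pi}(s_t) = E[R_t \mid s_t, \pi]</tex-math></inline-formula>. The state-action value (Q-value) <inline-formula><tex-math>Q^{\pi}(s_t, a_t) = E[R_t \mid s_t, a_t, \pi]</tex-math></inline-formula> is the
value of state <inline-formula><tex-math>s_t</tex-math></inline-formula> after applying action <inline-formula><tex-math>a_t</tex-math></inline-formula> and following a policy <inline-formula><tex-math>\pi</tex-math></inline-formula> thereafter.</p>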
        <p>The solution approaches to RL include value iteration algorithms that optimize a value
function (e.g., V(·) or Q(·, ·)) and extract the optimal policy from it. As an instance of value
iteration algorithms, Q-Learning aims to maximize the action-value function Q through the
iterative formulation of Eq. (1):</p>
        <p>
          <disp-formula id="eq-1">
            <label>(1)</label>
            <tex-math>Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')</tex-math>
          </disp-formula>
        </p>
        <p>where s′ is the state that emerges as a result of action a, and a′ is a possible action in state s′.
The optimal Q-value is defined as <inline-formula><tex-math>Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)</tex-math></inline-formula>, and the optimal
policy is given by <inline-formula><tex-math>\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)</tex-math></inline-formula>.</p>
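        <p>As a minimal sketch of the iterative formulation in Eq. (1), the following example applies the update repeatedly to a small deterministic chain MDP. The MDP itself (three states, two actions, the reward values, and the discount factor of 0.9) is hypothetical and chosen only for illustration; it is not one of the trading environments studied in this paper.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np

# Hypothetical deterministic chain MDP (illustrative only, not from the paper):
# states 0, 1, 2; state 2 is terminal.
# Action 0 stays in place (reward 0); action 1 moves right
# (reward 1 when the move reaches the terminal state, else 0).
N_STATES, N_ACTIONS, GAMMA = 3, 2, 0.9
TERMINAL = 2


def step(s, a):
    """Return (next_state, reward) for the toy chain."""
    if a == 0:
        return s, 0.0
    s_next = min(s + 1, TERMINAL)
    return s_next, 1.0 if s_next == TERMINAL else 0.0


def q_value_iteration(n_sweeps=100):
    """Repeatedly apply Q(s, a) = R(s, a) + gamma * max_a' Q(s', a')."""
    q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(n_sweeps):
        for s in range(N_STATES):
            if s == TERMINAL:
                continue  # no future value is collected from the terminal state
            for a in range(N_ACTIONS):
                s_next, r = step(s, a)
                q[s, a] = r + GAMMA * q[s_next].max()
    return q


q = q_value_iteration()
greedy_policy = q.argmax(axis=1)  # greedy action in each state
```
]]></preformat>
        <p>In this toy problem the greedy policy moves right in both non-terminal states, since the discounted value of reaching the terminal reward exceeds the value of staying in place.</p>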
        <p>The Q-learning method estimates the optimal action policies by using the Bellman
formulation to iteratively reduce the TD-Error, given by <inline-formula><tex-math>Q_{i+1}(s, a) - E[r + \gamma \max_{a'} Q_i(s', a')]</tex-math></inline-formula>, for the iterative
update of a value iteration technique. Practical implementation of Q-learning is commonly
based on function approximation of the parametrized Q-function <inline-formula><tex-math>Q(s, a; \theta) \approx Q^{*}(s, a)</tex-math></inline-formula>. A
common technique for approximating the parametrized non-linear Q-function is via neural network
models whose weights correspond to the parameter vector <inline-formula><tex-math>\theta</tex-math></inline-formula>. Such neural networks, commonly
referred to as Q-networks, are trained such that at every iteration i, the following loss function
is minimized:</p>
        <p>
          <disp-formula id="eq-2">
            <label>(2)</label>
            <tex-math>L_i(\theta_i) = E_{s, a \sim \rho(\cdot)}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right]</tex-math>
          </disp-formula>
        </p>
        <p>where <inline-formula><tex-math>y_i = E[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a]</tex-math></inline-formula>, and <inline-formula><tex-math>\rho(s, a)</tex-math></inline-formula> is a probability distribution over
states s and actions a. This optimization problem is typically solved using computationally
efficient techniques such as Stochastic Gradient Descent (SGD).</p>
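        <p>The Q-network loss above can be estimated on a sampled batch of transitions as sketched below. For brevity, the sketch assumes a linear Q-function in place of a neural network, and the function names and the handling of terminal transitions through a <monospace>dones</monospace> mask are assumptions of this example rather than details taken from the paper.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def q_values(theta, states):
    """Linear Q-function (an assumption for brevity): one column of theta per action."""
    return states @ theta  # shape: (batch, n_actions)


def dqn_loss(theta, theta_prev, states, actions, rewards, next_states, dones, gamma=0.99):
    """Batch estimate of the squared TD loss L_i(theta_i)."""
    # Target y_i = r + gamma * max_a' Q(s', a'; theta_{i-1});
    # the bootstrap term is dropped for terminal transitions (dones == 1).
    next_max = q_values(theta_prev, next_states).max(axis=1)
    targets = rewards + gamma * (1.0 - dones) * next_max
    # Q(s, a; theta_i) for the actions that were actually taken.
    predicted = q_values(theta, states)[np.arange(len(actions)), actions]
    return float(np.mean((targets - predicted) ** 2))
```
]]></preformat>
        <p>Holding the previous parameters fixed while computing the targets mirrors the role of the target network discussed in the next paragraph.</p>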
        <p>A Deep Q-Network (DQN), introduced by Mnih et al. [2015], is a training algorithm and implementation
of Q-value estimation by a neural network function approximator. Techniques such as experience
replay and a target network are used in a DQN to stabilize the training process and maintain
the i.i.d. (Independent and Identically Distributed) property of the data. Mnih et al. [2015]
demonstrate the application of this Q-network technique to end-to-end learning of Q-values
in playing Atari games based on observations of pixel values in the game environment.</p>
      </sec>
      <sec id="sec-2-2">
        <title>State of Security in Algorithmic Trading</title>
        <p>In recent years, electronic trading platforms have made access to global capital markets easier
for the public, resulting in a lower barrier to entry and an influx of traffic across these platforms.
The growing interest in such trading platforms and technologies is, however, accompanied by the
increasing risks of cyber attacks. While the literature on the cybersecurity issues of current
trading platforms is scarce, a few industry-sponsored studies report concerning issues in deployed
trading platforms. One such study on the exposure of security flaws in trading technologies
by Hernandez [2018] evaluates various popular desktop, mobile and web trading service
platforms against a standard list of security checks, and reports that these trading technologies
are in general far more susceptible to cyber attacks than previously-reviewed personal banking
applications from 2013 and 2015. The security checks consisted of features such as 2-Factor
Authentication (2FA), encrypted communications, privacy mode, anti-reverse engineering, and
hard-coded secrets. This study reports that 64% of the reviewed trading applications rely on
unencrypted communication channels for authentication and trading data. Also, the author finds
that many trading applications utilize poor session management and SSL certificate validation,
thereby enabling Man-in-The-Middle (MITM) attacks. Furthermore, this report points out the
wide-scale susceptibility of such platforms to remote Denial of Service (DoS) attacks, which may
render the applications useless. Building on the findings of this study, our paper investigates
attacks that leverage the aforementioned vulnerabilities to manipulate deep algorithmic trading
policies.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Threat Model of DRL Trading Agents</title>
      <p>Adversarial attacks against DRL policies aim to compromise one or more aspects of the
Confidentiality, Integrity, and Availability (CIA) triad in the targeted agents (Behzadan and Munir
[2018]). More specifically, the Confidentiality of a DRL agent refers to the need for confidentiality
of an agent’s parameters, such as the policy or reward function. The Integrity of a DRL agent
relies on the policy behaving as intended by the user. Availability refers to the agent’s capability
to execute its task when needed. At a high level, the threat landscape of DRL agents can be
captured in terms of the Attack Surface and Attack Model of the agent, as described by Behzadan [2019] and
outlined below.</p>
      <sec id="sec-3-1">
        <title>Attack Surface and Vectors</title>
        <p>Adversarial attacks may target all components of a DRL agent, including the environment, the
agent’s observation channel, reward channel, actuators, and training components (e.g.,
experience storage and selection), as identified by Behzadan [2019].</p>
        <p>Figure 1 illustrates the components of a DRL trading agent at test-time. In the context of
algorithmic trading, the observation of the environment is gathered from various sources such
as market indicators, social media indicators, and exchanges; we refer to these sources as input
channels. This data is preprocessed and feature-engineered to create the agent’s observation
of the state. These states are part of the observation returned by the environment to the
agent along with the reward. Through the observation channel, an adversary may intercept the
observation and exchange it for a perturbed observation, which is known as a Man-In-The-Middle
(MITM) attack. An adversary may also impose a delay on the observation channel through a
Denial of Service (DoS) attack. Ding and Dong [2020] show that slight perturbations of the observation
state impact DRL agent performance. The reward channel is often
tied to internal securities such as bank accounts or portfolios, and hence is less susceptible to
external adversarial manipulation. However, any external component reachable by the agent
can be compromised implicitly.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Attack Model</title>
        <p>The capabilities of an adversary are defined by two factors: the actions available to the adversary
and the information available about the target. This section presents a classification of attacks
and adversaries at the inference phase based on these factors. According to the
available information, attacks are classified as whitebox or blackbox. Whitebox refers to the case in which
the adversary has sufficient knowledge of the target’s parameters to directly craft an effective
perturbation, and blackbox refers to the opposite case, in which the adversary lacks such knowledge.</p>
        <p>Figure 1 (diagram labels): Input Channels (Exchange Data, Market Indicators, Social Media Indicators); State; Delay through Denial of Service (DoS); Man-In-The-Middle (MITM) Perturbations; Reinforcement Learning Agent; External Securities (e.g., Bank); Reward Channel; Actuator Channel; More Difficult to Access.</p>
        <p>Perturbations of observations can affect an agent at both training time and test time. While this paper focuses on
test-time attacks, it is noteworthy that during training, additional error is bootstrapped,
potentially impacting learned policies. Behzadan and Munir [2017b] show that training-time
attacks with sufficiently high perturbation rates can, under certain conditions, leave the agent
unable to recover its performance at test-time evaluation under non-adversarial conditions.</p>
        <sec id="sec-3-2-1">
          <title>Test-Time Attacks</title>
          <p>
            Test-time or inference-time attacks may be active or passive. Active attacks require adversarial
intervention to manipulate the DRL policy. Instances of such attacks include adversarial
examples
            <xref ref-type="bibr" rid="ref2 ref3 ref4 ref6 ref7">(Goodfellow et al. [2014],Carlini and Wagner [2017],Su et al. [2019])</xref>
            and delay induction
in observations. Passive attacks gather information about the target agent by observing the
target’s behavior in various states. With sufficient observations of state-action pairs, the
adversary can reconstruct the targeted policy and compromise the Confidentiality of the targeted,
proprietary agents (Behzadan and Hsu [2019]).
          </p>
          <p>Active attacks can be classified as targeted or non-targeted. Non-targeted
attacks aim to have the policy select any action other than the one it would otherwise prescribe,
by replacing the true observation with a perturbed observation. Targeted attacks
craft perturbations such that the target selects a particular sub-optimal action a′.</p>
          <p>In the category of passive attacks, Imitation Learning and Inverse Reinforcement Learning
are avenues an adversary may exploit to either attack their target agent or steal components of
the agent such as its policy. As demonstrated in work by Behzadan and Hsu [2019], adversaries
can gather additional information through policy imitation, thereby enabling whitebox attacks
against blackbox targets.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Training-Time Attacks</title>
          <p>Training-time attacks, also referred to as data poisoning attacks, impair an agent’s
capability to learn optimally. In such attacks, the adversary manipulates the distribution of
the training data by injecting false samples, mislabeled samples, or overrepresented samples
(Goldblum et al. [2021]). Though typically studied in supervised
and unsupervised learning tasks, data poisoning can also apply to DRL, as demonstrated by
Behzadan and Munir [2017b].</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Setup</title>
      <p>We demonstrate the proposed attacks on two trading agents based on DQN policies of varying
complexity. The first, which we refer to as the basic DQN, uses a simple OpenAI Gym<sup>1</sup> environment to
emulate trading. The second is based on an open-source framework called TensorTrade<sup>2</sup>, which
leverages a more realistic OpenAI Gym environment mimicking real-world trading settings. Our
basic DQN represents less complex agents, while TensorTrade’s DQN demonstrates the
real-world impact of such attacks on agents that have external components, such as a portfolio, tied to them.
In fact, TensorTrade is currently used and deployed for actual DRL-based trading in online
cryptocurrency and stock exchanges.</p>
      <p>There are general choices for the components of the MDP M. The state space may contain a
subset of four common prices: open, high, low, and close. Technical indicators, which are
other measurements traders use to assess a stock, may also be used in the state space. The duration of a
timestep can be any interval (e.g., milliseconds, minutes, or hours), and each interval is called a bar.
The action space may include buy/sell/hold quantities, which can be continuous or discrete.
Environments implement a commission fee upon changing position (buy/sell). The reward
function can be profit/loss or a more detailed metric such as the Sharpe ratio. Training is
usually performed on historical data.</p>
      <sec id="sec-4-1">
        <title>Basic Trading Environment</title>
        <p>In the basic trading environment, the historical data is sourced from Yandex N.V. (YNDX) (yan)
for the period 2015-2016. The dataset is comprised of samples at a one-minute
temporal resolution, and the dynamics of the price during each minute are captured by four values:
open price, high price, low price, and close price. Our agent can only hold, sell, or buy a single
stock. Table 1 details the specifications of the Basic Stock Environment. Table 2 contains the
hyperparameters of the DQN agent trained in this environment.</p>
      </sec>
      <sec id="sec-4-2">
        <title>TensorTrade Environment</title>
        <p>The TensorTrade environment (TT) can implement a portfolio that holds wallets of various coins
or currencies. The data used for this setup is included with TT as a demonstration of training.
This dataset starts at the beginning of 2020 and contains the open, high, low, and close prices and the volume
at hourly intervals. The dataset features also include technical indicators such as the Relative Strength
Index (RSI) and Moving Average Convergence Divergence (MACD), as well as the log return <inline-formula><tex-math>\log(C_t) - \log(C_{t-1})</tex-math></inline-formula>,
where <inline-formula><tex-math>C_t</tex-math></inline-formula> is the closing price at timestep t. Our portfolio starts with
10,000 USD and 10 BTC. We use the risk-adjusted reward scheme and manage-risk action scheme
provided by TT. The risk-adjusted reward scheme uses the Sharpe Ratio, which is defined by the
equation below:</p>
        <p>
          <disp-formula>
            <tex-math>S_a = \frac{E[R_a - R_b]}{\sigma_a}</tex-math>
          </disp-formula>
        </p>
        <p>where <inline-formula><tex-math>R_a</tex-math></inline-formula> is the asset return, <inline-formula><tex-math>R_b</tex-math></inline-formula> is the risk-free return, and <inline-formula><tex-math>\sigma_a</tex-math></inline-formula> is the standard deviation of
the asset excess return. The manage-risk action scheme scales the action space depending on
provided arguments such as trade size, stop, and take. The default trade size is 10, which implies
a list of 10 uniformly spaced trade sizes. For instance, a trade size of 3
implies that 33.3%, 66.6%, and 99.9% of the balance can be traded. Take is a list of possible take-profit
percentages for an order, and stop is a list of possible stop-loss percentages for an
order. The action space is the resulting product of take, stop, trade size, and action type, which
is buy or sell. There is one additional action: wait/hold. In our case, we have an action space
size of 181. This information and the training hyperparameters are summarized in Table 1
and Table 2, respectively. There are other, simpler reward (e.g., SimpleProfit) and action (e.g.,
Buy Sell Hold, BSH) schemes available with TT.</p>
        <p>1 OpenAI Gym, (2016), GitHub repository, https://github.com/openai/gym</p>
        <p>2 TensorTrade, (2019), GitHub repository, https://github.com/tensortrade-org/tensortrade</p>
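        <p>The Sharpe ratio and the size of the manage-risk action space described in this section can be checked with a short sketch. The function names are hypothetical and do not correspond to the TensorTrade API. The choice of three stop and three take percentages is an assumption: it is one configuration that, with 10 trade sizes and two action types plus wait/hold, reproduces the reported action space size of 181.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from itertools import product

import numpy as np


def sharpe_ratio(asset_returns, risk_free_return=0.0):
    """S_a = E[R_a - R_b] / sigma_a, with sigma_a the std of the excess return."""
    excess = np.asarray(asset_returns, dtype=float) - risk_free_return
    return float(excess.mean() / excess.std())  # undefined if all returns are equal


def manage_risk_action_count(n_stop, n_take, n_trade_sizes, n_action_types=2):
    """Count every (stop, take, trade size, buy/sell) combination plus wait/hold."""
    combinations = product(
        range(n_stop), range(n_take), range(n_trade_sizes), range(n_action_types)
    )
    return sum(1 for _ in combinations) + 1  # +1 for the wait/hold action


# Assumed configuration: 3 stop levels, 3 take levels, 10 trade sizes.
action_space_size = manage_risk_action_count(n_stop=3, n_take=3, n_trade_sizes=10)
```
]]></preformat>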
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Active Test-Time Attacks</title>
      <p>In this section, we investigate the impact of adversarial attacks on deep trading agents at
test-time. To preserve the realism of our study, we limit the scope of our investigation to attacks
that satisfy the following constraints: (1) Attacks are limited to manipulating the observation
channel of the target. (2) Attacks are limited to perturbations that are not immediately detected
by common human or automated anomaly detection mechanisms.</p>
      <p>We implement two different types of attack, namely non-targeted delay attacks and
non-targeted/targeted adversarial perturbation attacks. This study considers whitebox attacks only.
However, as demonstrated in Behzadan and Hsu [2019], it is also feasible to reverse-engineer
blackbox policies via imitation learning, thereby converting blackbox attacks to whitebox.</p>
      <sec id="sec-5-1">
        <title>Non-Targeted Delay Attacks</title>
        <p>We evaluate non-targeted delay attacks on the observation channel, applied to the single most
recent tuple of features in the observation window. The observation delay is 1 timestep: a
tuple of values seen at timestep t − 1 is received at timestep t. This is both practical
and representative of minimal interference. Because there is no adversarial preference for when
to implement the delay, the attack is non-targeted. In contrast, a targeted delay attack would apply the delay at an
intentionally chosen time; we did not pursue this. Results are presented in Figure 2. This type
of non-targeted attack should concern traders because it is computationally inexpensive
and because the anomalies it introduces are masked by time-series locality.</p>
        <p>To investigate the effectiveness of adversarial example attacks on DRL policies, we implemented
Fast Gradient Sign Method (FGSM) (Goodfellow et al. [2014]) and Carlini and Wagner (C&amp;W) (Carlini
and Wagner [2017]) adversarial sample attacks using L2 loss for both DQNs.</p>
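        <p>The one-timestep observation delay described in this section can be sketched as a generator that sits between the environment and the agent. This is a simplified illustration under stated assumptions: the function name is hypothetical, the whole observation is delayed rather than only the most recent tuple in the window, and until enough history exists the earliest observation is repeated.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import deque


def delay_observations(observations, delay=1):
    """Yield, at each timestep, the observation seen `delay` timesteps earlier.

    Until `delay` older observations exist, the earliest observation is
    returned instead (an assumption of this sketch).
    """
    buffer = deque(maxlen=delay + 1)  # holds the last `delay + 1` observations
    for obs in observations:
        buffer.append(obs)
        yield buffer[0]  # oldest observation still in the buffer


# Example: the agent receives the value from timestep t - 1 at timestep t.
delayed = list(delay_observations([10, 11, 12, 13]))
```
]]></preformat>
        <p>Because each delayed value is a genuine recent observation, it stays within the distribution of the true data, which is consistent with the remark above that such anomalies are masked by time-series locality.</p>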
        <p>Table 4 reports failure counts and other notable counts for the basic DQN and TT’s
DQN. In this experiment, we perturb the single most recent tuple of values in the observation space
for all FGSM and C&amp;W attacks. We define a failure for non-targeted attacks as the failure
to change the action or action type. Post-constraints are applied so that adversarial samples
fall within a realistic distribution of the true data. The C&amp;W
implementation was modified for TT because its data is not normalized. Representative samples of a perturbed
tuple of values from successful attacks are presented in Table 3. See Figure 3 for the Basic DQN
return performance comparison when under attack vs. not under attack. Additionally, we
provide the total reward difference and net-worth difference between the TT target agent and the
TT target agent under attack in Figure 5 and Figure 7, respectively. Through these results,
we establish that the test-time performance of the target policy, in terms of its total reward,
is negatively impacted by our attacks. We also show that the agent’s net-worth is
impacted, although this impact is not necessarily reflected by the total reward.</p>
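        <p>For intuition, a non-targeted FGSM step against a whitebox Q-function can be sketched as follows. The sketch makes several assumptions that differ from the experiments reported here: the Q-function is linear rather than a deep network, the attack loss is the cross-entropy between a softmax over the Q-values and the currently selected action (rather than the L2 loss used in this paper), the whole observation vector is perturbed, and no post-constraints are applied. All function names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()


def fgsm_non_targeted(weights, bias, obs, epsilon):
    """One FGSM step that pushes the policy away from its current action.

    Assumes a linear Q-function Q(obs) = weights @ obs + bias.
    Returns the perturbed observation and the original greedy action.
    """
    q = weights @ obs + bias
    action = int(q.argmax())
    probs = softmax(q)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    # Gradient of the cross-entropy loss -log softmax(Q)[action] w.r.t. obs.
    grad_obs = weights.T @ (probs - one_hot)
    # Ascend the loss: move each feature by epsilon in the sign of the gradient.
    return obs + epsilon * np.sign(grad_obs), action


def attack_failed(weights, bias, original_action, adv_obs):
    """A non-targeted attack fails if the selected action does not change."""
    return int((weights @ adv_obs + bias).argmax()) == original_action
```
]]></preformat>
        <p>The failure check mirrors the definition used for Table 4, restricted here to a change of action rather than of action type.</p>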
      </sec>
      <sec id="sec-5-2">
        <title>Targeted Perturbation Attacks</title>
        <p>Targeted attacks aim to manipulate a policy into taking an adversarial action a′t instead of
action at at a timestep t. We evaluate targeted FGSM and targeted C&amp;W attacks
using L2 loss for both DQNs, with the minimal Q-value actions as our selected adversarial actions.
However, it is worth noting that the function approximator may regress values such that the
minimal Q-value action is not the best choice for the adversary.</p>
        <p>Again, we leave the exact implementation details to the full paper. See Table 4 for failure
counts, attempts, and partial successes (PS). We define a partial success as an attack that results in an
action am whose action type is the action type of the adversarial action a′t. Our
adversarial tuples are simple, but we emphasize that adversarial attacks crafted under expensive
parameters, such as a low learning rate, a high number of iterations, and high confidence, can produce
adversarial samples that are more convincing to humans. Performance under attack for Basic DQN can be
found in Figure 4, TT’s total reward difference can be found in Figure 6, and TT’s net-worth
difference in Figure 8. We thus establish the impact of targeted attacks on the test-time
performance of TT’s DQN, as well as their significant impact on the agent’s net-worth.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Passive Test-Time Attacks and Policy Imitation</title>
      <p>An adversary performs a passive test-time attack when they observe the target policy’s rollout
trajectories through some form of interception, such as a MITM. An adversary may use learning
methods such as Imitation Learning (IL) to leverage these demonstrations for training, given an
appropriate environment. We make two strong assumptions to investigate an adversarial
approach to IL, which requires perfect adversarial information: (1) Access to an MDP identical
to the one that produced the target policy. (2) Ability to observe complete trajectories.</p>
      <p>We use Deep Q-Learning from Demonstration (DQfD) as our IL method. There are two
adversarial objectives for imitated agents: the first is policy imitation and the second is
profitability relative to training cost. We define policy imitation for this paper as the training of a
policy π′ whose objective is to mimic a target policy’s observable behavior. Policy imitation
can result in additional adversarial knowledge, providing an adversary with a way to perform more
whitebox attacks based on attack transferability. Policy imitation can also lead to
performance similar to that of the target agent. Depending on the adversary’s objective, policy imitation can
be feasible. The second objective refers to having an imitated policy converge sooner to an
optimal policy, which may imply that the adversary’s budget is smaller than the target agent’s training
budget.</p>
      <sec id="sec-6-2">
        <title>Imitation Learning &amp; Deep Q-Learning from Demonstration</title>
        <p>IL is a learning framework for imitating an expert policy through demonstrations. There are two
objectives to IL: to imitate behavior exhibited in the demonstrations or to learn an underlying
task from demonstrations. Agents that follow the first are often modeled as naive supervised
learners known as behavioral clones (BC). The second objective is also referred to as
Apprenticeship Learning. Learning frameworks like Inverse RL and RL are often used for this objective,
but there are other methods that use supervised learning architectures.</p>
        <p>[Figure: panel (a) Non-Targeted FGSM Attack; labels: Reward Difference, DQN Total Reward; legend entries: NT C&amp;W - Target TRD P=1.0, NT C&amp;W - Target TRD P=0.5.]</p>
        <p>Figure 8: Net-Worth Difference between Control Total Reward and Targeted Attacks on TensorTrade’s
DQN Net-Worth. (a) Net-Worth Difference Targeted FGSM Attack. (b) Net-Worth Difference Targeted C&amp;W Attack.</p>
        <p>Proceedings of the Conference on Applied Machine Learning for Information Security, 2021</p>
        <p>DQfD Hester et al. [2017] has a few components: the pretraining phase, the cost function,
and prioritized demonstration sampling. We use the default parameters set by the authors. The
pretraining phase is training that takes place prior to interaction with the environment. Expert
demonstrations are prioritized during sampling. The cost function is the sum of four loss functions: a
1-step Temporal Difference (TD) double DQN loss $J_{DQ}(Q)$, an n-step double DQN loss $J_n(Q)$,
a large-margin supervised loss $J_E(Q)$, and an L2 regularization loss $J_{L2}(Q)$. The full cost is as
follows, where $\lambda_1$, $\lambda_2$, $\lambda_3$ are scalars Hester et al. [2017]:</p>
        <p>$J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q)$</p>
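        <p>To make the weighted sum above concrete, the following minimal sketch combines four already-computed loss terms and evaluates a large-margin supervised term for one state. It is illustrative only. The large-margin form used here, max over actions of Q(s, a) plus a margin on non-expert actions, minus Q(s, a_E), follows our reading of Hester et al. [2017]; the margin, the weights, the Q-values, and the function names are placeholder assumptions, not the settings used in our experiments.</p>
        <preformatted>
```python
def large_margin_loss(q_values, expert_action, margin=0.8):
    """J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E).

    l(a_E, a) is 0 for the expert action and `margin` otherwise.
    """
    augmented = [
        q + (0.0 if a == expert_action else margin)
        for a, q in enumerate(q_values)
    ]
    return max(augmented) - q_values[expert_action]


def dqfd_cost(j_dq, j_n, j_e, j_l2, lam1=1.0, lam2=1.0, lam3=1e-5):
    """J(Q) = J_DQ(Q) + lam1 * J_n(Q) + lam2 * J_E(Q) + lam3 * J_L2(Q)."""
    return j_dq + lam1 * j_n + lam2 * j_e + lam3 * j_l2


# Hypothetical Q(s, a) for three actions in a single state.
q_values = [1.0, 3.0, 2.0]

# Expert action already leads every other action by at least the margin.
j_e_match = large_margin_loss(q_values, expert_action=1)

# Expert action is not the greedy one, so the supervised term is positive.
j_e_miss = large_margin_loss(q_values, expert_action=2)

# Placeholder values for the TD, n-step, and L2 terms.
total = dqfd_cost(j_dq=0.4, j_n=0.3, j_e=j_e_miss, j_l2=12.0)
print(j_e_match, j_e_miss, total)
```
        </preformatted>
        <p>The supervised term vanishes only when the expert action exceeds every other action by at least the margin, which is what pushes the pretrained Q-function toward the demonstrated behavior.</p>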
      </sec>
      <sec id="sec-6-3">
        <title>Perfect Demonstrations &amp; Imperfect Demonstrations</title>
        <p>Perfect demonstrations are complete trajectories that initiate from a start state
$s_0$ and terminate at a state $s_T$. DQfD’s pretraining phase uses perfect demonstrations, but we
may also be interested in imperfect demonstrations.</p>
        <p>Once again we leave exact implementation details to the full paper; however, we tested
various numbers of demonstrations across various numbers of timesteps. We included a
behavioral clone as an alternative method to DQfD. We fixed the number of iterations in
the pretraining phase and the number of episodes based on a parameter search. Table 5 reports
policy behavior during both the randomized evaluation and the training evaluation.
As expected, the supervised learner (S.L.) is capable of imitating the training trajectory,
but not necessarily the testing trajectories. Additionally, the difference in tasks, such as Q-value
regression vs. maximum likelihood, contributes to what can be expected of policy imitation. We provide
Figure 9(b) as a net-worth comparison among agents.</p>
        <p>We present Table 6, which contains the count of successfully transferred non-targeted FGSM attacks
from our imitated agents to the target agent. We consider an attack successful at a timestep t
if the imitated agent changes to any action other than the optimal action a, and a transfer successful if
showing the same perturbed observation to the target agent also results in a change of action. We note
that all pretraining-phase policies were more susceptible to our FGSM attacks and had more successful
transfers to the target agent. For the purpose of attacking the target agent, it is plausible that the
pretraining-phase agents and the B.C. agent are sufficient.</p>
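        <p>The counting rule described above can be sketched as follows. This is a toy illustration under stated assumptions, not our experimental code: both agents are stand-in linear Q-functions, the non-targeted FGSM step uses the cross-entropy between a softmax over Q-values and the current greedy action, and every name, weight, observation, and the value of eps is hypothetical.</p>
        <preformatted>
```python
import math


def q_values(weights, obs):
    """Linear Q-function: one weight row per action."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]


def greedy_action(weights, obs):
    q = q_values(weights, obs)
    return max(range(len(q)), key=q.__getitem__)


def fgsm_non_targeted(weights, obs, eps):
    """One FGSM step: obs + eps * sign(gradient of cross-entropy w.r.t. obs),
    using the current greedy action as the label."""
    q = q_values(weights, obs)
    best = max(range(len(q)), key=q.__getitem__)
    shift = max(q)
    exps = [math.exp(v - shift) for v in q]
    total = sum(exps)
    probs = [e / total for e in exps]
    grad = [0.0] * len(obs)
    for a, row in enumerate(weights):
        coeff = probs[a] - (1.0 if a == best else 0.0)
        for i, w in enumerate(row):
            grad[i] += coeff * w
    sign = [(g > 0) - (0 > g) for g in grad]
    return [x + eps * s for x, s in zip(obs, sign)]


def count_transfers(imitator, target, observations, eps):
    """Count attacks that flip the imitator, and those that also flip the target."""
    successes = transfers = 0
    for obs in observations:
        adv = fgsm_non_targeted(imitator, obs, eps)
        if greedy_action(imitator, adv) != greedy_action(imitator, obs):
            successes += 1
            if greedy_action(target, adv) != greedy_action(target, obs):
                transfers += 1
    return successes, transfers


# Hypothetical two-action agents and three observations.
imitator = [[1.0, 0.0], [0.0, 1.0]]
target = [[1.0, 0.0], [0.0, 1.2]]
observations = [[1.0, 0.9], [1.0, 0.2], [0.4, 0.5]]
successes, transfers = count_transfers(imitator, target, observations, eps=0.1)
print(successes, transfers)
```
        </preformatted>
        <p>The perturbation is crafted with whitebox access to the imitated agent only; the target agent is queried solely to check whether its action changes on the same perturbed observation.</p>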
        <p>[Figure 9: (a) Average Net-Worth Gain over 10 Randomized Starts; (b) Average Net-Worth Gain over 10 Randomized Starts. Y-axis: Average Net-Worth Gain. X-axis labels: Agents trained on Percentage of Demonstration; Agents trained on Timesteps. Categories: Target Agent, 10%, 50%, 100%, Supervised Learning.]</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>We investigated the vulnerability of DRL trading agents to adversarial attacks at inference
time. We identified the attack surface and vectors of algorithmic trading policies in a novel threat
model, and proposed two active test-time attack techniques: non-targeted DoS-based delay
induction and targeted/non-targeted MITM-based adversarial perturbation. We demonstrated
the susceptibility of a benchmark DRL trading agent and of an agent based on TensorTrade, a popular
open-source framework for algorithmic trading. Through perturbation of a single
tuple from the history window, we showed that adversarial intervention can easily result in sub-optimal
performance at test time. The results demonstrate that our target agents are sensitive
even to weak attacks such as FGSM, as well as to more powerful attacks such as C&amp;W, which produce
human-fooling adversarial samples. Furthermore, portfolios tied to the agent may be impacted
in ways that are not directly reflected in the test-time performance metric, namely total reward.
With TensorTrade’s DQN, our attacks were shown to adversely affect the agent’s net-worth. This
finding may have significant repercussions for risk mitigation, as test-time performance measured through
total reward may not alert human traders to the severity of the impact on external securities tied
to the agent.</p>
      <p>We looked to Imitation Learning methods for adversary usage through a passive, test-time
attack. We considered adversary objectives such as policy imitation, self-gain, and
knowledge gain for whitebox attacks. We showed that imitated agents can possibly perform
competitively with, or better than, their target at equivalent or lower computational expense.
Supervised learners have limitations, but they may be useful enough to an adversary for
optimizing whitebox attacks. Using our imitated agents, we were able to show the
transferability of an adversarial attack to our target agent.</p>
      <p>The reported findings establish the need for further research on various aspects of security in
DRL trading agents, such as metrics and measurement techniques for benchmarking
the resilience and robustness of trading policies to adversarial attacks. Furthermore, our results
call for further studies on mitigation and defense techniques against adversarial manipulation.
These studies are likely to find current risk-aware DRL approaches of limited utility in this
domain, as such techniques typically address accidental (i.e., non-adversarial) noise in the
dynamics of the environment. Lastly, considering the significance of R&amp;D efforts in developing
and acquiring proprietary algorithmic trading policies, there remains a critical need to study the
impact of policy imitation attacks, highlighted by Behzadan and Hsu [2019], on algorithmic
trading.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Behzadan</surname>
          </string-name>
          .
          <article-title>Security of deep reinforcement learning</article-title>
          .
          <source>PhD thesis</source>
          , Kansas State University,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Behzadan</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Hsu</surname>
          </string-name>
          .
          <article-title>Adversarial exploitation of policy imitation</article-title>
          .
          <source>arXiv preprint arXiv:1906.01121</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Behzadan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Munir</surname>
          </string-name>
          .
          <article-title>Vulnerability of deep reinforcement learning to policy induction attacks</article-title>
          .
          <source>In International Conference on Machine Learning and Data Mining in Pattern Recognition</source>
          , pages
          <fpage>262</fpage>
          -
          <lpage>275</lpage>
          . Springer,
          <year>2017a</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Behzadan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Munir</surname>
          </string-name>
          .
          <article-title>Whatever does not kill deep reinforcement learning, makes it stronger</article-title>
          .
          <source>arXiv preprint arXiv:1712.09344</source>
          ,
          <year>2017b</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Behzadan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Munir</surname>
          </string-name>
          .
          <article-title>The faults in our pi stars: Security issues and open challenges in deep reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1810.10369</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Behzadan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Munir</surname>
          </string-name>
          .
          <article-title>Adversarial reinforcement learning framework for benchmarking collision avoidance mechanisms in autonomous vehicles</article-title>
          .
          <source>IEEE Intelligent Transportation Systems Magazine</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Wagner</surname>
          </string-name>
          .
          <article-title>Towards evaluating the robustness of neural networks</article-title>
          .
          <source>In 2017 IEEE Symposium on Security and Privacy (SP)</source>
          , pages
          <fpage>39</fpage>
          -
          <lpage>57</lpage>
          . IEEE,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Á.</given-names>
            <surname>Cartea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jaimungal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Penalva</surname>
          </string-name>
          .
          <article-title>Algorithmic and high-frequency trading</article-title>
          . Cambridge University Press,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Doran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Glisson</surname>
          </string-name>
          .
          <article-title>A malicious attack on the machine learning policy of a robotic system</article-title>
          .
          <source>In 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE)</source>
          , pages
          <fpage>516</fpage>
          -
          <lpage>521</lpage>
          . IEEE,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ding</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          .
          <article-title>Challenges of reinforcement learning</article-title>
          .
          <source>In Deep Reinforcement Learning</source>
          , pages
          <fpage>249</fpage>
          -
          <lpage>272</lpage>
          . Springer Singapore,
          <year>2020</year>
          . doi: 10.1007/978-981-15-4095-0_7.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>