<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Exploring Agent Behaviors in Network Security through Trajectory Clustering</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ondrej</forename><surname>Lukas</surname></persName>
<email>ondrej.lukas@aic.fel.cvut.cz</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Electrical Engineering</orgName>
<orgName type="institution">Czech Technical University in Prague</orgName>
								<address>
									<settlement>Prague</settlement>
									<country>Czechia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sebastian</forename><surname>Garcia</surname></persName>
							<email>sebastian.garcia@agents.fel.cvut.cz</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Electrical Engineering</orgName>
<orgName type="institution">Czech Technical University in Prague</orgName>
								<address>
									<settlement>Prague</settlement>
									<country>Czechia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Exploring Agent Behaviors in Network Security through Trajectory Clustering</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">8F33827B4E03EC51E533DCBF24D5BB47</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:37+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Explainable RL</term>
					<term>Trajectory Analysis</term>
					<term>Policy Evaluation</term>
					<term>Network Security</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Reinforcement learning has been successfully used for training security agents, but the behavior of the resulting policies has remained largely unexplained. In this work, we study the behavior of reinforcement learning-based attacking agents in network security environments to understand how to improve them. The sequences of steps (trajectories) generated by the agent-environment interactions are used for (i) analyzing how the behavior changes during the training process and (ii) analyzing the performance of the trained agent's policy. Our proposed method uses a vector representation of the trajectory steps, which are clustered to find similarities among trajectories based on the actions taken, their effects on the state of the environment, and the rewards obtained by the agents. The trajectory cluster analysis is paired with additional visualizations to provide a deeper understanding of the policies. Preliminary results show that the proposed method can identify behavioral patterns in the agents' policies and subsequently help guide the agent's learning process.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Reinforcement Learning (RL) has been successfully used in various complex problems, from theoretical games to robotics. Its application to the security domain has already been adopted by research in simulated security environments <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. Various model architectures, using both traditional and deep RL methods, have been proposed for RL-based agents playing the role of the attacker or penetration tester <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>.</p><p>The evaluation of an agent and its learning progress often relies only on numerical metrics such as win rate or mean return. While informative, especially during the early stages of agent training, such evaluation does not provide sufficient insight into the trained agent's behavior, its development throughout the training process, or its ability to generalize <ref type="bibr" target="#b5">[6]</ref>. Therefore, a deeper insight into the trajectories generated by the agent's policy plays an important role in hyperparameter selection, training setup, and agent verification <ref type="bibr" target="#b6">[7]</ref>. In this ongoing work, we focus on two main research questions:</p><p>1. RQ1: How can trajectory analysis provide insights for model-agnostic behavior evaluation and understanding? 2. RQ2: How do the trajectories change during the training process? Furthermore, we aim to explore the suitability of trajectory analysis for finding similarities in the behavior of different model architectures. 
These can help identify the steps necessary for solving the task and can further be used to validate the agents' behavior and explain it to humans.</p><p>The main contribution of this work is an evaluation of a variety of RL agents playing in the NetSecGame environment and a comparison of their behaviors. The comparison and evaluation follow a method for processing the game-play trajectories produced by the agents' policies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>In recent years, there have been notable advancements in RL for security, both in the agents <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref> and the environments <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b7">8]</ref>. Still, there is a lack of explainable methods that would allow verification and easier human-computer cooperation in the security domain.</p><p>The increased focus on model interpretability can also be seen in the reinforcement learning domain. There are three main approaches to explainable RL: Model explanation focuses only on the underlying model, Policy explanation explains the behavior of the agent, and Outcome explanation focuses on the local explanation of a (sub)trajectory <ref type="bibr" target="#b9">[10]</ref>.</p><p>In the latter two, the trajectories are commonly used for decision attribution <ref type="bibr" target="#b10">[11]</ref>, visual explanations <ref type="bibr" target="#b11">[12]</ref>, direct model improvement <ref type="bibr" target="#b12">[13]</ref>, or summarization of the agent's behavior <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>. However, none of these methods has been evaluated in a security scenario.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>The NetSecGame<ref type="foot" target="#foot_0">1</ref> is a simulated environment of high-level network security tasks. The agent, playing the role of the attacker, interacts with the environment in an episodic setup, learning the dynamics of the environment in the process. The scenario used in this work simulates the Sensitive data exfiltration attack, in which the task of the attacker is to (i) understand the topology of the local networks, (ii) locate the sensitive data, (iii) take the measures needed to access the data, and finally (iv) exfiltrate the data to an external location outside the local network.</p><p>In the NetSecGame, state representations and actions do not have a fixed size, which differs from most gym-like environments. States are represented by a collection of assets available to the agent, consisting of a set of known networks, a set of known hosts, a set of controlled hosts, a set of known services, and a set of known data.</p><p>The actions in this environment consist of five action types, each with a different set of parameters that are selected from the state representation (e.g., IP addresses and services). All actions have a parameter identifying the host (position in the environment) from which the action is executed. For example, given a state 𝑠 in which the agent controls a single host A in a network N, the action ScanNetwork can be played with parameters source host=A, target network=N.</p><p>Such parametrization of actions allows for a modular and flexible environment that can model various scenarios and situations. However, it also makes the trained policies difficult to visualize, analyze, and evaluate. The environment changes after every agent's move, and a new state and an immediate reward are given to the agent. 
Each of these steps is represented by a tuple (𝑠 𝑡 , 𝑎 𝑡 , 𝑟 𝑡+1 , 𝑠 𝑡+1 ), where 𝑠 𝑡 is the current state of the game, 𝑎 𝑡 is the action performed in the state 𝑠 𝑡 , 𝑟 𝑡+1 is the immediate reward for playing action 𝑎 𝑡 , and 𝑠 𝑡+1 is the following state resulting from action 𝑎 𝑡 in state 𝑠 𝑡 .</p><p>A trajectory 𝑡 is a sequence of steps starting from the initial state 𝑠 0 and ending in the terminal state of the episode, which is reached when the goal is achieved, the agent is detected, or a timeout occurs (reaching the maximum allowed episode length). To analyze the agent's behavior during and after the training, we capture the trajectories generated by each agent.</p><p>We evaluate three types of agents (models) in the current stage of this work: two variants of Q-learning and an LLM-based agent. The first model is a vanilla Q-learning algorithm with decaying 𝜖 exploration. The second model also uses Q-learning but is extended with concepts that generalize over networks without relying on details such as IP addresses, which helps merge some of the state-action pairs. This results in better generalization to unknown networks and less overfitting to the topology of the network used in the training task. The last model evaluated is based on the OpenAI LLM GPT-3.5-turbo. The LLM agent <ref type="bibr" target="#b7">[8]</ref> uses the textual representation of the state and descriptions of the goal and the environment to select the action to be played in each state. The LLM is not fine-tuned for playing the role of an attacker; it is guided only by the prompt composition.</p><p>Trajectories from 500 evaluation episodes were collected for each model at multiple training checkpoints for comparison and analysis. In the case of the LLM agent, there was no training period; thus, only the evaluation trajectories were used. The first part of the policy evaluation focuses only on the sequence of actions. We study the action type distribution per step based on all the trajectories gathered for a policy. 
The distribution of the action types shows the agent's primary goal in each step of the interaction.</p><p>The optimal trajectory in the data exfiltration scenario consists of 5 steps, which allows computing the mean action type efficiency given a set of trajectories 𝑇 as follows:</p><p>Let 𝑇 𝑤𝑖𝑛𝑠 = {𝑡 ∈ 𝑇 | 𝑡.end = win} be the subset of trajectories in which the agent wins. Then, we compute the efficiency of the action type 𝑎 𝑡 as</p><formula xml:id="formula_0">𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑐𝑦 𝑎 𝑡 (𝑇 𝑤𝑖𝑛𝑠 ) = |𝑇 𝑤𝑖𝑛𝑠 | / |{𝑠 ∈ 𝑇 𝑤𝑖𝑛𝑠 | 𝑠.action = 𝑎 𝑡 }|</formula><p>For an optimal trajectory, this metric equals 1 for each action type, as each should be played only once. Values less than 1 mean that the action of type 𝑎 𝑡 is played repeatedly, most likely with incorrect parameters.</p><p>While analyzing the action sequence brings insights into the agent's behavior, it does not fully use the information the trajectories provide. Therefore, we propose encoding each step 𝑠 of a trajectory 𝑡 using the following vector representation for further processing and analysis:</p><p>1. Size of each component of 𝑠. 2. Size of each component of 𝑠 𝑛𝑒𝑥𝑡 . 3. Amount of change caused by 𝑎 (|𝑠 𝑛𝑒𝑥𝑡 − 𝑠|). 4. Reward 𝑟. 5. Return when starting from the step 𝑠 (the sum of all rewards that the agent expects to receive from state 𝑠 until the end of the episode). 6. One-hot encoded action 𝑎 used in step 𝑠.</p><p>After the encoding, the trajectory steps are processed by UMAP <ref type="bibr" target="#b15">[16]</ref> (Uniform Manifold Approximation and Projection). UMAP is a dimensionality reduction technique that efficiently maps high-dimensional data into a lower-dimensional space. It uses manifold learning techniques to model the underlying structure of the data, preserving both local and global structure. We propose using the projection to find similarities among the trajectory steps of different models.</p></div>
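The step encoding and the action-type efficiency described above can be sketched in Python. The Step container, component names, and action-type strings below are illustrative assumptions for the sketch, not the NetSecGame API:

```python
from dataclasses import dataclass

# Hypothetical step container mirroring the (s_t, a_t, r_t+1, s_t+1) tuples.
@dataclass
class Step:
    state: dict        # component name mapped to a set of assets
    action: str        # action type, e.g. "ScanNetwork"
    reward: float
    next_state: dict

COMPONENTS = ["known_networks", "known_hosts", "controlled_hosts",
              "known_services", "known_data"]
ACTION_TYPES = ["ScanNetwork", "FindServices", "ExploitService",
                "FindData", "ExfiltrateData"]

def encode_step(step, steps_after):
    """Encode one step as the six-part feature vector from Section 3."""
    sizes = [len(step.state.get(c, set())) for c in COMPONENTS]            # 1.
    next_sizes = [len(step.next_state.get(c, set())) for c in COMPONENTS]  # 2.
    change = sum(abs(b - a) for a, b in zip(sizes, next_sizes))            # 3.
    ret = step.reward + sum(s.reward for s in steps_after)                 # 5. return-to-go
    one_hot = [1.0 if step.action == t else 0.0 for t in ACTION_TYPES]     # 6.
    return sizes + next_sizes + [change, step.reward, ret] + one_hot       # 4. is step.reward

def action_efficiency(winning_trajectories, action_type):
    """efficiency = |T_wins| / number of steps using action_type in T_wins."""
    uses = sum(1 for t in winning_trajectories for s in t if s.action == action_type)
    if uses == 0:
        return None  # the action type never appears in winning trajectories
    return len(winning_trajectories) / uses
```

An action type played exactly once per winning trajectory yields an efficiency of 1, matching the optimal five-step trajectory described above.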
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>The results of comparing action types can be seen in Figure <ref type="figure" target="#fig_2">1</ref>. It shows the distribution of actions in each step of the trajectory for Q-learning (Figure <ref type="figure" target="#fig_2">1a</ref>), Q-learning with general concepts (Figure <ref type="figure" target="#fig_2">1b</ref>), and the GPT-3.5 agent (Figure <ref type="figure" target="#fig_2">1c</ref>). The bar plot in Figure <ref type="figure" target="#fig_2">1d</ref> compares the action efficiency of each model.</p><p>The UMAP projection of the trajectory steps is shown in Figures <ref type="figure" target="#fig_4">2 and 3</ref>. The step number, underlying model, action type, and the outcome of the trajectories are highlighted.</p></div>
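The projection step can be illustrated with a dependency-free stand-in: the sketch below uses a plain PCA (via NumPy) in place of UMAP to map step vectors to 2-D. With the umap-learn package installed, umap.UMAP().fit_transform(vectors) would produce the projection actually used in this work; PCA is only a linear substitute for illustration.

```python
import numpy as np

def project_2d(step_vectors):
    """Project step vectors to 2-D for cluster inspection (PCA stand-in for UMAP)."""
    X = np.asarray(step_vectors, dtype=float)
    X = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(X, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]  # two largest components
    return X @ top2

# Example: 100 random 18-dimensional step vectors projected to the plane.
rng = np.random.default_rng(0)
points = project_2d(rng.normal(size=(100, 18)))
assert points.shape == (100, 2)
```

Unlike UMAP, this linear projection does not preserve local manifold structure, which is why UMAP is preferred for the actual cluster analysis.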
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>The comparison in Figure <ref type="figure" target="#fig_2">1</ref> shows several differences in the policies, most notably in the lengths of the trajectories and the action composition.</p><p>All three models show an initial phase of exploration in the first steps (mainly composed of the Scan Network and Scan Services actions). However, the Conceptual Q-learning agent is heavily focused on the FindData action. Since searching for data on a host requires control of that host, taking this action in the first step of the game is impractical, as confirmed by the action efficiency of 15% shown in Figure <ref type="figure" target="#fig_2">1d</ref>. The second major difference in the behavior of the Conceptual agent is the significant use of the Data Exfiltration action in the later stages of the interaction. In comparison, the LLM and Q-learning agents exfiltrate the data in very few cases, suggesting that it only happens for the correct data point required to win the game.</p><p>The high action efficiency of the Q-learning agent may indicate that the model is overfitted to the particular task and network topology. This hypothesis is further supported by the low number of Find Data actions, likely caused by a lack of exploration. In contrast, the LLM agent (which has no additional training for this particular task) shows more exploration (usage of Scan Network, Find Services, and Find Data). 
Such behavior, while less efficient in this particular task and topology, can lead to better generalization capabilities of the policy.</p><p>The UMAP projection in Figure <ref type="figure" target="#fig_3">2</ref> supports the hypothesis that the Conceptual agent uses the Exfiltrate Data action unnecessarily, as those steps should be taken later in the interaction and lead to either a timeout ending or a detection ending, which is visible in the largest central cluster.</p><p>Model attribution in the second subplot of Figure <ref type="figure" target="#fig_3">2</ref> indicates higher similarity between the Q-learning and LLM trajectory steps despite significant model differences. Figure <ref type="figure" target="#fig_4">3</ref> shows the comparison of the trajectories of the Q-learning model at five distinct points during training. Since the figure depicts only one model type, it shows a lower separation of the clusters. However, the Action type subplot shows smaller peripheral clusters of higher purity that correspond to the winning trajectories. These clusters are attributed to the policies in the later stages of the training and contain steps that occurred within the first twenty steps of the trajectories. A possible explanation is that as the model adapts to the environment, the produced trajectories have less exploration and higher similarity. In the projection, this results in the smaller peripheral clusters.</p><p>A notable exception is the pair of clusters of steps with the actions ScanNetwork and FindServices in the lower left part of the plot. The end reason subplot shows that they consist of both winning and losing trajectory steps. We can see that those steps occur at the beginning of the trajectories and for most of the models. The most likely reason is that these clusters consist of the agents' initial reconnaissance steps. 
Since the starting state, while randomized, is very similar across trajectories, this part of the Q-table is learned very early in the training process.</p></div>
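The cluster purity used in the discussion above (the fraction of a cluster's steps that share the majority label, such as the end reason of the step's trajectory) can be made concrete with a small helper. The cluster ids and labels below are illustrative:

```python
from collections import Counter

def cluster_purity(labels_by_cluster):
    """Purity per cluster: fraction of steps sharing the majority label.

    labels_by_cluster maps a cluster id to the list of labels of its
    steps, e.g. the end reason of each step's trajectory
    ("win", "timeout", "detected").
    """
    purities = {}
    for cluster, labels in labels_by_cluster.items():
        majority_count = Counter(labels).most_common(1)[0][1]
        purities[cluster] = majority_count / len(labels)
    return purities

# Illustrative clusters: a pure "win" cluster and a mixed recon cluster.
example = {
    0: ["win", "win", "win", "win"],
    1: ["win", "timeout", "detected", "win"],
}
print(cluster_purity(example))  # {0: 1.0, 1: 0.5}
```

A purity close to 1 for a peripheral cluster matches the observation that such clusters correspond to the winning trajectories, while the mixed recon clusters score lower.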
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and Future Work</head><p>In this early-stage work, we introduce policy evaluation for a security scenario using trajectory step analysis. We propose a vector representation of the trajectories generated by the RL agent and demonstrate its application in visual explanations of trained policies. We show that the trajectory steps and the proposed vector representation can be used to find similarities in the policies of different model types. We also evaluate the method for explaining the policies during the training process.</p><p>Currently, no DRL models are included in the evaluation. Additionally, a comparison with other security environments is needed.</p><p>In the project's current phase, the analysis focuses only on the individual steps of the trajectories, but such an approach might not capture all the complexities of the policy. Future steps should focus on extending this work to sequences of steps and potentially whole trajectories. The primary motivation for such an extension is to better understand and interpret the changes in the agent's behavior during training. Secondly, better clustering of trajectory steps can allow the detection of agents' intrinsic sub-goals in the trajectories, their comparison across model types, and their mapping to existing knowledge bases of attacking techniques.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Figures (a), (b), and (c) show the action type distribution per step for Q-learning (a), Q-learning with concepts (b), and the LLM-based model (c) in all evaluation episodes. The height of each bar represents the number of evaluation episodes in which the corresponding step was reached. The decreasing height of the bars shows a lower occurrence of long episodes. Figure (d) shows the action efficiency per model for winning episodes only (higher is better).</figDesc><graphic coords="5,89.29,265.07,214.20,164.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: UMAP projection of the vector representation of trajectory steps. Step number is the sequence number of the step in each trajectory. The knowledge of the End reason of the trajectory is assigned to all the steps in that trajectory.</figDesc><graphic coords="6,89.29,84.19,416.69,288.17" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: UMAP projection of trajectories obtained from the same Q-learning agent after 5000, 10 000, 15 000, 20 000, and 25 000 training episodes.</figDesc><graphic coords="7,89.29,84.19,416.69,288.17" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/stratosphereips/NetSecGame</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D R</forename><surname>Team</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><surname>Seifert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">William</forename><surname>Betser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><surname>Blum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kate</forename><surname>Bono</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Emily</forename><surname>Farris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Justin</forename><surname>Goren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kristian</forename><surname>Grana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Brandon</forename><surname>Holsheimer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Joshua</forename><surname>Marken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nicole</forename><surname>Neil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jugal</forename><surname>Nichols</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Haoran</forename><surname>Parikh</surname></persName>
		</author>
		<author>
			<persName><surname>Wei</surname></persName>
		</author>
		<ptr target="https://github.com/microsoft/cyberbattlesim" />
		<title level="m">Cyberbattlesim</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Nasimemu: Network attack simulator &amp; emulator for training agents generalizing to novel scenarios</title>
		<author>
			<persName><forename type="first">J</forename><surname>Janisch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Pevnỳ</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Lisỳ</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Symposium on Research in Computer Security</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="589" to="608" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Automated post-breach penetration testing through reinforcement learning</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
<persName><forename type="first">A</forename><surname>O'Brien</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xu</surname></persName>
		</author>
		<idno type="DOI">10.1109/CNS48642.2020.9162301</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Communications and Network Security (CNS)</title>
				<imprint>
			<date type="published" when="2020">2020. 2020</date>
			<biblScope unit="page" from="1" to="2" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Akella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Standen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bowman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Richer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-T</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2109.06449</idno>
		<title level="m">Deep hierarchical reinforcement agents for automated penetration testing</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Automated penetration testing using deep reinforcement learning</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Beuran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tan</surname></persName>
		</author>
		<idno type="DOI">10.1109/EuroSPW51379.2020.00010</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE European Symposium on Security and Privacy Workshops (EuroS&amp;PW)</title>
				<imprint>
			<date type="published" when="2020">2020. 2020</date>
			<biblScope unit="page" from="2" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Quantifying generalization in reinforcement learning</title>
		<author>
			<persName><forename type="first">K</forename><surname>Cobbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Klimov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schulman</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v97/cobbe19a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 36th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Chaudhuri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</editor>
		<meeting>the 36th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">97</biblScope>
			<biblScope unit="page" from="1282" to="1289" />
		</imprint>
	</monogr>
	<note>of Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Explainable reinforcement learning: A survey and comparative review</title>
		<author>
			<persName><forename type="first">S</forename><surname>Milani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Topin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Veloso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fang</surname></persName>
		</author>
		<idno type="DOI">10.1145/3616864</idno>
<ptr target="https://doi.org/10.1145/3616864" />
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Out of the cage: How stochastic parrots win in cyber security environments</title>
		<author>
			<persName><forename type="first">M</forename><surname>Rigaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Lukáš</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Catania</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Garcia</surname></persName>
		</author>
		<idno type="DOI">10.5220/0012391800003636</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th International Conference on Agents and Artificial Intelligence -Volume</title>
				<meeting>the 16th International Conference on Agents and Artificial Intelligence -Volume<address><addrLine>SciTePress</addrLine></address></meeting>
		<imprint>
			<publisher>ICAART, INSTICC</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="774" to="781" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Deep reinforcement learning for cyber security</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">J</forename><surname>Reddi</surname></persName>
		</author>
		<idno type="DOI">10.1109/TNNLS.2021.3121870</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Neural Networks and Learning Systems</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="3779" to="3795" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Explainable deep reinforcement learning: State of the art and challenges</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Vouros</surname></persName>
		</author>
		<idno type="DOI">10.1145/3527448</idno>
<ptr target="https://doi.org/10.1145/3527448" />
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">V</forename><surname>Deshmukh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dasgupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Krishnamurthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Theocharous</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Subramanian</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.04073</idno>
		<idno type="arXiv">arXiv:2305.04073</idno>
<ptr target="http://arxiv.org/abs/2305.04073" />
		<title level="m">Explaining RL Decisions with Trajectories</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Takagi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tabalba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kirshenbaum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Leigh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.07928</idno>
		<title level="m">Abstracted trajectory visualization for explainability in reinforcement learning</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings</title>
		<author>
			<persName><forename type="first">J</forename><surname>Co-Reyes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Eysenbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abbeel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v80/co-reyes18a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 35th International Conference on Machine Learning</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Dy</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Krause</surname></persName>
		</editor>
		<meeting>the 35th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">80</biblScope>
			<biblScope unit="page" from="1009" to="1018" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Highlights: Summarizing agent behavior to people</title>
		<author>
			<persName><forename type="first">D</forename><surname>Amir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Amir</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th international conference on autonomous agents and multiagent systems</title>
				<meeting>the 17th international conference on autonomous agents and multiagent systems</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1168" to="1176" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Generation of policy-level explanations for reinforcement learning</title>
		<author>
			<persName><forename type="first">N</forename><surname>Topin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Veloso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="2514" to="2521" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>McInnes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Healy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Melville</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1802.03426</idno>
		<title level="m">UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">ArXiv e-prints</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
