Evaluating Trainees in Large Cyber Exercises

Andrea Artioli1,*,†, Mauro Andreolini1,†, Luca Ferretti1,† and Mirco Marchetti1,†
1 University of Modena and Reggio Emilia, Modena, Italy

Abstract
Cyber ranges are becoming a widely used alternative for teaching trainees security through practice on realistic systems. They support evaluation and awareness through dashboards that display how many training objectives have been achieved in a cyber exercise. In our previous paper [1] we outlined the limitations of standard dashboards and proposed a new framework for modeling and assessing trainee activity through trainee graphs, reference graphs and scoring functions. In particular, we introduced a scoring function based on the symmetric difference between a trainee and a reference graph, which pinpoints quite effectively the inefficiencies of a trainee in an exercise. In this paper, we show how that model, while working well for small and coherent exercises, poses several problems when applied to larger labs made of several heterogeneous challenges. The accuracy of the symmetric difference drops rapidly as the number of nodes and edges in the reference graph increases, making it unusable in large environments. Furthermore, there are edge cases where trainees that do not complete an exercise obtain a higher score than those who do. This happens because the symmetric difference turns out to be higher if the trainee advances while exploring fewer nodes of the reference graph (which is the case for a skilled attacker). To address these problems, we aim to reduce the complexity of the graphs fed to the scores. We improve the older model by representing an exercise as a set of smaller local graphs (one for each coherent, intermediate challenge, which can be assessed with a specific local score) and a global graph (representing the interconnection between intermediate challenges, which can be assessed with a specific global score).
The main benefits of introducing global and local graphs with global and local progress are twofold: (a) with smaller graphs, scores related to precision (symmetric differences) remain accurate; (b) we can assign different scores to different parts of the exercise, which is crucial in heterogeneous engagements. We have implemented a Python-based simulator that generates random exercises and compares the performance of the previous and proposed trainee models under specific scores. Our results empirically show that, on average, the original model fails to scale to sizes in the order of tens of vertices and edges, while the new model is able to preserve precise scores locally and better track overall progress.

Keywords
Cyber range, Graph analytics, Training, Cyber exercise, Monitoring

ITASEC24: Italian Conference on Cybersecurity, April 08–12, 2024, Salerno, Italy
* Corresponding author.
† These authors contributed equally.
andrea.artioli@unimore.it (A. Artioli); mauro.andreolini@unimore.it (M. Andreolini); luca.ferretti@unimore.it (L. Ferretti); mirco.marchetti@unimore.it (M. Marchetti)
ORCID: 0009-0007-7965-1322 (A. Artioli); 0000-0003-3671-6927 (M. Andreolini); 0000-0001-5824-2297 (L. Ferretti); 0000-0002-7408-6906 (M. Marchetti)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

General consensus states that security is best learned by practice on realistic systems, including workstations and servers, IoT devices and cyber-physical systems. A cyber range is a pre-arranged virtual environment that allows trainees to simulate realistic attack and defense scenarios on an architecture resembling some original system. Cyber ranges typically support evaluation and awareness through dashboards that display how many training goals have been fulfilled.
This approach does not allow instructors to identify the reasons behind a bad trainee performance. To overcome this limitation, Andreolini et al. [1] introduce a novel, modular and extensible framework that captures trainee activity into a trainee graph and compares it with a reference graph provided by the instructor. Given specific learning objectives, novel scores are proposed in order to measure trainee proficiency in achieving them. A score is a scalar function over a set of graphs. Many scoring strategies have been considered to assess trainee speed (shortest path to a goal) and precision (combined symmetric difference between trainee edges/vertices and reference graph edges/vertices). While these scores work well for small and coherent challenge environments (e.g., a single host executing a Web server), several problems arise when trying to apply this approach to a larger environment (e.g., heterogeneous multi-host labs located in different subnets). The accuracy of the symmetric difference decreases abruptly as the number of nodes and edges in the reference graph increases. Furthermore, there are edge cases where trainees that do not reach the main goal of an exercise obtain a higher score than those who do. To address these problems, we aim to reduce the complexity of the graphs fed to the scores. We model an exercise as a set of smaller local graphs (each representing a coherent, intermediate challenge which can be assessed with a specific local score) and a global graph (representing the interconnection between intermediate challenges, which can be assessed with a specific global score). The main benefits of introducing global and local graphs with global and local progress are twofold: (a) with smaller graphs, scores related to precision (symmetric differences) remain accurate; (b) we can assign different scores to different parts of the exercise, which is crucial in heterogeneous engagements.
We have implemented a Python-based simulator that generates random exercises and compares the performance of the previous and proposed trainee models under specific scores. Our results empirically show that, on average, the original model fails to scale to sizes in the order of tens of vertices and edges, while the new model is able to preserve precise scores locally and better track overall progress. The paper is organized as follows. Section 2 briefly recalls the original scoring model introduced in [1], while Section 3 points out the limitations of the old model and introduces the new one. Section 4 describes our testbed and compares the performance of the old and new models under different scenarios. Section 5 discusses related work. Finally, Section 6 concludes the paper and outlines possible future work.

2. Background

2.1. Modeling trainee activities

In this section we briefly recall the original scoring model introduced in [1]. We model trainee activities through oriented graphs where vertices represent intermediate states that are reached by a trainee during an exercise, while edges represent the actions performed by a trainee to move from a particular state to the next one. Unintended actions are modeled as edges towards special vertices representing errors. The start and end dummy vertices enclose a specific challenge. Both vertices and edges may be labelled with additional information (timestamps on vertices and edges to track progress over time, or command options on edges to better discern useful from useless commands). A reference graph is prepared by an instructor and models the ideal behavior of a trainee during an exercise. Figure 1 shows an example reference graph with intermediate steps 𝑆𝑖 (i = 1, …, 5) carried out through actions 𝑎𝑘 (k = 1, …, 7).

Figure 1: An example reference graph

A trainee graph tracks the actions performed by a trainee during an exercise.
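Such a reference graph is just a directed graph with action labels on its edges. The following sketch encodes one as a plain Python dict; the exact topology of Figure 1 is not recoverable from the text, so the edge set below is an illustrative assumption (the paper's simulator uses networkx, but a dict keeps the sketch dependency-free).

```python
# Vertices are intermediate states; each edge carries the action label
# a_k that moves the trainee between states.  Topology is hypothetical.
reference_edges = {
    ("Start", "S1"): "a1",
    ("Start", "S2"): "a2",
    ("S2", "S3"): "a3",
    ("S1", "S4"): "a4",
    ("S3", "S5"): "a5",
    ("S4", "End"): "a6",
    ("S5", "End"): "a7",
}
# The vertex set is implied by the edge endpoints.
reference_vertices = {v for edge in reference_edges for v in edge}
```

The "Start" and "End" dummy vertices enclose the challenge exactly as described above; the labels let the build process match timeline events against edges.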
It is built automatically, either offline (at the end of an exercise, for a post-mortem performance analysis) or online (during an engagement, to track live trainee progress on a dashboard). The build process uses a reference graph and a set of metrics collected on the game network (e.g., command history and Web browsing history) during the exercise. These metrics allow us to build an event timeline and to match it with the intermediate states of a reference graph. Whenever an event in the timeline matches an edge (𝑉𝑖, 𝑉𝑗) in the reference graph, the trainee graph is updated as follows: vertex 𝑉𝑗 is added to the trainee graph; vertex 𝑉𝑖 is located in the trainee graph; an edge (𝑉𝑖, 𝑉𝑗) is added to the trainee graph. Figure 2 shows the incremental update of a trainee graph with vertex 𝑆2 and edge 𝑒2. Matching of timeline events with trainee actions is done in ordered fashion on all nodes of the reference graph.

Figure 2: Incremental update of the trainee graph

If, on the other hand, a timeline event 𝑒 cannot be matched against any edge in the reference graph, the trainee graph is updated as follows: the current vertex 𝑉𝑖 in the trainee graph is located; the next expected vertex 𝑉𝑗 in the reference graph is identified (such a vertex always exists, be it "start" if the trainee hasn't yet followed the recommended solution, or an intermediate one if the trainee has followed some steps of the exercise); the label 𝑁𝑗 of the vertex 𝑉𝑗 is identified; a new dummy vertex 𝑉𝑗 is added to the trainee graph with label 𝑁𝑗_𝑒𝑟𝑟 (if it already exists, vertex insertion is omitted); an edge (𝑉𝑖, 𝑉𝑗) is added with label set to 𝑒. Figure 3 shows the insertion of a dummy node in a trainee graph with event 𝑒.
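The two update rules above can be sketched as a single function. This is a minimal, dependency-free reading of Section 2.1, under the simplifying assumption that an event matches at most one reference edge leaving the current vertex; the helper name and data layout are ours, not the paper's.

```python
def update_trainee_graph(trainee_v, trainee_e, ref_edges, current, event):
    """Incremental trainee-graph update. `ref_edges` maps (V_i, V_j)
    pairs to the action label that triggers the transition; `event` is
    the next entry of the event timeline. Returns the new current vertex."""
    # Case 1: the event matches a reference edge (V_i, V_j) leaving `current`.
    for (vi, vj), action in ref_edges.items():
        if vi == current and action == event:
            trainee_v.add(vj)        # add vertex V_j
            trainee_e.add((vi, vj))  # add edge (V_i, V_j)
            return vj
    # Case 2: unmatched event -> attach a dummy error vertex N_j_err to the
    # next expected vertex V_j (here assumed to be any successor of `current`).
    next_expected = next(vj for (vi, vj) in ref_edges if vi == current)
    err = f"{next_expected}_err"
    trainee_v.add(err)               # a set makes re-insertion a no-op
    trainee_e.add((current, err))    # edge triggered by the offending event
    return current
```

Starting from `trainee_v = {"Start"}` and an empty edge set, feeding the event timeline through this function reproduces the incremental builds of Figures 2 and 3.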
Figure 3: Insertion of a dummy node in the trainee graph

2.2. Scoring functions

A score is a function 𝑓 : 𝐺𝑛 → R that takes as input a set of 𝑛 oriented graphs (including at least one trainee and one reference graph), and outputs a real number 𝑠 in a specified interval [𝑎, 𝑏] (e.g., [0, 1]). Different learning objectives call for different scores. We pick the following scores from [1] that will be used in the rest of the paper.

Score 𝑠0. Let 𝐺𝑢 be a trainee graph and 𝐺𝑟 a reference graph. Let 𝑙𝑢 and 𝑙𝑟 be the length of the shortest path from the start vertex to the end vertex in the trainee and reference graph, respectively. We define 𝑠0 in the following equation:

𝑠0 = 𝑙𝑟/𝑙𝑢 if the trainee completes the exercise, 0 otherwise (1)

The rationale is as follows. If the trainee completes the exercise, the 𝑙𝑟/𝑙𝑢 fraction rewards shorter exploitation paths (𝑙𝑢 is lower). An unsolved exercise is not rewarded.

Score 𝑠1. Let 𝐺𝑢 = (𝑉𝑢, 𝐸𝑢) be a trainee graph and 𝐺𝑟 = (𝑉𝑟, 𝐸𝑟) a reference graph. We consider the symmetric difference of two sets 𝐴, 𝐵: 𝐴△𝐵 = (𝐴 ∖ 𝐵) ∪ (𝐵 ∖ 𝐴). In other words, 𝐴△𝐵 contains all members of 𝐴 not in 𝐵 and all members of 𝐵 not in 𝐴. We use two interesting properties of the symmetric difference: (a) if two sets 𝐴 and 𝐵 coincide, 𝐴△𝐵 = ∅; (b) if two sets 𝐴 and 𝐵 are disjoint, 𝐴△𝐵 = 𝐴 ∪ 𝐵. We define the efficacy 𝜈 as follows:

𝜈 = 1 − |𝑉𝑢△𝑉𝑟| / |𝑉𝑢 ∪ 𝑉𝑟| (2)

If the trainee and reference graph coincide, we have 𝑉𝑢△𝑉𝑟 = ∅, thus 𝜈 = 1; the trainee has followed exactly the sequence of intermediate states modeled by the reference graph, and obtains the highest possible score. On the other hand, if the trainee and reference graph are completely disjoint, we have 𝑉𝑢△𝑉𝑟 = 𝑉𝑢 ∪ 𝑉𝑟, thus 𝜈 = 0; the trainee hasn't reached a single intermediate state of those in the reference graph, and obtains the lowest possible score.
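The 𝑠0 score of Equation 1 only needs shortest Start→End path lengths, so it can be sketched with a plain BFS over edge sets (the function names and edge-set representation are our own; the paper's simulator relies on networkx instead).

```python
from collections import deque

def shortest_path_len(edges, src="Start", dst="End"):
    """BFS shortest-path length over a set of directed edges.
    Returns None if dst is unreachable from src."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def s0(ref_edges, trainee_edges):
    """Score s0 (Equation 1): l_r / l_u if the trainee reaches End, else 0.
    The reference graph is assumed to always contain a Start->End path."""
    l_r = shortest_path_len(ref_edges)
    l_u = shortest_path_len(trainee_edges)
    return l_r / l_u if l_u is not None else 0.0
```

Reaching the End vertex in the trainee graph is what "completes the exercise" means here, so an unreachable End yields the 0 branch of Equation 1.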
Intermediate performance produces scores in the [0, 1] interval. Efficacy measures the ability of a trainee to follow the paths of a solution described in the reference graph. Similarly, we define the efficiency 𝜂 as follows:

𝜂 = 1 − |𝐸𝑢△𝐸𝑟| / |𝐸𝑢 ∪ 𝐸𝑟| (3)

The same observations hold for 𝜂. A trainee that performs the exact actions modeled in the reference graph is assigned the highest score, while a trainee that misses every action is assigned the lowest score. Intermediate performance produces scores in the [0, 1] interval. Score 𝑠1 is defined as a linear combination of 𝜈 and 𝜂:

𝑠1(𝐺𝑢, 𝐺𝑟) = 𝛼𝜈 + (1 − 𝛼)𝜂 (4)

In this paper, we choose 𝛼 = 0.5, but the score may be tuned to weigh efficacy or efficiency more, according to the specific learning objective. This score rewards trainees who are able to reach the final goal of an exercise following the path defined by the reference graph. Trainees who could not reach the goal or who make many mistakes are penalized with a low score. This is a good candidate for teaching exercises, since it tracks trainee actions during the exercise, taking into account the ability to follow the taught path to reach a goal.

3. A trainee model for large environments

3.1. Limitations of the original trainee model

While the original scoring model introduced in [1] works well for small and coherent engagements that can be evaluated consistently with a single scoring function, it poses several challenges in larger environments. With "larger environment" we refer, for example, to labs made of multiple hosts with heterogeneous architectures, operating systems, libraries and applications, where a trainee has to showcase different skills (speed, precision, defense evasion, stealthiness, discovery of novel exploitation paths) in different contexts (UNIX, Windows, Web applications, networks, cryptography, reverse engineering). Under these circumstances it is often impossible to sum up trainee performance with a single scoring function.
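Before examining how 𝑠1 degrades, it is worth noting how little machinery Equations 2–4 actually need: Python's `^` and `|` set operators are exactly the symmetric difference and union used by efficacy and efficiency. The following sketch (our representation, with graphs as (vertex set, edge set) pairs) will be useful when discussing the failure modes below.

```python
def s1(trainee, reference, alpha=0.5):
    """Score s1 (Equations 2-4) over (vertex set, edge set) pairs."""
    (Vu, Eu), (Vr, Er) = trainee, reference
    efficacy = 1 - len(Vu ^ Vr) / len(Vu | Vr)          # Eq. 2
    efficiency = 1 - len(Eu ^ Er) / len(Eu | Er)        # Eq. 3
    return alpha * efficacy + (1 - alpha) * efficiency  # Eq. 4
```

Identical graphs score 1.0 and fully disjoint graphs score 0.0, matching the two boundary properties of the symmetric difference stated above.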
Furthermore, larger environments imply a larger number of interconnected hosts and services, resulting in larger trainee and reference graphs. Unfortunately, the accuracy of the symmetric difference drops significantly as the number of vertices and edges in the reference graph increases. Even worse, in some pathological cases trainees that do not complete an exercise obtain a higher score than those who do. This happens because the symmetric difference turns out to be higher if the trainee explores fewer nodes of the reference graph (which is the case for a skilled attacker). These factors contribute to making the 𝑠1 score useless. To address these limitations we change the scoring model with the aim of reducing the complexity of the graphs fed to the scoring functions, as shown in the following section.

3.2. The proposed trainee model

Instead of modeling an exercise as a single, potentially large graph, we split it into a set of 𝑘 smaller local graphs 𝐿𝐺𝑖 (𝑖 ∈ [1, 𝑘]), each representing a single intermediate step. Figure 4 shows a possible local reference graph with local start, end and intermediate vertices 𝐿𝑆𝑖,𝑠𝑡𝑎𝑟𝑡, 𝐿𝑆𝑖,𝑒𝑛𝑑 and 𝐿𝑆𝑖,𝑗 respectively (𝑖 ∈ [1, 𝑘], 𝑗 ∈ [0, 7]). Local trainee graphs are built exactly as in the older model. Each local graph is assigned a scoring function 𝑙𝑠𝑖 : 𝐺𝑛 → R (𝑖 ∈ [1, 𝑘]) that assesses trainee performance in that specific intermediate step.

Figure 4: A local reference graph

Similarly, we create a global graph 𝐺𝐺 that models the interconnection between the intermediate steps. Figure 5 shows a possible global graph with global start, end and intermediate vertices 𝐺𝑆𝑠𝑡𝑎𝑟𝑡, 𝐺𝑆𝑒𝑛𝑑 and 𝐺𝑆𝑗 respectively (𝑗 ∈ [0, 𝑋]). Global trainee graphs are built exactly as in the older model, by observing transition events from one intermediate challenge to another.
The global graph is assigned a scoring function 𝑔𝑠 : 𝐺𝑛 → R that assesses trainee performance in advancing through the intermediate challenges.

Figure 5: A global reference graph

In principle there is no impediment to assigning multiple scores to a specific graph (e.g., to simultaneously capture multiple skills such as speed and precision); however, for reasons of simplicity, in this paper we will use only one score per graph. At the end of an exercise, the instructor has available:

• a set of 𝑘 + 1 graphs (𝐺𝐺, 𝐿𝐺1, 𝐿𝐺2, . . . , 𝐿𝐺𝑘) that track the overall advancement of a trainee through the intermediate challenges and in every single challenge;
• a set of 𝑘 + 1 scores (𝑔𝑠, 𝑙𝑠1, 𝑙𝑠2, . . . , 𝑙𝑠𝑘) that quantify the aforementioned advancement and allow the instructor to compare it against the performance of other trainees (or even the same trainees over several, repeated exercises).

Aggregating these scores to produce a leaderboard is a very interesting topic that is out of scope for the current paper. Here, we show that the scores produced by the new model are more consistent than those of the old one in large graphs.

4. Results

In Section 4.1 we briefly discuss the simulator, the hardware/software testbed and the simulation scenarios carried out. In Section 4.2 we compare the performance of the 𝑠1 score in the old and new model under different simulation scenarios.

4.1. Testbed

We have implemented a simulator that models an exercise based on a multitude of hosts organized in multiple networks. Each host exhibits a specific challenge which the trainee is supposed to solve. The trainee is also supposed to advance through the lab in a specific sequence. This is a quite popular model in the cyber exercise world (e.g., HackTheBox Pro Labs [2] and Offensive Security Proving Grounds [3]). The simulator automates the analysis proposed in this paper by performing the following operations.
It receives as an input parameter the number of challenges 𝑘 which, in this paper, is also the lab size in terms of number of hosts. We range 𝑘 ∈ [5, 60] hosts because current lab deployments are similarly sized. Then, the simulator generates the 𝑘 nodes of the global graph and connects them randomly using the Watts-Strogatz algorithm, which is used to create small-world graphs. Then, for each node in the global graph, a local graph representing the intermediate challenge is generated. For simplicity, in each local graph we use a constant number of ten nodes; we have verified that this design choice does not alter the main results of the paper. We also connect the nodes of the local graph randomly using the Watts-Strogatz algorithm. At this point, we have represented an exercise in the new model. To perform a fair comparison, we also generate the equivalent large reference graph in the old model by connecting the local graphs as dictated by the global graph. After setting up the graphs, the simulator performs random walks over the global graph and, for each node of the global graph, performs random walks into the associated local graph. To speed up simulations, local random walks are performed with two different error rates: 0.95 (related to a beginner) and 0.0 (related to an expert). The goal is twofold: (a) simulate the progress of a trainee through different hosts in the lab; (b) simulate the progress of both a highly skilled and a rookie trainee in a specific challenge. At every step of the global and local random walks we compute the appropriate global and local scoring functions. Finally, at the end of an exercise the simulator saves the test configuration (𝑘, the local and global reference and trainee graphs, the global and local scores) into separate files. The simulator has been developed in Python 3.11.7 and uses the networkx package to handle large graphs.
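The two building blocks of the simulator described above, small-world graph generation and error-prone random walks, can be sketched as follows. This is not the paper's implementation: the real simulator calls networkx's `watts_strogatz_graph`, while this dependency-free sketch hand-rolls a simplified ring-lattice-with-rewiring in the same spirit, and the function names are our own.

```python
import random

def watts_strogatz(n, k=4, p=0.1, rng=None):
    """Simplified Watts-Strogatz sketch: ring lattice with k nearest
    neighbours, each edge rewired with probability p.
    Returns a set of directed (u, v) edges."""
    rng = rng or random.Random(0)
    edges = set()
    for i in range(n):
        for j in range(1, k // 2 + 1):
            edges.add((i, (i + j) % n))   # connect to the next k/2 nodes
    result = set()
    for (u, v) in edges:
        w = rng.randrange(n)
        if rng.random() < p and w != u and (u, w) not in result:
            result.add((u, w))            # rewire the far endpoint
        else:
            result.add((u, v))
    return result

def random_walk(edges, start, steps, error_rate, rng=None):
    """Simulate a trainee walking the graph: with probability `error_rate`
    a step is a mistake (a dummy error vertex, as in Section 2.1),
    modelling a beginner (0.95) versus an expert (0.0)."""
    rng = rng or random.Random(0)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    path, node = [start], start
    for _ in range(steps):
        if not adj.get(node):
            break                          # dead end: stop the walk
        if rng.random() < error_rate:
            path.append(f"{node}_err")     # wrong action, stay in place
        else:
            node = rng.choice(adj[node])
            path.append(node)
    return path
```

A walk with error rate 0.0 never visits an error vertex, while one with error rate 0.95 accumulates error vertices at almost every step, which is what drives the beginner/expert trainee graphs apart.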
Simulations have been run on a host with an Intel Core 12700H CPU, 16 GB RAM and a 1 TB SSD, running the Arch Linux GNU/Linux distribution.

4.2. Simulation results

Figure 6 shows the evaluation of scores 𝑠0 and 𝑠1 under different simulation scenarios. Labels have the form score/model/stats, where "score" is either the 𝑠0 or 𝑠1 score, "model" refers to the old-style or new-style trainee graphs (local or global in the latter case) and "stats" is either the minimum or maximum score value observed during a specific simulation. Let us first focus on the 𝑠1/old/max and 𝑠1/old/min labels in Figure 6, which represent the maximum and minimum values of 𝑠1 computed with the old model in exercises of increasing complexity in terms of vertices. As can be seen, the maximum value of the symmetric difference score drops with 𝑘, making 𝑠1 unusable starting from approximately 𝑘 = 15 machines. The ratio of vertices in common between a trainee graph and a reference graph tends to drop as the number of vertices and edges increases. Keeping a high 𝑠1 score would mean that an attacker exploits every node in exactly every possible way defined by the instructor, which is very hard and often simply not needed. On the other hand, the minimum value stabilizes to zero for pretty much the whole range of considered hosts; this means that bad performance is consistently evaluated. A very similar behavior is exhibited by the 𝑠1 score in the new model (labels 𝑠1/new-global/*) when operating on larger and larger global trainee graphs. Let us now focus on the 𝑠0/new-global/* and 𝑠1/new-local/* labels, which represent respectively 𝑠0 in the new model (applied to a global trainee graph) and 𝑠1 in the new model (applied to local trainee graphs). Here, the aforementioned limitations do not apply; both good and bad performance are consistently evaluated for the whole range of considered hosts.
In particular, the 𝑠0/new-global/max plot is consistently high because 𝑠0 rewards very highly a trainee that is able to find the shortest exploitation path (which almost always happens in repeated simulations). Regarding 𝑠1/new-local/max, 𝑠1 performs extremely well due to the constant and low number of ten nodes in every local graph generated by the simulator.

Table 1 shows the values of the 𝑠1 score in a pathological simulation representing an exercise run in a lab of 𝑘 = 50 nodes. We show only this value for reasons of space, but we have verified that the problem persists for basically every value of 𝑘, and is more frequent with increasing values of 𝑘. The old reference graph has a shortest path to the goal of length 1. Two trainees carry out the exercise, each one exhibiting different skills (a beginner with error rate 0.95 and an expert with error rate 0). The expert trainee finds the shortest path and completes the challenge, while the beginner trainee develops an incomplete exploitation path of length 9 and fails to complete the challenge. One would expect a higher score for the expert trainee than for the beginner trainee; however, this is not the case. The expert trainee receives a score of 0.035, while the beginner trainee receives a score of 0.07 (exactly double). The reason for this anomaly lies again in the way the symmetric graph difference works. The beginner trainee graph shares a longer path with the reference graph than the expert trainee graph does; this leads to a higher similarity of graphs and a higher 𝑠1 score.

Error rate | Completed | Reference shortest path | Trainee shortest path | 𝑠1/old
0          | True      | 1                       | 1                     | 0.035
0.95       | False     | 1                       | 9                     | 0.07

Table 1: Scoring anomalies in the old model

Table 2 compares the values of the 𝑠0 and 𝑠1 scores in the old and new model under the same pathological simulation. As we can see, the 𝑠1 score still exhibits the same limitation and is thus useless as a scoring function.
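The anomaly can be reproduced on a toy graph (illustrative numbers, not the paper's 𝑘 = 50 simulation): the reference solution contains a Start→End shortcut plus a long alternative path the instructor also modelled. The expert takes the shortcut; the beginner wanders down the long path and never reaches End, yet the symmetric-difference score rewards the beginner.

```python
# Reference: a Start->End shortcut plus a long chain Start->S1->...->S9->End.
chain = ["Start"] + [f"S{i}" for i in range(1, 10)] + ["End"]
ref_edges = {("Start", "End")} | {(a, b) for a, b in zip(chain, chain[1:])}
ref_vertices = set(chain)

def s1(Vu, Eu, Vr, Er, alpha=0.5):
    nu = 1 - len(Vu ^ Vr) / len(Vu | Vr)    # efficacy, Eq. 2
    eta = 1 - len(Eu ^ Er) / len(Eu | Er)   # efficiency, Eq. 3
    return alpha * nu + (1 - alpha) * eta   # Eq. 4

# Expert: takes the shortcut and completes the exercise.
expert = s1({"Start", "End"}, {("Start", "End")}, ref_vertices, ref_edges)
# Beginner: walks 9 steps down the long path and never reaches End.
beginner = s1(set(chain[:-1]),
              {(a, b) for a, b in zip(chain[:-2], chain[1:-1])},
              ref_vertices, ref_edges)
assert beginner > expert  # the failing trainee outscores the successful one
```

The expert shares only two vertices and one edge with the reference graph, so the symmetric difference is large; the beginner shares almost everything except the goal, so the symmetric difference is small. Exactly as in Table 1, graph similarity trumps goal completion.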
On the other hand, the 𝑠0 score behaves correctly, since it only focuses on path lengths and not on graph similarities. This leaves open the question of how to fix 𝑠1 to remove this anomaly. We plan to address this problem in future work.

Error rate | Completed | 𝑠1/new-local/[min-max] | 𝑠1/new-global | 𝑠0/new-global
0          | True      | [0.95 − 0.95]          | 0.027         | 1.0
0.95       | False     | [0.10 − 0.20]          | 0.07          | 0.0

Table 2: Scoring anomalies - old vs. new models

5. Related Work

Previous literature has explored formal representations to articulate an attacker's maneuvers in terms of the techniques utilized and vulnerabilities exploited within systems and configurations. Attack trees [4] brought the first structured representation of attacks on a system and the corresponding countermeasures, organized in a hierarchical tree format. The concept of attack trees has been broadened in academic research. Attack-defense trees [5] explicitly model interactions between attackers and defenders, enabling a more comprehensive and precise security assessment compared to traditional attack trees. Attack-response trees [6] enhance attack-defense trees to account for uncertainties in intrusion detection such as false positives and negatives. It is worth pointing out that as the number of vertices and edges in attack trees increases, their complexity escalates rapidly. Specifically, it becomes increasingly resource-intensive to trace all pathways from a leaf node to the root node. In realistic scenarios, the network of nodes and connections often surpasses thousands, where the mere addition of a single node significantly expands the number of arcs and potential attack routes. Moreover, given that the root node symbolizes the ultimate aim of the attack, complex multi-stage attacks may necessitate the utilization of multiple attack trees for accurate representation. Attack graphs [7] represent the infrastructure requiring protection, including network topology, vulnerable assets, and available exploits.
They point out the pathways an attacker must traverse to achieve specific objectives. All the aforementioned approaches offer a static perspective of attacks and mitigation strategies, but they fail to track the progress of an attacker on a live system. In the nascent stages of scoring systems research for cyber ranges, the main approaches focused on signaling goal completion rather than evaluating trainee performance, as in [8], [9], [10]. To the best of our knowledge, [1] is the first proposal that models and assesses user activity in cyber exercises with the help of trainee graphs, reference graphs and scoring functions. However, as this paper shows, the original trainee and scoring model may perform poorly in large environments.

6. Conclusions

In this paper we have discussed our previously published trainee and scoring model [1] in the context of larger exercise labs. We have shown how the accuracy of the symmetric difference decreases abruptly as the number of nodes and edges in the reference graph increases. Furthermore, there are pathological corner cases where trainees performing worse receive higher scores than those who perform better. We have tracked down the failures to the way the symmetric difference works and have devised a strategy to mitigate some of its flaws by feeding smaller graphs to the scoring functions. We have verified through simulations that our approach indeed improves the accuracy of the 𝑠1 score. Future work will be centered on fixing the scoring anomaly produced by 𝑠1 in some corner-case simulations and on implementing the new scoring model in the prototype presented in [1].

References

[1] M. Andreolini, V. G. Colacino, M. Colajanni, M. Marchetti, A framework for the evaluation of trainee performance in cyber range exercises, Mob. Netw. Appl. 25 (2020) 236–247. URL: https://doi.org/10.1007/s11036-019-01442-0. doi:10.1007/s11036-019-01442-0.
[2] HackTheBox Pro Labs, https://www.hackthebox.com/hacker/pro-labs, 2017.
[3] Offensive Security Proving Grounds, https://www.offsec.com/labs/, 2020.
[4] B. Schneier, Attack trees, Dr. Dobb's Journal 24 (1999) 21–29.
[5] B. Kordy, P. Kordy, S. Mauw, P. Schweitzer, ADTool: security analysis with attack–defense trees, in: International Conference on Quantitative Evaluation of Systems, Springer, 2013, pp. 173–176.
[6] S. A. Zonouz, H. Khurana, W. H. Sanders, T. M. Yardley, RRE: A game-theoretic intrusion response and recovery engine, IEEE Transactions on Parallel and Distributed Systems 25 (2013) 395–406.
[7] X. Ou, W. F. Boyer, M. A. McQueen, A scalable approach to attack graph generation, in: Proceedings of the 13th ACM Conference on Computer and Communications Security, ACM, 2006, pp. 336–345.
[8] P. Čeleda, J. Čegan, J. Vykopal, D. Tovarňák, KYPO - a platform for cyber defence exercises, M&S Support to Operational Tasks Including War Gaming, Logistics, Cyber Defence. NATO Science and Technology Organization (2015).
[9] J. Vykopal, M. Vizváry, R. Oslejsek, P. Celeda, D. Tovarnak, Lessons learned from complex hands-on defence exercises in a cyber range, in: 2017 IEEE Frontiers in Education Conference (FIE), IEEE, 2017, pp. 1–8.
[10] M. Carlisle, M. Chiaramonte, D. Caswell, Using CTFs for an undergraduate cyber education, in: 2015 USENIX Summit on Gaming, Games, and Gamification in Security Education (3GSE 15), 2015.