<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Pretraining Graph State Encoders for microRTS using Graph Self-Supervised Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pavan Kantharaju</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Smart Information Flow Technologies</institution>
          ,
          <addr-line>319 1st Ave North, Suite 400, Minneapolis, MN 55401-1689</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Real-Time Strategy (RTS) games are a popular and successful genre of video games that also doubles as a research environment for evaluating challenging AI problems. microRTS is one such research environment that represents many of the AI research problems inherent in commercial RTS games in a low-graphics-fidelity, grid-based game. Prior work studied the use of graph game states for Deep Reinforcement Learning (DRL)-based game-playing in microRTS. However, the state encoder used in the DRL agent was specifically trained for the task of DRL. Ideally, we want to train a single state encoder that can be transferred to a variety of downstream tasks (e.g., player modeling, game-playing, and content generation) for microRTS to minimize the costly training of state encoders for each possible task. Additionally, the state encoder used a Gated Recurrent Unit for processing graphs over Graph Neural Networks (GNNs) designed to process graphs. This paper provides an initial study on pretraining a transferable GNN-based graph state encoder using a variant of the self-supervised learning algorithm Distillation with No Labels (DINO) for graph representation learning over a set of microRTS game replays. We provide a qualitative analysis of the latent graph states through a cluster analysis and begin to evaluate the transferability of the latent state representations starting with the task of action prediction. We show that the latent states contain information about microRTS game maps despite not being explicitly trained on spatial map features, as well as associations that hint at particular types of player behaviors. We also show that the latent states can be fine-tuned for action prediction.</p>
      </abstract>
      <kwd-group>
        <kwd>Real-Time Strategy Games</kwd>
        <kwd>Self-Supervised Learning</kwd>
        <kwd>Representation Learning</kwd>
        <kwd>Graph Neural Networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Real-Time Strategy (RTS) games such as StarCraft and Age of Empires are a popular and successful genre
of video games that is also valuable for evaluating solutions to AI research problems. For example,
RTS games have been used for evaluating methods in planning [1, 2, 3], plan and goal recognition [4, 5, 6],
and reinforcement learning [7, 8, 9].</p>
      <p>A core component of any RTS game is its game state, which is a snapshot of the game world at some
given time. This game state includes characteristics about the player and their opponents, resources on
the current game’s map, and the terrain of the map. AI methods, particularly machine/deep learning
methods, require a latent representation of the game state that encodes salient factors from the state,
which allows them to perform their task. This representation is usually created using a neural encoder,
which takes the game state and projects it into a latent space. Prior work by Jin [10] showed that
graphs can be used as a game state representation that transfers across game maps from the RTS game
microRTS [11]. A graph is a useful way to model game states in RTS games as graphs can represent relational
and relative spatial information across units on a game map. Jin [10] then used a Gated Recurrent Unit
(GRU) [12] encoder, contained within a Deep Reinforcement Learning (DRL) agent, to project the graph
state into a latent space. However, the GRU encoder was specifically trained for the task of DRL-based
game-playing, which limits its generality to other downstream tasks for RTS games. Additionally, GRUs
are designed to work over sequential data (e.g., text [12]) instead of graph data, while Graph Neural
Networks (GNNs) are designed to process graphs (e.g., knowledge graphs [13]).</p>
      <p>This paper presents an initial study on pretraining a transferable GNN graph state encoder over a
set of microRTS game replays (code is available at https://github.com/Teravolt/microrts-graph-state-encoder-gssl).
More specifically, we contribute a pretrained graph state encoder trained
using a variant of the self-supervised learning (SSL) algorithm Distillation with No Labels (DINO) [14],
used in visual representation learning, for learning latent graph game state representations from game
replays. We provide a qualitative analysis of the pretrained latent graph states through a cluster analysis
and a quantitative evaluation of the latent states for the task of action prediction. Our results show
that the latent graph states contain information about microRTS game maps despite not being explicitly
trained on spatial map features, as well as associations that hint at particular types of player behaviors. Our
results also show that these latent states can be successfully transferred to the task of action prediction,
providing a starting point for conducting evaluations across different RTS game tasks in future work.</p>
      <p>This paper is structured as follows. First, we contextualize our work with respect to the literature.
Next, we provide a brief background on microRTS, DINO, and the specific GNN used in this paper, Graph
Attention Network V2 (GATv2) [15]. Next, we describe the microRTS graph game state and how we pretrain
a GNN state encoder over these game states. We then detail our experiments, provide initial results on
the pretrained GNN state encoder, and discuss the results. Finally, we conclude with next steps.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Self-Supervised Graph Learning: There are three major categories of self-supervised learning
over graphs. Generative learning [16] trains models to learn features that reconstruct its input data
distribution. Joint-embedding contrastive learning [17] trains models to align positive (similar) inputs
close together in latent space while pushing negative (dissimilar) inputs away from each other using
explicit positive-positive and positive-negative input pairs. However, joint-embedding contrastive
learning methods tend to be computationally expensive as they require a large number of
positive-negative input pairs to perform well. Joint-embedding non-contrastive learning [18, 19, 20] addresses this
limitation by training models to elicit the same behavior using only positive-positive input pairs. The
closest method to our approach, GraphDINO [19], falls under joint-embedding non-contrastive learning.
GraphDINO is an SSL method that trains a modified Transformer [21] backbone using DINO [14] for
learning representations of 3D neuronal morphologies. DINO trains models to place similar inputs close
together in latent space by perturbing inputs and aligning perturbed data points generated from the
same input in latent space. GraphDINO adapts DINO to spatial graph data structures by introducing
stochastic graph perturbations relevant to spatial neuronal graphs and modifying the Transformer
architecture for graph data while keeping the DINO learning algorithm unchanged. Our work also
leaves the DINO algorithm unchanged, but instead uses a GNN backbone and graph perturbations
relevant to game states, and applies them to learning game state representations for RTS games.</p>
      <p>Bootstrapped Graph Latents (BGRL) [18] and Graph Barlow Twins (GBT) [20] are two other methods
relevant to our work that fall under the joint-embedding non-contrastive learning category. BGRL, GBT,
and GraphDINO conduct SSL by comparing outputs from two neural networks, but these methods differ
in their training paradigms. GraphDINO and BGRL are based on the knowledge distillation learning
paradigm, where a teacher network distills its knowledge into a student network. However, GraphDINO
uses symmetric architectures for both student and teacher while BGRL uses asymmetric architectures.
GBT is based on the redundancy-reduction principle [20], where models are trained to reduce the
redundancy in their representations. Similar to GraphDINO, GBT uses symmetric architectures for its
two networks, but GBT additionally uses the same model weights for the networks.</p>
      <p>Game State Self-Supervised Learning: The goal of SSL for games is to construct general and
informative representations of game states for AI game tasks, such as game-playing, content generation,
and player modeling. Many of the SSL methods studied in the literature are contrastive-based methods.
Trivedi et al. [22] provided a study of the off-the-shelf SSL methods SimCLR, SwAV, and BYOL; to the
best of our knowledge, BYOL is currently the only non-contrastive SSL method that has been studied.</p>
      <sec id="sec-2-1">
        <title>1Code is available at https://github.com/Teravolt/microrts-graph-state-encoder-gssl</title>
        <p>
          GameCLR [23] uses the SimCLR over images to learn game state representations while Trivedi et al.
[24] uses supervised contrastive learning over labeled images. Our work difers in that (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) our work
focuses on graph game states over image game states, and (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) we study learning representations for
RTS games. The closest approach to our work is Knowledge-Enhanced Graph Contrastive Learning
(KEGC) [25], which uses contrastive learning over graphs to address the task of game outcome prediction
in Multiplayer Online Battle Arena games; our work instead uses non-contrastive SSL for RTS games.</p>
        <p>AI in Real-Time Strategy Games: RTS games have been extensively used in prior research as a
way to evaluate challenging AI research problems. A variety of different methods have been used
for addressing AI tasks in RTS games, including planning [1, 3], plan and goal recognition [4, 6], and,
recently, reinforcement learning [7, 9]. Our pretrained graph state encoder can complement many of
these methods and is one avenue for future work. The closest research to our work was done by Jin
[10], who studied the combination of Convolutional Neural Networks (CNNs) and GRUs for extracting
state information for DRL game-playing in microRTS. The CNNs were used to extract spatial information
while the GRUs were used to extract relational information from graphs. Our work differs in that our
graph state representations contain spatial information within the edge features instead of using a
CNN. We also evaluate our GNN state encoder on the task of action prediction (vs. DRL) and study
how to pretrain a GNN over a large-scale dataset of RTS game replays.
        </p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <sec id="sec-3-1">
        <title>3.1. microRTS</title>
        <p>This section describes the microRTS testbed and provides a brief background on GATv2 and DINO.
microRTS is a minimalistic RTS game designed to evaluate AI research in an RTS setting [11, 3]. Figure 1
shows two scripted agents playing against each other on a game map represented as an $h \times w$ discrete
grid. Despite its minimalism and low graphics fidelity, microRTS retains the properties of commercial
RTS games, such as StarCraft, that make them complex from an AI point of view, such as durative
and simultaneous action execution, real-time decision making, large state spaces, and full or partial
observability. Since 2017, IEEE CoG (previously CIG) has hosted the microRTS competition to foster AI
research in game-playing agents for RTS games (https://sites.google.com/site/micrortsaicompetition/introduction?authuser=0).
This paper makes use of deterministic and fully-observable game replay data from the CoG 2019 and 2020 microRTS competitions.</p>
        <p>microRTS has seven core entities, specifically light, heavy, ranged, barracks, base, worker, and resource, the
first six of which can be controlled by players. Resources are currency in microRTS, which are used to
construct entities, and their amount and locations are predefined by a game map. The first three entities
(light, heavy, ranged) are considered offensive units whose purpose is to attack enemy units, and are
created using barracks. Bases are used to construct worker units that harvest resources, and build bases
and barracks. This makes worker entities the backbone of microRTS gameplay, as they are directly or
indirectly responsible for constructing and interacting with the other core entities in the game.</p>
        <p>The microRTS game state is an $h \times h$ or $h \times w$ grid-based game map containing the above entities,
information pertaining to the entities (e.g., health, remaining resources, etc.), and terrain information.
microRTS has six actions that can be done by player-controlled units: idle, move, harvest, produce, return,
and attack. Each action is restricted to certain units, except the idle action, which can be done by all
units. Specifically, the move and attack actions can only be done by worker, light, heavy, and ranged
units. The harvest and return actions can only be done by worker units, while produce can be done by
worker, base, and barracks units.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Distillation with No Labels</title>
        <p>Distillation with no labels (DINO) is an SSL method from computer vision that trains vision neural
networks to learn dense latent representations of images. Figure 2 (left) provides an illustration of DINO
over images. DINO is based on the knowledge distillation learning paradigm, where a Teacher Encoder
distills knowledge into a Student Encoder. More specifically, DINO uses representations from a previous
training checkpoint of the model (Teacher Encoder) as the “label” for training the current version of the
model (Student Encoder). DINO takes an image $x$ from the current batch and constructs two new images
(known as views), $x_1$ and $x_2$, based on a set of domain-general perturbations (e.g., random crop, Gaussian
blur, etc.). These views are provided to student and teacher siamese networks, $g_s$ and $g_t$, to construct
$K$-dimensional latent representations of the views, denoted by $g_s(x_v)$ and $g_t(x_v)$, where $v \in \{1, 2\}$
is the index of the image view. These representations are then passed through a temperature-scaled
softmax to convert the representations into a $K$-dimensional probability distribution:</p>
        <p>
          $P_{s,t}(x_v)^{(i)} = \frac{\exp\left(g_{s,t}(x_v)^{(i)} / \tau_{s,t}\right)}{\sum_{k=1}^{K} \exp\left(g_{s,t}(x_v)^{(k)} / \tau_{s,t}\right)}$ (1)
where $g_{s,t}$ is either the student $g_s$ or the teacher $g_t$, $\tau_{s,t}$ is the corresponding temperature, and $i \in \{1, 2, \ldots, K\}$. This results
in four probability distributions: two for the student ($P_s(x_1)$, $P_s(x_2)$) and two for the teacher ($P_t(x_1)$,
$P_t(x_2)$). The student network is then trained through a cross-entropy loss between the student and
teacher probability distributions, while the teacher network is updated using an Exponential Moving
Average (EMA) of the student network.
        </p>
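        <p>For concreteness, the following is a minimal PyTorch sketch of Equation (1) and the resulting cross-entropy loss; the function and variable names are ours for illustration, not taken from the released code:</p>
        <preformat>
import torch
import torch.nn.functional as F

def dino_probs(logits, temperature, center=None):
    """Temperature-scaled softmax of Equation (1).
    logits: (batch, K) student or teacher outputs; center is subtracted
    from the teacher outputs only (see the centering update below)."""
    if center is not None:
        logits = logits - center
    return F.softmax(logits / temperature, dim=-1)

def dino_loss(s1, s2, t1, t2, tau_s=0.1, tau_t=0.04, center=None):
    """Cross-entropy between teacher and student distributions; the teacher's
    view of x_1 supervises the student's view of x_2, and vice versa."""
    p_t1 = dino_probs(t1, tau_t, center).detach()   # stop-gradient on teacher
    p_t2 = dino_probs(t2, tau_t, center).detach()
    log_s1 = torch.log(dino_probs(s1, tau_s))
    log_s2 = torch.log(dino_probs(s2, tau_s))
    return -0.5 * ((p_t1 * log_s2).sum(-1).mean() + (p_t2 * log_s1).sum(-1).mean())
        </preformat>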
        <p>DINO uses sharpening and centering of the teacher network’s outputs to prevent the learned
representations from collapsing to a single dominant dimension, or to a uniform distribution regardless of the
input. Sharpening is implemented by setting $\tau_t$ to a low value (in our experiments, $\tau_t = 0.04$), while
centering is implemented by subtracting a center value $c$ from the teacher’s outputs. This center value
is computed using first-order statistics of the teacher’s outputs in a batch, and is updated every training
iteration using EMA:</p>
        <p>
          $c = m \cdot c + (1 - m) \cdot \frac{1}{B} \sum_{i=1}^{B} \left[\, g_t(x_{i,1}) \mid g_t(x_{i,2}) \,\right]$
where $m \in [0, 1]$ is the EMA rate parameter, $\mid$ is a concatenation, $B$ is the batch size, and $x_{i,1}$ and $x_{i,2}$
are the two views for the image at batch index $i$.</p>
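        <p>A sketch of the two EMA updates described above, again with our own function names; the teacher update mirrors the standard DINO formulation:</p>
        <preformat>
import torch

@torch.no_grad()
def update_center(center, teacher_out_1, teacher_out_2, m=0.9):
    """EMA update of the center c. Concatenating the two views along the
    batch dimension and averaging implements the batch mean of
    [g_t(x_{i,1}) | g_t(x_{i,2})] from the equation above."""
    batch_mean = torch.cat([teacher_out_1, teacher_out_2], dim=0).mean(dim=0)
    return m * center + (1 - m) * batch_mean

@torch.no_grad()
def update_teacher(student, teacher, ema_rate=0.996):
    """EMA update of the teacher's parameters from the student's."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(ema_rate).add_(p_s, alpha=1.0 - ema_rate)
        </preformat>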
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Graph Attention Network V2</title>
        <p>The main contribution of our work is a transferable pretrained graph neural network (GNN) state
encoder. A GNN is a network designed to process graphs; GNNs have been applied to computer
games [10], neuroscience [19], and other domains. GNNs can create latent node, edge, and graph representations, which
can subsequently be used for a variety of downstream graph-oriented tasks, including node, edge, and
graph classification and regression. A single-layer GNN can capture neighboring relationships for each
of the nodes in a graph, while adding additional layers can help model deeper node relationships. The
GNN that we focus on in this paper is Graph Attention Network V2 (GATv2) [15].</p>
        <p>GATv2 takes in a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, \ldots, N\}$ is a set of nodes, each containing
a set of features $h_v^0 \in \mathbb{R}^d$, and $\mathcal{E}$ is the set of directed edges $(u, v)$ from node $u$ to node $v$, each
optionally containing edge features $e_{u,v} \in \mathbb{R}^m$. GATv2 uses message passing to capture shallow and
deep relationships between nodes in the graph. More specifically, for each node $v$ in the graph, message
passing aggregates (via summation) scaled features from neighboring nodes $u$, and integrates them
into the node. This scaling is a normalized attention score stating the importance of the neighboring node’s
features to $v$, where the normalization is done over all of $v$’s neighbors.</p>
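        <p>As an illustration, a single GATv2 layer with edge features can be instantiated as follows; this sketch assumes PyTorch Geometric's GATv2Conv and is not a statement about our implementation:</p>
        <preformat>
import torch
from torch_geometric.nn import GATv2Conv

# A single GATv2 layer: for each node v, attention scores over its neighbors u
# are computed from the node and edge features, softmax-normalized over the
# neighborhood, and used to aggregate the neighbors' features into v.
num_nodes, node_dim, edge_dim, hidden_dim = 4, 10, 2, 256
conv = GATv2Conv(node_dim, hidden_dim, edge_dim=edge_dim)

x = torch.randn(num_nodes, node_dim)          # node features h_v^0
edge_index = torch.tensor([[0, 1, 2, 3],      # source nodes u
                           [1, 0, 3, 2]])     # destination nodes v
edge_attr = torch.randn(edge_index.size(1), edge_dim)  # edge features e_{u,v}
h1 = conv(x, edge_index, edge_attr=edge_attr)          # new node embeddings
        </preformat>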
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Learning Graph State Representation for microRTS</title>
      <p>Graphs allow us to model the relational and relative spatial information across microRTS entities by
representing them as nodes, with edges defining relations between them. Specifically, we represent an
observed game state in microRTS as a fully-connected graph, where each node in the graph is an entity
with a set of game features relevant to the entity. These node features are a one-hot encoding of the
unit’s type (worker, base, barracks, heavy, light, ranged, resource), concatenated with the amount of
resources, hit points, and an indicator of whether this is a player’s unit; if a feature is not available for
the node’s unit, that feature is 0.</p>
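      <p>A sketch of this feature construction, with hypothetical field names (the released code may organize these features differently):</p>
      <preformat>
import numpy as np

UNIT_TYPES = ["worker", "base", "barracks", "heavy", "light", "ranged", "resource"]

def node_features(unit_type, resources, hit_points, is_player_unit):
    """One-hot unit type, concatenated with resources, hit points, and an
    ownership flag; unavailable features are left at 0."""
    one_hot = np.zeros(len(UNIT_TYPES), dtype=np.float32)
    one_hot[UNIT_TYPES.index(unit_type)] = 1.0
    extras = np.array([resources, hit_points, float(is_player_unit)], dtype=np.float32)
    return np.concatenate([one_hot, extras])

node_features("worker", resources=1, hit_points=1, is_player_unit=True)
      </preformat>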
      <p>
        The edges in the graph encode relative spatial information through Euclidean distance and relative
directionality between entities in the game state. The relative directionality is the angle (in radians)
between a source and destination node, computed by (1) taking the inverse tangent between the
destination and source node, and (2) adjusting the angle based on the positioning of the two nodes with
respect to the orientation of the game map. For example, if the destination node is to the top-left of the
source node, then the relative angle will fall under $[\pi/2, \pi]$ (the second quadrant of an x,y grid). Table 1
provides a summary of the node and edge features used in the microRTS game state.
      </p>
      <p>[Figure 3: The graph state encoder. Two GATv2 + Linear Residual blocks produce node embeddings
(the node state), which are averaged into a graph embedding (the graph state); two linear projection
layers produce the projection embedding used to train via DINO.]</p>
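      <p>A minimal sketch of the edge feature computation described above; using atan2 is our simplification of the two-step inverse-tangent-plus-adjustment procedure, since it performs the quadrant adjustment directly:</p>
      <preformat>
import math

def edge_features(src_pos, dst_pos):
    """Euclidean distance and relative direction (radians) from src to dst.
    With +y pointing up, a destination to the top-left of the source yields
    an angle in [pi/2, pi]; the modulo maps all angles into [0, 2*pi)."""
    dx = dst_pos[0] - src_pos[0]
    dy = dst_pos[1] - src_pos[1]
    distance = math.hypot(dx, dy)
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return distance, angle
      </preformat>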
      <sec id="sec-4-1">
        <title>4.1. Graph Self-Supervised Learning</title>
        <p>We want a GNN to learn to project semantically-similar game state graphs close together in latent
space. Contrastive learning approaches [17] can fulfill this goal, but these approaches tend to be
very computationally expensive. Instead, following previous work on GraphDINO [19], we use the
self-supervised learning approach DINO to learn latent node and graph state representations. Figure 2
(right) provides an illustration of how graphs are processed by GraphDINO and our approach. Both
approaches train an encoder model (Student Encoder in Figure 2) to place semantically-similar graphs
close together in latent space by having the model predict the latent space it previously learned (Teacher
Encoder in Figure 2). However, GraphDINO and our approach diverge in what and how graphs are
processed by DINO. Thus, these methods can be viewed as two variations of graph SSL using DINO.</p>
        <p>More specifically, there are two major differences between GraphDINO and our approach. The first
difference is the type of encoder model architecture that is trained by DINO (the how difference). Since
we are working with graphs, we use GNNs over Transformers, as GNNs are designed specifically for
graphs. Figure 3 provides a visualization of our graph state encoder, a network with two GATv2 [15]
GNN and residual connection blocks (GATv2 + Linear Residual Blocks 1 and 2 in Figure 3). The state
encoder takes, as input, a graph $\mathcal{G}^0 = (\mathcal{V}, \mathcal{E})$ with node features $h_v^0 \in \mathbb{R}^d$ and edge features $e_{u,v} \in \mathbb{R}^m$ ($u, v \in \mathcal{V}$),
and passes the graph through a GATv2 layer while simultaneously passing the node embeddings
through a linear residual layer. Both the GATv2 and linear residual layer output node embeddings,
which are then added together to yield new node embeddings $h_v^1$ ($\forall v \in \mathcal{V}$) for a graph $\mathcal{G}^1 = (\mathcal{V}, \mathcal{E})$ that
is topologically the same as $\mathcal{G}^0$. Next, $\mathcal{G}^1$ and its node embeddings are passed into a second GATv2
and residual layer, respectively, and their node embeddings are added together, resulting in $\mathcal{G}^2$ with
new node embeddings; these node embeddings $h_v^2$ ($\forall v \in \mathcal{V}$) are then averaged together to get a graph
embedding. Next, this graph embedding is passed through two projection layers (linear layers) to get a
final projection embedding that is used by DINO to learn node and graph representations (Specific to
DINO in Figure 3). After training, the two projection layers are removed from the encoder, and the node
and graph embeddings are used for downstream tasks.</p>
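        <p>A sketch of this architecture, assuming PyTorch Geometric's GATv2Conv; layer sizes and the activation choice in the projection head are illustrative rather than our exact configuration:</p>
        <preformat>
import torch
import torch.nn as nn
from torch_geometric.nn import GATv2Conv

class GraphStateEncoder(nn.Module):
    """Two GATv2 + linear residual blocks, mean pooling, and a two-layer
    projection head that is discarded after DINO pretraining."""

    def __init__(self, node_dim, edge_dim, hidden_dim=256):
        super().__init__()
        self.gat1 = GATv2Conv(node_dim, hidden_dim, edge_dim=edge_dim)
        self.res1 = nn.Linear(node_dim, hidden_dim)
        self.gat2 = GATv2Conv(hidden_dim, hidden_dim, edge_dim=edge_dim)
        self.res2 = nn.Linear(hidden_dim, hidden_dim)
        self.proj = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
                                  nn.Linear(hidden_dim, hidden_dim))

    def forward(self, x, edge_index, edge_attr):
        h1 = self.gat1(x, edge_index, edge_attr=edge_attr) + self.res1(x)
        h2 = self.gat2(h1, edge_index, edge_attr=edge_attr) + self.res2(h1)
        g = h2.mean(dim=0)       # graph embedding for a single graph; use
                                 # global_mean_pool for batched graphs
        return h2, g, self.proj(g)   # node state, graph state, projection
        </preformat>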
        <p>The second difference is the type of graph (a graph representing a game state) and the stochastic perturbations
conducted over these state graphs (the what difference). General graph perturbations, such as node
feature masking and edge masking [18], can result in changes to the semantics of a state graph.
For example, if we mask the resources feature of a resource node in the graph game state, this would
indicate that the resource does not exist in the state, which changes what that state means (i.e., high → low
resource state). We explore three stochastic graph perturbations that maintain the semantics of
the original graph: distance scaling, rotation offset, and edge removal.
• Distance Scaling Perturbation (changes edge features): The relative distance between nodes is
scaled by a factor $s \in [a, b)$; in our experiments, $a = 0.5$ and $b = 1.5$.
• Rotation Offset Perturbation (changes edge features): The relative orientation of all nodes is
offset by a factor $r \in [0, 2\pi)$.
• Edge Removal: Edges are removed with a probability $p$ following a Bernoulli distribution; in our
experiments, $p = 0.5$.</p>
        <p>The parameters for each perturbation are based on human judgement of what could significantly change
the original graph state while not changing the semantics of the state.</p>
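        <p>A sketch of the three perturbations applied jointly to the edge features of a state graph; the data layout and function name are hypothetical, while the parameter values follow the text:</p>
        <preformat>
import math
import random

def perturb_graph(edges, a=0.5, b=1.5, p=0.5):
    """Produce one stochastic view of a state graph. `edges` maps a
    (src, dst) pair to its (distance, angle) edge features."""
    scale = random.uniform(a, b)                # distance scaling factor
    offset = random.uniform(0.0, 2 * math.pi)   # rotation offset
    view = {}
    for (u, v), (dist, angle) in edges.items():
        if random.random() >= p:                # keep edge with probability 1 - p
            view[(u, v)] = (dist * scale, (angle + offset) % (2 * math.pi))
    return view
        </preformat>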
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>We now study the graph game state representations learned by DINO. First, we provide a qualitative
cluster analysis, which allows us to understand what patterns and associations are contained in the
representations learned by DINO and the graph state encoder. Second, we conduct a quantitative
evaluation of the learned representations’ performance on the task of action prediction to understand
the transferability of the representations on a game AI task.</p>
      <p>We use a series of microRTS replay datasets for these experiments, generated from two-player,
deterministic, and fully-observable games. Specifically, we use replay data from the CoG 2019 and 2020 microRTS
competitions as these were readily available to the authors for use. Each replay in the datasets consists
of a sequence of state-action pairs conducted by the two players. We parse each replay by skipping the
first 200 state-action pairs and extracting every 10th state-action pair for a maximum of 64 pairs. This is
done so we can acquire diverse states for training and evaluation.</p>
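      <p>In code, this subsampling amounts to a simple slice (a sketch, assuming a replay has already been parsed into a list of state-action pairs):</p>
      <preformat>
def subsample_replay(pairs, skip=200, stride=10, max_pairs=64):
    """Skip the first 200 state-action pairs, then keep every 10th pair,
    up to 64 pairs per replay."""
    return pairs[skip::stride][:max_pairs]
      </preformat>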
      <p>Training Dataset: The training dataset for the graph state encoder comes from the CoG 2019 microRTS
competition replay data (https://sites.google.com/site/micrortsaicompetition/competition-results/2019-cog-results).
The competition consisted of three tracks: standard (deterministic and
fully-observable), non-deterministic, and partially-observable. Each track involved a five-iteration round-robin
tournament where 12 agents played on eight open maps and four hidden maps; see Table 2 for the
maps and agents. We use the first three iterations of the standard track for training (without the hidden
maps), which contains 6336 replays, and ignore replays containing the RandomBiasedAI agent, as its
random actions will frequently yield states that are noise to the graph state encoder, resulting in 5808
replays and 38934 game states.</p>
      <sec id="sec-5-1">
        <title>3https://sites.google.com/site/micrortsaicompetition/competition-results/2019-cog-results</title>
        <p>Evaluation Dataset 1: The first evaluation dataset consists of the last two iterations of the standard
track of the CoG 2019 microRTS competition replay data, and contains 4224 replays (see Table 2 for the
maps and agents). Similar to the training dataset, we ignore replays containing the RandomBiasedAI,
resulting in 3872 replays and 17813 game states.</p>
        <p>Evaluation Dataset 2: The second evaluation dataset consists of replay data from the last two
iterations of the CoG 2020 microRTS competition (https://sites.google.com/site/micrortsaicompetition/competition-results/2020-cog-results).
Similar to the CoG 2019 competition, there are three
tracks: standard, non-deterministic, and partially-observable. Each track involved a five-iteration round-robin
tournament where 10 agents played on eight open maps and four hidden maps; see Table 2 for
the specific maps and agents. We run our evaluations on all maps from the last two iterations of the
standard track. Similar to the first evaluation dataset, we ignore replays containing the RandomBiasedAI,
resulting in 3888 replays and 26221 game states.</p>
        <p>Pretraining Configuration: The graph state encoder is configured with a hidden dimension of 256
and is trained for 10 epochs with a batch size of 16. We use the AdamW optimizer [26] with a learning
rate computed using a cosine scheduler with a linear warmup of 1000 steps and an initial learning rate of
4e-7. Specific to DINO, the teacher network is updated using an EMA of the student network, where the
EMA rate is $\lambda = 0.996$, the student and teacher temperatures in Equation 1 are $\tau_s = 0.1$ and $\tau_t = 0.04$,
and the EMA rate parameter for centering is $m = 0.9$ [14].</p>
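        <p>One plausible reading of this optimizer configuration, sketched with PyTorch's built-in schedulers; the total step count, warmup shape, and the `encoder` instance (the sketch from Section 4.1) are assumptions:</p>
        <preformat>
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

total_steps, warmup_steps = 25_000, 1_000        # illustrative step counts
encoder = GraphStateEncoder(node_dim=10, edge_dim=2, hidden_dim=256)

optimizer = torch.optim.AdamW(encoder.parameters(), lr=4e-7)
scheduler = SequentialLR(
    optimizer,
    schedulers=[LinearLR(optimizer, total_iters=warmup_steps),          # linear warmup
                CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)],
    milestones=[warmup_steps],
)
# scheduler.step() is called once per training step.
        </preformat>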
        <sec id="sec-5-1-1">
          <title>5.1. Cluster Analysis</title>
          <p>We cluster the graph embeddings generated by our graph state encoder over Datasets 1 and 2 using
$k$-medoids, an off-the-shelf clustering algorithm that clusters datapoints into $k$ partitions and is less
susceptible to outlier datapoints compared to $k$-means. To find the optimal number of clusters for each
dataset, we compute the silhouette score for clusters ranging from 3 to 15, and use the elbow method to
find an appropriate number of clusters. The elbow method is a visual heuristic that provides us with
the largest number of clusters before the silhouette score starts to plateau. We use t-SNE [27] to reduce
the dimensionality of the graph embeddings for visualization.</p>
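          <p>A sketch of this analysis pipeline, assuming scikit-learn for the silhouette score and t-SNE and scikit-learn-extra for $k$-medoids (our actual tooling may differ); the embedding file name is hypothetical:</p>
          <preformat>
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids   # from scikit-learn-extra

embeddings = np.load("graph_embeddings.npy")   # hypothetical dump of graph states

# Silhouette score for k = 3..15; pick the "elbow": the largest k before
# the score starts to plateau.
scores = {}
for k in range(3, 16):
    labels = KMedoids(n_clusters=k, random_state=0).fit_predict(embeddings)
    scores[k] = silhouette_score(embeddings, labels)

# 2-D t-SNE projection of the embeddings for the cluster visualizations.
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
          </preformat>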
          <p>Figure 4 provides the silhouette score for 3 to 15 clusters over Datasets 1 and 2; using the elbow
method, we see that 7 clusters provide a good fit over Dataset 1 while 8 clusters provide a good fit over
Dataset 2. Figure 5 provides a t-SNE visualization of the 7 clusters over Dataset 1. The visualization shows
good separation across all clusters. To understand what could be contained in the clusters, we look at the
distribution of various game features contained in each of the clusters. Figure 6 contains the average
count of player units in each of the clusters. Overall, we see that heavy and ranged units are rarely
contained in the clusters, while worker units are seen frequently in all clusters except for Cluster 4.
The frequency of worker units makes sense as workers are directly and indirectly required for many of
the core gameplay features in microRTS: harvesting and building bases, barracks, and offensive units.</p>
        <p>While Cluster 4 lacks worker units, we do see that Cluster 4 has the highest number of bases and
barracks. Figure 7 provides the distribution of game maps across the clusters and we see that Cluster 4
has a high concentration of game states from BroodWar/(4)BloodBath.scmB (64x64 game map), which is
a large enough map to allow for expansion and construction of multiple bases and barracks. We also
see that other clusters have a high concentration of particular game maps. For example, Figure 7 shows
that FourBasesWorkers8x8, BWDistantResources32x32, and DoubleGame24x24 are most frequent in Cluster
1, Cluster 5, and Cluster 6, respectively. This result implies that the graph state encoder’s latent space
may contain information about game maps despite not being explicitly trained on spatial map features.</p>
        <p>Figure 8 provides a cluster visualization of the learned latent space for Dataset 2. This visualization
shows less clear separation of clusters in contrast to Figure 5, particularly for Clusters 0 and 7. However,
the clusters are still separate, indicating that the graph state encoder was able to discover and learn
hidden patterns in the graph states from a completely different microRTS competition. Figure 9 contains the
count of units in each of the clusters. Our results show that Cluster 3 contains mainly offensive units
(light, heavy, ranged) and barracks, which could indicate offense-heavy game states or states from large
game maps, as heavy offensive units are rarely useful in small maps due to their cost and construction
time. Looking at the map distribution in Figure 10, we see a high concentration of data points from
Cluster 3 for the maps armyGeneration16x16 (16x16 game map) and BroodWar/(4)BloodBath.scmB,
confirming our hypothesis. We also see in Figure 10 that NoWhereToRun9x8 and DoubleGame24x24 are
most frequently seen in Cluster 0 and Cluster 6, respectively, providing further evidence that the latent
space learned by the graph state encoder may contain information about game maps. We also see, in
Figure 9, that Cluster 1 has a low number of worker units in comparison to all other clusters except
Cluster 3, but has the second-highest concentration of light and ranged units, and barracks. This could
indicate that Cluster 1 contains states from agents using a rush strategy, as these units are quick and
cheap to construct.</p>
        <sec id="sec-5-2-1">
          <title>5.2. Action Prediction</title>
          <p>We now turn to evaluating the pretrained graph state encoder on the task of action prediction. This task
requires a method to determine the type of action (idle, move, harvest, return, produce, and attack) and
its parameters (e.g., direction, attack position) for each observable unit in a given microRTS game state. For
our experiments, we represent an action as a feature vector, similar to Huang et al. [8] and Goodfriend
[9]. This action feature vector is a binary vector, where different subsets of the vector represent different
parts of an action, and is the same size across all microRTS unit types; Figure 11 provides an illustration of
the action feature vector and its subsets. An action is represented using this vector by setting its action
type and parameters to 1. For example, if we want to specify an action to move up, then we would have
the following vector: [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, . . . , 0]. This makes the action prediction task a multilabel
classification task.</p>
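          <p>A sketch of how such a vector can be assembled for the move-up example above; the subset ordering and total vector size here are assumptions for illustration:</p>
          <preformat>
import numpy as np

ACTION_TYPES = ["idle", "move", "harvest", "return", "produce", "attack"]
MOVE_DIRS = ["up", "right", "down", "left"]   # ordering is our assumption

def encode_move(direction, vector_size):
    """Multilabel action vector for a move action: set the action-type bit
    and the chosen direction bit to 1, leaving all other subsets at 0."""
    v = np.zeros(vector_size, dtype=np.float32)
    v[ACTION_TYPES.index("move")] = 1.0
    v[len(ACTION_TYPES) + MOVE_DIRS.index(direction)] = 1.0
    return v

encode_move("up", 16)   # -> [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, ...]
          </preformat>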
          <p>The action vector subset relative attack position (rightmost subset in Figure 11) is specifically used by
the attack action to denote attack positions for a unit that can attack. This subset has a dimension of
$d^2$, where $d = 2r + 1$ and $r$ is the maximum attack range over all possible unit types, and represents
a $d \times d$ attack grid around the unit. When a unit does an attack action, we use the distance of
the offensive unit to the position being attacked to determine the attack grid position and assign that
position to 1. For example, if the maximum attack range is 1 and the offensive unit is attacking to
its upper-left position, then we would have a $3 \times 3$ attack grid around the offensive unit that will be
attacking position (0, 0), which corresponds to element 0 in the vector subset.</p>
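          <p>A sketch of the mapping from a relative attack position to an index in this subset, under our assumed row-major layout with a top-left origin:</p>
          <preformat>
def attack_grid_index(attacker_pos, target_pos, max_range):
    """Index into the d x d relative attack grid (d = 2 * max_range + 1),
    row-major with cell (0, 0) at the attacker's upper-left; assumes grid
    coordinates with y increasing downward."""
    d = 2 * max_range + 1
    col = target_pos[0] - attacker_pos[0] + max_range
    row = target_pos[1] - attacker_pos[1] + max_range
    return row * d + col

# With max_range = 1, attacking the upper-left cell gives grid cell (0, 0),
# i.e., element 0 of the subset, matching the example above.
assert attack_grid_index((5, 5), (4, 4), max_range=1) == 0
          </preformat>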
          <p>We construct an action prediction model for this task by connecting the pretrained graph state encoder
to a single task-specific linear prediction layer. The graph state encoder provides node embeddings
for each unit in the game state (Node Embeddings box in Figure 3), all of which are then passed into
the linear prediction layer to predict the unit’s action. We note that this linear prediction layer is not
pretrained by DINO and is only trained for the task of action prediction. This action prediction model is
trained for evaluation in two ways: linear probing and fine-tuning. Linear probing assesses the strength
of the pretrained encoder’s representation by freezing the parameters of the graph state encoder and
only training the final linear prediction layer. Fine-tuning assesses the adaptability of the pretrained
encoder’s representation by jointly training both the encoder and linear prediction layer.</p>
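          <p>The difference between the two regimes reduces to which parameters receive gradients; a sketch, reusing the encoder sketch from Section 4.1 (the action-vector size here is a placeholder):</p>
          <preformat>
import torch.nn as nn

encoder = GraphStateEncoder(node_dim=10, edge_dim=2, hidden_dim=256)
head = nn.Linear(256, 78)   # task-specific linear prediction layer

# Linear probing: freeze the pretrained encoder and train only the head.
for p in encoder.parameters():
    p.requires_grad = False

# Fine-tuning: unfreeze everything and train encoder and head jointly.
for p in encoder.parameters():
    p.requires_grad = True
          </preformat>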
          <p>We use four baselines for our evaluation. The first baseline replaces the graph state encoder with
three linear layers, where each layer is followed by ReLU activations. This baseline allows us to assess
the benefit of using a graph state representation and GNNs, and is denoted as action prediction (no
GNN). The second baseline uses a randomly-initialized graph state encoder, which we denote by action
prediction random init. The final two baselines use a graph state encoder pretrained by Bootstrapped
Graph Latents (BGRL) [18] and Graph Barlow Twins (GBT) [20], which we use to compare our approach
against previous graph self-supervised learning methods.</p>
          <p>We train all action prediction models to conduct multilabel classification over the binary action
vectors from Dataset 1. More specifically, for each game state in the replay dataset, we extract the
binary action vectors for all units in the state, and have the action prediction models predict each unit’s
action vector. We then evaluate all models over Dataset 2. To accurately evaluate the models, we want
to minimize the number of graph topologies in our evaluation dataset that were seen during training.
The graph game state represents relations between units, whose topology directly depends on the
gameplay behaviors of the agents. Thus, we remove RandomBiasedAI, POWorkerRush, POLightRush,
NaiveMCTS, and UTS_Imass_SocketAI from Dataset 2 for evaluation as these agents were seen during
training, resulting in an evaluation dataset containing 2160 replays and 9434 game states.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Our qualitative and quantitative results showed that the graph state encoder trained by DINO learns
meaningful features that successfully transferred to the task of action prediction in microRTS. In particular,
our qualitative cluster analysis showed that the latent graph states learned by the state encoder contain
information about microRTS game maps. This is particularly interesting because our graph representation
only provides map information directly through resources and their counts; any spatial information is
only provided through unit formations defined by relative distance and orientation between units in
the game state. This implies that the state encoder acquired information about the game maps without
explicit spatial map features such as terrain or map geometry. This finding is very important for the
applicability of game-playing and player modeling approaches for microRTS, as spatial features are usually
acquired from state encoders that directly process the spatial aspect of game maps (e.g., CNNs), but
these models struggle to apply across the non-uniform map sizes (height and width) seen in microRTS [10].
Our graph state encoder is applicable across these maps, providing game-playing and player modeling
approaches an alternative for encoding game states. One next step would be to understand any spatial
features learned by our graph state encoder, and assess performance differences between spatial features
learned by the encoder and models that directly process game maps.</p>
      <p>Our quantitative results showed that the graph state encoder can be successfully transferred to
the task of action prediction when fine-tuned, even when the encoder was pretrained on data from
a single microRTS competition (the CoG 2019 competition). The actions we used for the action prediction
experiments are the same as those used in prior work for DRL in microRTS [8, 9], which provides a signal
that the encoder could be transferable to DRL. However, the encoder DINO learned struggled to
perform well on the task of action prediction under linear probing. This could be due to two possible
issues. First, DINO, which trains the encoder to learn general graph representations, may have been too
general or not amenable to action prediction. While pretrained models should ideally learn task-general
representations, these representations should also be helpful for any downstream task. Second, there
may be a sizable distribution shift between the pretraining data and the data used for action prediction.
This second issue can be remediated by training on diverse replays from other, more recent microRTS
competitions, which we are working on acquiring for future work.</p>
      <p>Our results also demonstrated that graph state representations can be helpful for action prediction in
microRTS. Graphs encode relationships through their nodes and edges, which allows us to represent team
formations, but other modalities exist that encode different types of information. For example, images
provide spatial and view information, which can be helpful for understanding geometry. We believe
that graphs can encode one piece of the overall game state, but other modalities will be needed to fill in
information gaps left by graph representations. Thus, another area of future work would be to learn
multimodal state encoders that could build a “complete" understanding of the game state.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This paper presents initial work on pretraining latent graph state representations of game states in
microRTS. Our results show that our graph representation and graph state encoder are beneficial for the
task of action prediction, performing similarly to or better than several baselines when the encoder was
fine-tuned. Our immediate next steps will be to conduct an in-depth evaluation of the graph state encoder
and representations on tasks such as DRL and player modeling. We would also like to gather additional
microRTS replay data, conduct systematic hyperparameter tuning of the graph state encoder, and scale the
encoder for more representational power. Finally, a long-term goal of this work is to study how state
models can be pretrained for AI tasks such that they are generalizable and transferable across a diverse set of
tasks and video games.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The authors would like to give a special thanks to Dr. Santiago Ontañón for access to the CoG 2019 and
2020 microRTS competition data. The authors would also like to thank our reviewers for their insightful
and helpful feedback.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <sec id="sec-9-1">
        <title>The authors have not employed any Generative AI tools.</title>
        <p>[3] J. R. Marino, R. O. Moraes, C. Toledo, L. H. Lelis, Evolving action abstractions for real-time planning
in extensive-form games, in: Proceedings of the 2018 AAAI Conference on Artificial Intelligence,
2018, pp. 2330–2337.
[4] G. Synnaeve, P. Bessiere, A bayesian model for plan recognition in rts games applied to starcraft,
in: Proceedings of 7th AAAI Conference on AIIDE, 2011, pp. 79–84.
[5] G. Synnaeve, P. Bessiere, A bayesian model for opening prediction in rts games with application
to starcraft, in: Proceedings of 2011 IEEE-CIG, 2011, pp. 281–288.
[6] P. Kantharaju, S. Ontañón, C. W. Geib, Scaling up ccg-based plan recognition via monte-carlo
tree search, in: 2019 IEEE Conference on Games (CoG), 2019, pp. 1–8. doi:10.1109/CIG.2019.
8848013.
[7] O. Vinyals, I. Babuschkin, W. M. Czarnecki, et al., Grandmaster level in starcraft ii using
multiagent reinforcement learning, Nature 575 (2019) 350–354.
[8] S. Huang, S. Ontañón, C. Bamford, L. Grela, Gym-µRTS: Toward affordable full game real-time
strategy games research with deep reinforcement learning, in: 2021 IEEE Conference on Games
(CoG), 2021, pp. 1–8. doi:10.1109/CoG52621.2021.9619076.
[9] S. Goodfriend, A competition winning deep reinforcement learning agent in microrts, in: 2024 IEEE</p>
        <p>Conference on Games (CoG), IEEE, 2024, pp. 1–8. doi:10.1109/cog60054.2024.10645610.
[10] Y. Jin, Advanced Deep Reinforcement Learning for Real-Time Strategy Games, Ph.D. thesis, School
of Electronic Engineering and Computer Science Queen Mary University of London, 2024.
[11] S. Ontañón, The combinatorial multi-armed bandit problem and its application to real-time
strategy games, in: Proceedings of the 9th AAAI Conference on AIIDE, 2013, pp. 58–64.
[12] K. Cho, B. van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine
translation: Encoder–decoder approaches, in: D. Wu, M. Carpuat, X. Carreras, E. M. Vecchi
(Eds.), Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical
Translation, Association for Computational Linguistics, Doha, Qatar, 2014, pp. 103–111. URL:
https://aclanthology.org/W14-4012/. doi:10.3115/v1/W14-4012.
[13] Z. Ye, Y. J. Kumar, G. O. Sing, F. Song, J. Wang, A comprehensive survey of graph neural networks for
knowledge graphs, IEEE Access 10 (2022) 75729–75741. doi:10.1109/ACCESS.2022.3191784.
[14] M. Caron, H. Touvron, I. Misra, H. Jegou, J. Mairal, P. Bojanowski, A. Joulin, Emerging properties
in self-supervised vision transformers, in: 2021 IEEE/CVF International Conference on Computer
Vision (ICCV), 2021, pp. 9630–9640. doi:10.1109/ICCV48922.2021.00951.
[15] S. Brody, U. Alon, E. Yahav, How attentive are graph attention networks?, in: International</p>
        <p>Conference on Learning Representations, 2022, pp. 1–26.
[16] J. Park, H. Jung, H. Park, Cimage: Exploiting the conditional independence in masked graph
auto-encoders, in: Proceedings of the Eighteenth ACM International Conference on Web Search
and Data Mining, WSDM ’25, Association for Computing Machinery, New York, NY, USA, 2025,
pp. 10–19. URL: https://doi.org/10.1145/3701551.3703515. doi:10.1145/3701551.3703515.
[17] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, Y. Shen, Graph contrastive learning with augmentations,
in: Proceedings of the Thirty-Fifth Annual Conference on NeurIPS, 2021, pp. 1–12.
[18] S. Brody, U. Alon, E. Yahav, Large-scale representation learning on graphs via bootstrapping, in:</p>
        <p>International Conference on Learning Representations, 2022, pp. 1–21.
[19] M. A. Weis, L. Pede, T. Lüddecke, A. S. Ecker, Self-supervised graph representation learning for
neuronal morphologies, TMLR (2023). URL: https://openreview.net/forum?id=ThhMzfrd6r.
[20] P. Bielak, T. Kajdanowicz, N. V. Chawla, Graph barlow twins: A self-supervised representation
learning framework for graphs, Knowledge-Based Systems 256 (2022) 109631. URL: http://dx.doi.
org/10.1016/j.knosys.2022.109631. doi:10.1016/j.knosys.2022.109631.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Łukasz Kaiser, I. Polosukhin,</p>
        <p>Attention is all you need, in: Advances in NeurIPS, 2017, pp. 5998–6008.
[22] C. Trivedi, K. Makantasis, A. Liapis, G. N. Yannakakis, Learning task-independent game state
representations from unlabeled images, in: 2022 IEEE Conference on Games (CoG), IEEE Press,
2022, pp. 88–95. doi:10.1109/CoG51982.2022.9893648.
[23] C. Trivedi, K. Makantasis, A. Liapis, G. N. Yannakakis, Game state learning via game scene
augmentation, in: Proceedings of the 17th International Conference on the Foundations of Digital
Games, FDG ’22, ACM, New York, NY, USA, 2022, pp. 1–4. doi:10.1145/3555858.3555902.
[24] C. Trivedi, A. Liapis, G. N. Yannakakis, Contrastive learning of generalized game representations,
2021. URL: https://arxiv.org/abs/2106.10060. arXiv:2106.10060.
[25] J. Jiang, L. Wu, Z. Hu, R. Wu, X. Shen, H. Zhao, Knowledge enhanced graph contrastive learning for
match outcome prediction, Information Processing and Management 62 (2025) 104010. doi:10.1016/
j.ipm.2024.104010.
[26] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, 2019. URL: https://arxiv.org/abs/
1711.05101. arXiv:1711.05101.
[27] L. Van der Maaten, G. Hinton, Visualizing data using t-sne, JMLR 9 (2008).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ontañón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Buro</surname>
          </string-name>
          ,
          <article-title>Adversarial hierarchical task network planning for complex real-time games</article-title>
          ,
          <source>in: Proceedings of the 24th IJCAI</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1652</fpage>
          -
          <lpage>1658</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ontañón</surname>
          </string-name>
          ,
          <article-title>Combinatorial multi-armed bandits for real-time strategy games</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>58</volume>
          (
          <year>2017</year>
          )
          <fpage>665</fpage>
          -
          <lpage>702</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>