<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>AIQ x QIA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Application of Reinforcement Learning for Minor Embedding in Quantum Annealing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riccardo Nembrini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maurizio Ferrari Dacrema</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Cremonesi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Milano</institution>
          ,
          <addr-line>Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2</volume>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>Research in the Quantum Computing (QC) field has been soaring thanks to the latest developments and the wider availability of real hardware. The strong interest in this technology has naturally spurred cross-fertilization with the Machine Learning (ML) field: both quantum methods to perform ML and ML methods to support quantum computation have been developed. A widely adopted QC paradigm is that of Quantum Annealers, machines that can rapidly search for solutions to optimization problems. Their sparse qubit structure, however, requires searching for a mapping between the problem's graph and the hardware's graph before computation. This is an NP-hard combinatorial optimization task in itself, called Minor Embedding. In this work, we aim to develop and assess the capabilities of Reinforcement Learning to perform this task.</p>
      </abstract>
      <kwd-group>
<kwd>Quantum Computing</kwd>
        <kwd>Quantum Annealing</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Proximal Policy Optimization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Reinforcement Learning for Minor Embedding</title>
<p>In this section, we describe the components of our agent, called RLME: the environment and
its state, the possible actions, and the reward function. The objective of the agent is to
perform ME, mapping a problem graph G to a hardware graph H. The interaction loop between agent
and environment involves mapping one node from G to a node in H at each step. The G node to map
is chosen in a round-robin fashion, while the H node is chosen by the agent’s policy from the set of
selectable nodes, i.e., qubits not yet assigned to a problem variable that are adjacent to the qubits
already mapped to the same variable (its chain, if present). Therefore, the environment’s state includes
information about both G and H. An observation of the state, obtained by the agent, is a 1-dimensional
array composed of contiguous sections representing different aspects of the state. In a section, each
cell corresponds to a single node in G or H, with a predefined mapping consistent among all sections
referring to the same graph. After selecting the round-robin G node, the observation’s sections are the
following:
• a one-hot encoding indicating the current round-robin G node,
• one component for each existing qubit, with value 1 if the qubit is part of the current round-robin
node’s chain, 0 otherwise,
• one component for each existing qubit, with value 1 if the qubit is selectable (not yet mapped and
adjacent to the chain of the round-robin node, if present), 0 otherwise,
• for each G node, the number of connections with other G nodes that are missing in the mapping.
In summary, given an intermediate state of the ME process, the agent is aware of the next G node
for which to map a new qubit and of its current chain in the mapping, of the selectable qubits, and of the
number of missing connections between chains for the mapping to be valid; a minimal sketch of this
encoding follows.</p>
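      <p>As an illustration, the following is a minimal sketch of how such an observation could be assembled from NumPy adjacency matrices. The function and argument names are our own assumptions for illustration; the paper does not prescribe an implementation.</p>
      <preformat>
import numpy as np

def build_observation(G_adj, H_adj, current, chains, mapped):
    # G_adj: (n, n) adjacency matrix of the problem graph G
    # H_adj: (q, q) adjacency matrix of the hardware graph H
    # current: index of the round-robin G node being extended
    # chains: list of sets, chains[i] = qubits mapped to G node i
    # mapped: set of all qubits already assigned to any chain
    n, q = len(G_adj), len(H_adj)

    # Section 1: one-hot encoding of the current round-robin G node.
    one_hot = np.zeros(n)
    one_hot[current] = 1.0

    # Section 2: qubits belonging to the current node's chain.
    chain = np.zeros(q)
    for qb in chains[current]:
        chain[qb] = 1.0

    # Section 3: selectable qubits, i.e., free qubits adjacent to the
    # current chain (any free qubit while the chain is still empty).
    selectable = np.zeros(q)
    for qb in range(q):
        if qb in mapped:
            continue
        if not chains[current] or any(H_adj[qb, c] for c in chains[current]):
            selectable[qb] = 1.0

    # Section 4: per G node, edges still missing between the mapped chains.
    missing = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if G_adj[i, j] and not any(
                H_adj[a, b] for a in chains[i] for b in chains[j]
            ):
                missing[i] += 1

    return np.concatenate([one_hot, chain, selectable, missing])
      </preformat>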
      <p>
        The action performed by the agent is the choice of one of the selectable qubits. After the action,
which is determined by a policy on the observation of the state, the agent receives a reward from the
environment. Depending on the objectives of an agent, one could design different kinds of rewards. In
this work, the focus is on obtaining the shortest possible chains, therefore the rewards corresponding
to each action are fixed and negative. Thus, maximizing the cumulative reward teaches the agent
how to build minor embeddings with fewer nodes. Agent training is performed using the Proximal
Policy Optimization RL algorithm [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which learns the policy with Deep Neural Networks. In order to
rule out non-selectable qubits from the possible actions, we also use Invalid Action Masking [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
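      <p>To make the loop concrete, here is a minimal, self-contained sketch of a masked training run. The environment class, its termination logic, and the toy graphs are our own illustrative assumptions (the paper only specifies Stable-Baselines3 with default hyperparameters); for the masking we use MaskablePPO from the companion sb3-contrib package, one common way to apply Invalid Action Masking on top of Stable-Baselines3.</p>
      <preformat>
import gymnasium as gym
import numpy as np
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

class MinorEmbeddingEnv(gym.Env):
    # Hypothetical minimal ME environment: one selectable qubit per step,
    # fixed reward of -1 per action, as described in the text.
    def __init__(self, G_adj, H_adj):
        super().__init__()
        self.G, self.H = np.asarray(G_adj), np.asarray(H_adj)
        self.n, self.q = len(self.G), len(self.H)
        self.action_space = gym.spaces.Discrete(self.q)
        self.observation_space = gym.spaces.Box(
            0.0, np.inf, (2 * self.n + 2 * self.q,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.chains = [set() for _ in range(self.n)]
        self.mapped, self.current = set(), 0
        return self._obs(), {}

    def _obs(self):
        # build_observation is the helper from the previous sketch.
        return build_observation(self.G, self.H, self.current,
                                 self.chains, self.mapped).astype(np.float32)

    def action_masks(self):
        # True for free qubits adjacent to the current chain
        # (or for any free qubit while the chain is still empty).
        mask = np.zeros(self.q, dtype=bool)
        chain = self.chains[self.current]
        for qb in range(self.q):
            if qb not in self.mapped:
                mask[qb] = not chain or any(self.H[qb, c] for c in chain)
        return mask

    def step(self, action):
        self.chains[self.current].add(int(action))
        self.mapped.add(int(action))
        self.current = (self.current + 1) % self.n
        obs = self._obs()
        # Valid embedding: every chain is non-empty and no G edge is missing.
        done = bool(obs[-self.n:].sum() == 0) and all(self.chains)
        dead = not done and not self.action_masks().any()
        return obs, -1.0, done, dead, {}

G_adj = np.ones((3, 3)) - np.eye(3)  # toy problem graph: a triangle
H_adj = np.ones((8, 8)) - np.eye(8)  # toy fully connected "hardware"
env = ActionMasker(MinorEmbeddingEnv(G_adj, H_adj), lambda e: e.action_masks())
model = MaskablePPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)  # the paper budgets 1 million actions
      </preformat>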
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Protocol</title>
      <p>Experiments with the RLME agent are performed in two different scenarios. In the first one, our goal
is to understand how to build the environment and how the agent’s training process scales
with the sizes of G and H. Therefore, we train multiple agents, one for each pair of specific G and H
graphs. All the considered G graphs are fully connected and vary in the number of nodes. H graphs,
instead, vary in topology (Chimera and Zephyr [<xref ref-type="bibr" rid="ref13">13</xref>], shown in Figure 1) and number of nodes. Each
agent learns how to perform ME of a certain G graph with |G| nodes on a certain H graph, with a budget
of 1 million training actions.</p>
      <p>In the second scenario, instead, our goal is to understand whether RLME is able to generalize to unseen data
and whether learning on smaller graphs first helps when scaling. Every agent is trained to perform ME of a
synthetic dataset of varying G graphs, with different sizes and connectivity, on a specific H graph. The
dataset is built by generating all the possible non-isomorphic graphs with sizes between 3 and 7 nodes,
splitting them into training and testing sets with respectively 80% and 20% of the graphs, trying to maintain
a uniform distribution on the number of edges. Then, in order to have around 1000 graphs for each
G size, we duplicate (if there are not enough graphs) or sample (if there are more than required) the
corresponding graphs, again keeping a uniform distribution for the edges. During training, performed
with a budget of 3 million actions, the agent sees the graphs ordered according to the size of G, so that
it learns from simpler graphs first. When the dataset has been completely fed to the agent, it is shuffled
(maintaining the size ordering) and re-submitted to the agent.</p>
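      <p>A sketch of this dataset construction, under stated assumptions: we use networkx’s graph atlas, which enumerates all non-isomorphic graphs with up to 7 nodes, and we restrict it to connected graphs. Connectivity and the stratified splitting strategy are our assumptions; the paper only specifies the sizes, the 80/20 split, and the roughly uniform edge distribution.</p>
      <preformat>
import random
from collections import defaultdict
import networkx as nx

# All non-isomorphic connected graphs with 3 to 7 nodes.
atlas = [g for g in nx.graph_atlas_g()
         if g.number_of_nodes() in range(3, 8) and nx.is_connected(g)]

# Stratify by (size, edge count) so both splits keep a
# roughly uniform distribution on the number of edges.
buckets = defaultdict(list)
for g in atlas:
    buckets[(g.number_of_nodes(), g.number_of_edges())].append(g)

rng = random.Random(0)
train, test = defaultdict(list), defaultdict(list)
for (size, _), bucket in buckets.items():
    rng.shuffle(bucket)
    k = max(1, round(0.8 * len(bucket)))
    train[size] += bucket[:k]
    test[size] += bucket[k:]

def resample(graphs, target=1000):
    # Duplicate when there are too few graphs, subsample when too many.
    if len(graphs) >= target:
        return rng.sample(graphs, target)
    return graphs + [rng.choice(graphs) for _ in range(target - len(graphs))]

# Around 1000 training graphs per G size, ordered from smaller to larger.
train = {size: resample(graphs) for size, graphs in sorted(train.items())}
      </preformat>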
      <p>
        RL agents are trained with Stable-Baselines3 [<xref ref-type="bibr" rid="ref14">14</xref>], with default hyperparameters. In both scenarios,
each agent is trained 10 times with different random seeds and the testing results are averaged over
all trained models. In the first scenario, we use each trained agent to generate 100 mappings for its
respective fixed G-H pair. In the second, we use each agent to perform ME of all graphs in the testing
set on its respective H graph. We evaluate both scenarios based on how many of the generated
mappings are valid and on the number of qubits required. In both scenarios, we compare RLME results
with the general heuristic developed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To replicate our agents’ behavior, each time we use the
heuristic for the ME process we generate 100 mappings.
      </p>
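      <p>The heuristic of Cai et al. [<xref ref-type="bibr" rid="ref1">1</xref>] is distributed as the minorminer package, so the comparison protocol can be sketched as follows (the graph choices here are illustrative):</p>
      <preformat>
import minorminer
import networkx as nx
import dwave_networkx as dnx

G = nx.complete_graph(5)      # a fully connected problem graph, |G| = 5
H = dnx.zephyr_graph(2)       # Zephyr hardware graph with m = 2 (160 qubits)

embeddings = []
for seed in range(100):       # 100 mappings per G-H pair, as in the protocol
    emb = minorminer.find_embedding(G.edges, H.edges, random_seed=seed)
    if emb:                   # an empty dict means no valid mapping was found
        embeddings.append(emb)

if embeddings:
    qubits = [sum(len(chain) for chain in e.values()) for e in embeddings]
    print(f"valid: {len(embeddings)}/100, avg #Q: {sum(qubits)/len(qubits):.1f}")
      </preformat>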
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>Table 1 shows results for the first scenario described in Section 3, when training the agent on fully
connected problem graphs and performing ME on the Zephyr topology (Figure 1c), compared with the
heuristic. We performed experiments on G graphs with 3 to 7 nodes and H graphs from 160 to 2176
nodes (qubits). Notice that in all results we refer to H’s size as m, and the number of qubits can be
computed as 16m(2m + 1) for Zephyr and 8m² for Chimera. Only a slice of the
results is reported, for clarity’s sake, since the other results show similar behaviors. Note that, for
problems of this size, the heuristic can find optimal solutions, with the lowest possible number of qubits.</p>
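      <p>These counts can be checked against the dwave-networkx topology generators (an assumption on our part: the paper does not state which library built the H graphs):</p>
      <preformat>
import dwave_networkx as dnx

for m in (2, 5, 8):
    z = dnx.zephyr_graph(m)    # Zephyr Z_m: 16m(2m + 1) qubits
    c = dnx.chimera_graph(m)   # Chimera C_m: 8m^2 qubits
    assert z.number_of_nodes() == 16 * m * (2 * m + 1)
    assert c.number_of_nodes() == 8 * m ** 2
# m = 2 gives the 160 Zephyr qubits and m = 8 the 2176 reported above.
      </preformat>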
      <p>As can be seen, with a smaller H, the agent is able to precisely learn how to map nodes from the
fully-connected problem, with a number of qubits comparable to that of the heuristic. With a larger
H, the number of actions that can be chosen by the agent is higher, therefore training becomes
harder, with the agent struggling to find solutions as compact as with a smaller H. This behavior is
also found when comparing RLME used on the Zephyr and Chimera topologies. Indeed, while in Zephyr
graphs each qubit is connected to at most 20 other qubits, in Chimera graphs the maximum degree is 6.
Because of the sparser topology, it is harder for the agent to navigate the H graph and find the needed
connections. Figure 2 shows the comparison between RLME trained and used on Chimera and Zephyr
topologies, as the number of qubits in H increases. When scaling, the number of required qubits in
the mapping is drastically lower on Zephyr, while the majority of the agents on Chimera cannot find a
valid mapping. This kind of challenge is not present in the heuristic, since it chooses new nodes to add
to the mapping based on shortest-path distances, which are not influenced by the size of H, except in
terms of algorithmic complexity.</p>
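      <p>The degree gap between the two topologies can be verified on the same generators (again assuming dwave-networkx graphs):</p>
      <preformat>
import dwave_networkx as dnx

z, c = dnx.zephyr_graph(8), dnx.chimera_graph(8)
print(max(d for _, d in z.degree()))  # 20: maximum qubit degree in Zephyr
print(max(d for _, d in c.degree()))  # 6: maximum qubit degree in Chimera
      </preformat>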
      <p>[Table 1: success rate (SR%) and number of qubits (#Q) of RL and the heuristic, for |G| = 3 and m = 2, 5, 8.]</p>
      <p>When training the agents on the second scenario with the dataset, instead, the size of H does not affect
the results in the same way. Figure 3 shows the comparison between the number of qubits required
by the heuristic when performing ME on the training set and those required by RLME. As can be seen, even with
m = 8, the agent is able to obtain mappings with a number of qubits comparable to the heuristic
(around 2 qubits more for |G| = 7). This suggests that the agent learns better when seeing smaller
graphs first, exploiting the experience gained in navigating the H graph when mapping simpler G graphs.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Directions</title>
      <p>In this work we develop a Reinforcement Learning agent capable of performing Minor Embedding, a
key task when using a Quantum Annealer. We describe the components needed to train the agent and
report the results obtained in two different scenarios. From these results we conclude that the RL agent
is able to generate valid mappings in both scenarios, obtaining the best results when the training phase
is performed with different graphs, starting from simpler ones. Future directions comprise designing
and testing new reward functions and additional information to be fed to the agent, such as distances
between nodes or qubit chains. An extension making use of Graph Neural Networks to extract better
information directly from the graphs is already in the works.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We acknowledge the financial support from ICSC - “National Research Centre in High Performance
Computing, Big Data and Quantum Computing”, funded by European Union – NextGenerationEU. We
acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance
computing resources and support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. G.</given-names>
            <surname>Macready</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <article-title>A practical heuristic for finding graph minors</article-title>
          ,
          <source>CoRR abs/1406</source>
          .2741 (
          <year>2014</year>
          ). URL: http://arxiv.org/abs/1406.2741. arXiv:
          <volume>1406</volume>
          .
          <fpage>2741</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Boothby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <article-title>Fast clique minor generation in chimera qubit connectivity graphs</article-title>
          ,
          <source>Quantum Inf. Process</source>
          .
          <volume>15</volume>
          (
          <year>2016</year>
          )
          <fpage>495</fpage>
          -
          <lpage>508</lpage>
          . URL: https://doi.org/10.1007/s11128-015-1150-6. doi:
          <volume>10</volume>
          .1007/S11128-015-1150-6.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sugie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yoshida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mertig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Takemoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Teramoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          , I. Takigawa,
          <string-name>
            <given-names>S.</given-names>
            <surname>Minato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yamaoka</surname>
          </string-name>
          , T. Komatsuzaki,
          <article-title>Minor-embedding heuristics for large-scale annealing processors with sparse hardware graphs of up to 102, 400 nodes</article-title>
          , Soft Comput.
          <volume>25</volume>
          (
          <year>2021</year>
          )
          <fpage>1731</fpage>
          -
          <lpage>1749</lpage>
          . URL: https://doi.org/10.1007/s00500-020-05502-6. doi:
          <volume>10</volume>
          .1007/S00500-020-05502-6.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferrari Dacrema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Moroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nembrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          , G. Faggioli,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <article-title>Towards feature selection for ranking and classification exploiting quantum annealers</article-title>
          , in: E. Amigó,
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Carterette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          , G. Kazai (Eds.),
          <source>SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Madrid, Spain,
          <source>July 11 - 15</source>
          ,
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>2814</fpage>
          -
          <lpage>2824</lpage>
          . URL: https://doi.org/10.1145/3477495.3531755. doi:
          <volume>10</volume>
          .1145/ 3477495.3531755.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Bello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Neural combinatorial optimization with reinforcement learning</article-title>
          ,
          <source>in: 5th International Conference on Learning Representations, ICLR</source>
          <year>2017</year>
          , Toulon, France,
          <source>April 24-26</source>
          ,
          <year>2017</year>
          , Workshop Track Proceedings, OpenReview.net,
          <year>2017</year>
          . URL: https://openreview.net/forum?id=
          <fpage>Bk9mxlSFx</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mazyavkina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sviridov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ivanov</surname>
          </string-name>
          , E. Burnaev,
          <article-title>Reinforcement learning for combinatorial optimization: A survey</article-title>
          ,
          <source>Comput. Oper. Res</source>
          .
          <volume>134</volume>
          (
          <year>2021</year>
          )
          <article-title>105400</article-title>
          . URL: https://doi.org/10.1016/j.cor.
          <year>2021</year>
          .
          <volume>105400</volume>
          . doi:
          <volume>10</volume>
          .1016/J.COR.
          <year>2021</year>
          .
          <volume>105400</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Berto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Son</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          , J. Park,
          <article-title>RL4CO: an extensive reinforcement learning for combinatorial optimization benchmark</article-title>
          ,
          <source>CoRR abs/2306</source>
          .17100 (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2306.17100. doi:
          <volume>10</volume>
          .48550/ARXIV. 2306.17100. arXiv:
          <volume>2306</volume>
          .
          <fpage>17100</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moro</surname>
          </string-name>
          , M. G. A. Paris, M. Restelli, E. Prati,
          <article-title>Quantum compiling by deep reinforcement learning</article-title>
          ,
          <source>Communications Physics 4</source>
          (
          <year>2021</year>
          ). URL: http://dx.doi.org/10.1038/s42005-021-00684-3. doi:
          <volume>10</volume>
          . 1038/s42005-021-00684-3.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z. T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Quantum compiling with reinforcement learning on a superconducting processor</article-title>
          ,
          <source>CoRR abs/2406</source>
          .12195 (
          <year>2024</year>
          ). URL: https: //doi.org/10.48550/arXiv.2406.12195. doi:
          <volume>10</volume>
          .48550/ARXIV.2406.12195. arXiv:
          <volume>2406</volume>
          .
          <fpage>12195</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Foderà</surname>
          </string-name>
          , G. Turati,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nembrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Dacrema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for variational quantum circuit design</article-title>
          ,
          <source>in: Proceedings of the International Workshop on AI for Quantum and Quantum for AI</source>
          (AIQxQIA
          <year>2024</year>
          )
          <article-title>co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence</article-title>
          (AIxIA
          <year>2024</year>
          ), CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Klimov</surname>
          </string-name>
          ,
          <article-title>Proximal policy optimization algorithms</article-title>
          ,
          <source>CoRR abs/1707</source>
          .06347 (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1707.06347. arXiv:
          <volume>1707</volume>
          .
          <fpage>06347</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ontañón</surname>
          </string-name>
          ,
          <article-title>A closer look at invalid action masking in policy gradient algorithms</article-title>
          , in: R.
          <string-name>
            <surname>Barták</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Keshtkar</surname>
          </string-name>
          , M. Franklin (Eds.),
          <source>Proceedings of the Thirty-Fifth International Florida</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>