<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>AIQ x QIA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Application of Reinforcement Learning for Minor Embedding in Quantum Annealing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riccardo Nembrini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maurizio Ferrari Dacrema</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Cremonesi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Milano</institution>
          ,
          <addr-line>Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2</volume>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>Research in the Quantum Computing (QC) field has been soaring thanks to the latest developments and the wider availability of real hardware. The strong interest in this technology has naturally spurred cross-fertilization with the Machine Learning (ML) field: both quantum methods to perform ML and ML methods to support quantum computation have been developed. A widely adopted QC paradigm is that of Quantum Annealers, machines that can rapidly search for solutions to optimization problems. Their sparse qubit structure, however, requires searching for a mapping between the problem's graph and the hardware's graph before computation. This is an NP-hard combinatorial optimization task in itself, called Minor Embedding. In this work, we aim to develop and assess the capabilities of Reinforcement Learning to perform this task.</p>
      </abstract>
      <kwd-group>
<kwd>Quantum Computing</kwd>
        <kwd>Quantum Annealing</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Proximal Policy Optimization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Reinforcement Learning for Minor Embedding</title>
<p>In this section, we describe the components of our agent, called RLME: the environment and
its state, the possible actions, and the reward function. The objective of the agent is to
perform ME, mapping a problem graph G to a hardware graph H. The interaction loop between agent
and environment involves mapping one node from G to a node in H at each step. The G node to map
is chosen in a round-robin fashion, while the H node is chosen by the agent’s policy from the set of
selectable nodes, i.e., qubits not yet assigned to a problem variable that are adjacent to the qubits
already mapped to the same variable (its chain, if present). Therefore, the environment’s state includes
information about both G and H. An observation of the state, obtained by the agent, is a 1-dimensional
array composed of contiguous sections representing different aspects of the state. In a section, each
cell corresponds to a single node in G or H, with a predefined mapping consistent among all sections
referring to the same graph. After selecting the round-robin G node, the observation’s sections are the
following:
• a one-hot encoding indicating the current round-robin G node,
• one component for each existing qubit, with value 1 if the qubit is part of the current round-robin
node’s chain, 0 otherwise,
• one component for each existing qubit, with value 1 if the qubit is selectable (not yet mapped and
adjacent to the chain of the round-robin node, if present), 0 otherwise,
• for each G node, the number of connections with other G nodes that are missing in the mapping.
In summary, given an intermediate state of the ME process, the agent is aware of the next G node
for which to map a new qubit and of its current chain in the mapping, of the selectable qubits, and of the
number of missing connections between chains for the mapping to be valid; a minimal sketch of this
encoding follows.</p>
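      <p>As an illustration, the following is a minimal sketch of how such an observation could be assembled from NumPy adjacency matrices. The function and argument names are our own assumptions for illustration; the paper does not prescribe an implementation.</p>
      <preformat>
import numpy as np

def build_observation(G_adj, H_adj, current, chains, mapped):
    # G_adj: (n, n) adjacency matrix of the problem graph G
    # H_adj: (q, q) adjacency matrix of the hardware graph H
    # current: index of the round-robin G node being extended
    # chains: list of sets, chains[i] = qubits mapped to G node i
    # mapped: set of all qubits already assigned to any chain
    n, q = len(G_adj), len(H_adj)

    # Section 1: one-hot encoding of the current round-robin G node.
    one_hot = np.zeros(n)
    one_hot[current] = 1.0

    # Section 2: qubits belonging to the current node's chain.
    chain = np.zeros(q)
    for qb in chains[current]:
        chain[qb] = 1.0

    # Section 3: selectable qubits, i.e., free qubits adjacent to the
    # current chain (any free qubit while the chain is still empty).
    selectable = np.zeros(q)
    for qb in range(q):
        if qb in mapped:
            continue
        if not chains[current] or any(H_adj[qb, c] for c in chains[current]):
            selectable[qb] = 1.0

    # Section 4: per G node, edges still missing between the mapped chains.
    missing = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if G_adj[i, j] and not any(
                H_adj[a, b] for a in chains[i] for b in chains[j]
            ):
                missing[i] += 1

    return np.concatenate([one_hot, chain, selectable, missing])
      </preformat>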
      <p>
        The action performed by the agent is the choice of one of the selectable qubits. After the action,
which is determined by a policy on the observation of the state, the agent receives a reward from the
environment. Depending on the objectives of an agent, one could design different kinds of rewards. In
this work, the focus is on obtaining the shortest possible chains, therefore the rewards corresponding
to each action are fixed and negative. Thus, maximizing the cumulative reward teaches the agent
how to build minor embeddings with fewer nodes. Agent training is performed using the Proximal
Policy Optimization RL algorithm [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which learns the policy with Deep Neural Networks. In order to
rule out non-selectable qubits from the possible actions, we also use Invalid Action Masking [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
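      <p>To make the loop concrete, here is a minimal, self-contained sketch of a masked training run. The environment class, its termination logic, and the toy graphs are our own illustrative assumptions (the paper only specifies Stable-Baselines3 with default hyperparameters); for the masking we use MaskablePPO from the companion sb3-contrib package, one common way to apply Invalid Action Masking on top of Stable-Baselines3.</p>
      <preformat>
import gymnasium as gym
import numpy as np
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

class MinorEmbeddingEnv(gym.Env):
    # Hypothetical minimal ME environment: one selectable qubit per step,
    # fixed reward of -1 per action, as described in the text.
    def __init__(self, G_adj, H_adj):
        super().__init__()
        self.G, self.H = np.asarray(G_adj), np.asarray(H_adj)
        self.n, self.q = len(self.G), len(self.H)
        self.action_space = gym.spaces.Discrete(self.q)
        self.observation_space = gym.spaces.Box(
            0.0, np.inf, (2 * self.n + 2 * self.q,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.chains = [set() for _ in range(self.n)]
        self.mapped, self.current = set(), 0
        return self._obs(), {}

    def _obs(self):
        # build_observation is the helper from the previous sketch.
        return build_observation(self.G, self.H, self.current,
                                 self.chains, self.mapped).astype(np.float32)

    def action_masks(self):
        # True for free qubits adjacent to the current chain
        # (or for any free qubit while the chain is still empty).
        mask = np.zeros(self.q, dtype=bool)
        chain = self.chains[self.current]
        for qb in range(self.q):
            if qb not in self.mapped:
                mask[qb] = not chain or any(self.H[qb, c] for c in chain)
        return mask

    def step(self, action):
        self.chains[self.current].add(int(action))
        self.mapped.add(int(action))
        self.current = (self.current + 1) % self.n
        obs = self._obs()
        # Valid embedding: every chain is non-empty and no G edge is missing.
        done = bool(obs[-self.n:].sum() == 0) and all(self.chains)
        dead = not done and not self.action_masks().any()
        return obs, -1.0, done, dead, {}

G_adj = np.ones((3, 3)) - np.eye(3)  # toy problem graph: a triangle
H_adj = np.ones((8, 8)) - np.eye(8)  # toy fully connected "hardware"
env = ActionMasker(MinorEmbeddingEnv(G_adj, H_adj), lambda e: e.action_masks())
model = MaskablePPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)  # the paper budgets 1 million actions
      </preformat>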
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Protocol</title>
      <p>Experiments with the RLME agent are performed in two different scenarios. In the first one, our goal
is to understand how to build the environment and how the agent’s training process scales
with the sizes of G and H. Therefore, we train multiple agents, one for each pair of specific G and H
graphs. All the considered G graphs are fully connected and vary in the number of nodes. H graphs,
instead, vary in topology (Chimera and Zephyr [<xref ref-type="bibr" rid="ref13">13</xref>], shown in Figure 1) and number of nodes. Each
agent learns how to perform ME of a certain G graph with |G| nodes on a certain H graph, with a budget
of 1 million training actions.</p>
      <p>In the second scenario, instead, our goal is to understand whether RLME is able to generalize to unseen data
and whether learning on smaller graphs first helps when scaling. Every agent is trained to perform ME of a
synthetic dataset of varying G graphs, with different sizes and connectivity, on a specific H graph. The
dataset is built by generating all the possible non-isomorphic graphs with sizes between 3 and 7 nodes,
splitting them into training and testing sets with respectively 80% and 20% of the graphs, trying to maintain
a uniform distribution on the number of edges. Then, in order to have around 1000 graphs for each
G size, we duplicate (if there are not enough graphs) or sample (if there are more than required) the
corresponding graphs, again keeping a uniform distribution for the edges. During training, performed
with a budget of 3 million actions, the agent sees the graphs ordered according to the size of G, so that
it learns from simpler graphs first. When the dataset has been completely fed to the agent, it is shuffled
(maintaining the size ordering) and re-submitted to the agent.</p>
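      <p>A sketch of this dataset construction, under stated assumptions: we use networkx’s graph atlas, which enumerates all non-isomorphic graphs with up to 7 nodes, and we restrict it to connected graphs. Connectivity and the stratified splitting strategy are our assumptions; the paper only specifies the sizes, the 80/20 split, and the roughly uniform edge distribution.</p>
      <preformat>
import random
from collections import defaultdict
import networkx as nx

# All non-isomorphic connected graphs with 3 to 7 nodes.
atlas = [g for g in nx.graph_atlas_g()
         if g.number_of_nodes() in range(3, 8) and nx.is_connected(g)]

# Stratify by (size, edge count) so both splits keep a
# roughly uniform distribution on the number of edges.
buckets = defaultdict(list)
for g in atlas:
    buckets[(g.number_of_nodes(), g.number_of_edges())].append(g)

rng = random.Random(0)
train, test = defaultdict(list), defaultdict(list)
for (size, _), bucket in buckets.items():
    rng.shuffle(bucket)
    k = max(1, round(0.8 * len(bucket)))
    train[size] += bucket[:k]
    test[size] += bucket[k:]

def resample(graphs, target=1000):
    # Duplicate when there are too few graphs, subsample when too many.
    if len(graphs) >= target:
        return rng.sample(graphs, target)
    return graphs + [rng.choice(graphs) for _ in range(target - len(graphs))]

# Around 1000 training graphs per G size, ordered from smaller to larger.
train = {size: resample(graphs) for size, graphs in sorted(train.items())}
      </preformat>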
      <p>
        RL agents are trained with Stable-Baselines3 [<xref ref-type="bibr" rid="ref14">14</xref>], with default hyperparameters. In both scenarios,
each agent is trained 10 times with different random seeds and the testing results are averaged over
all trained models. In the first scenario, we use each trained agent to generate 100 mappings for its
respective fixed G-H pair. In the second, we use each agent to perform ME of all graphs in the testing
set on its respective H graph. We evaluate both scenarios based on how many of the generated
mappings are valid and on the number of qubits required. In both scenarios, we compare RLME results
with the general heuristic developed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To replicate our agents’ behavior, each time we use the
heuristic for the ME process we generate 100 mappings.
      </p>
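      <p>The heuristic of Cai et al. [<xref ref-type="bibr" rid="ref1">1</xref>] is distributed as the minorminer package, so the comparison protocol can be sketched as follows (the graph choices here are illustrative):</p>
      <preformat>
import minorminer
import networkx as nx
import dwave_networkx as dnx

G = nx.complete_graph(5)      # a fully connected problem graph, |G| = 5
H = dnx.zephyr_graph(2)       # Zephyr hardware graph with m = 2 (160 qubits)

embeddings = []
for seed in range(100):       # 100 mappings per G-H pair, as in the protocol
    emb = minorminer.find_embedding(G.edges, H.edges, random_seed=seed)
    if emb:                   # an empty dict means no valid mapping was found
        embeddings.append(emb)

if embeddings:
    qubits = [sum(len(chain) for chain in e.values()) for e in embeddings]
    print(f"valid: {len(embeddings)}/100, avg #Q: {sum(qubits)/len(qubits):.1f}")
      </preformat>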
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>Table 1 shows results for the first scenario described in Section 3, when training the agent on fully
connected problem graphs and performing ME on the Zephyr topology (Figure 1c), compared with the
heuristic. We performed experiments on G graphs with 3 to 7 nodes and H graphs from 160 to 2176
nodes (qubits). Notice that in all results we refer to H’s size as m, and the number of qubits can be
computed as 16m(2m + 1) for Zephyr and 8m² for Chimera. Only a slice of the
results is reported, for clarity’s sake, since the other results show similar behaviors. Note that, for
problems of this size, the heuristic can find optimal solutions, with the lowest possible number of qubits.</p>
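      <p>These counts can be checked against the dwave-networkx topology generators (an assumption on our part: the paper does not state which library built the H graphs):</p>
      <preformat>
import dwave_networkx as dnx

for m in (2, 5, 8):
    z = dnx.zephyr_graph(m)    # Zephyr Z_m: 16m(2m + 1) qubits
    c = dnx.chimera_graph(m)   # Chimera C_m: 8m^2 qubits
    assert z.number_of_nodes() == 16 * m * (2 * m + 1)
    assert c.number_of_nodes() == 8 * m ** 2
# m = 2 gives the 160 Zephyr qubits and m = 8 the 2176 reported above.
      </preformat>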
      <p>As can be seen, with a smaller H, the agent is able to precisely learn how to map nodes from the
fully-connected problem, with a number of qubits comparable to that of the heuristic. With a larger
H, the number of actions that can be chosen by the agent is higher, therefore training becomes
harder, with the agent struggling to find solutions as compact as with a smaller H. This behavior is
also found when comparing RLME used on the Zephyr and Chimera topologies. Indeed, while in Zephyr
graphs each qubit is connected to at most 20 other qubits, in Chimera graphs the maximum degree is 6.
Because of the sparser topology, it is harder for the agent to navigate the H graph and find the needed
connections. Figure 2 shows the comparison between RLME trained and used on Chimera and Zephyr
topologies, as the number of qubits in H increases. When scaling, the number of required qubits in
the mapping is drastically lower on Zephyr, while the majority of the agents on Chimera cannot find a
valid mapping. This kind of challenge is not present in the heuristic, since it chooses new nodes to add
to the mapping based on shortest-path distances, which are not influenced by the size of H, except in
terms of algorithmic complexity.</p>
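      <p>The degree gap between the two topologies can be verified on the same generators (again assuming dwave-networkx graphs):</p>
      <preformat>
import dwave_networkx as dnx

z, c = dnx.zephyr_graph(8), dnx.chimera_graph(8)
print(max(d for _, d in z.degree()))  # 20: maximum qubit degree in Zephyr
print(max(d for _, d in c.degree()))  # 6: maximum qubit degree in Chimera
      </preformat>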
      <p>[Table 1: success rate (SR%) and number of qubits (#Q) of RL and the heuristic, for |G| = 3 and m = 2, 5, 8.]</p>
      <p>When training the agents on the second scenario with the dataset, instead, the size of H does not affect
the results in the same way. Figure 3 shows the comparison between the number of qubits required
by the heuristic when performing ME on the training set and those required by RLME. As can be seen, even with
m = 8, the agent is able to obtain mappings with a number of qubits comparable to the heuristic
(around 2 qubits more for |G| = 7). This suggests that the agent learns better when seeing smaller
graphs first, exploiting the experience gained in navigating the H graph when mapping simpler G graphs.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Directions</title>
      <p>In this work we develop a Reinforcement Learning agent capable of performing Minor Embedding, a
key task when using a Quantum Annealer. We describe the components needed to train the agent and
report the results obtained in two different scenarios. From these results we conclude that the RL agent
is able to generate valid mappings in both scenarios, obtaining the best results when the training phase
is performed with different graphs, starting from simpler ones. Future directions comprise designing
and testing new reward functions and additional information to be fed to the agent, such as distances
between nodes or qubit chains. An extension making use of Graph Neural Networks to extract better
information directly from the graphs is already in the works.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We acknowledge the financial support from ICSC - “National Research Centre in High Performance
Computing, Big Data and Quantum Computing”, funded by European Union – NextGenerationEU. We
acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance
computing resources and support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. G.</given-names>
            <surname>Macready</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <article-title>A practical heuristic for finding graph minors</article-title>
          ,
          <source>CoRR abs/1406</source>
          .2741 (
          <year>2014</year>
          ). URL: http://arxiv.org/abs/1406.2741. arXiv:
          <volume>1406</volume>
          .
          <fpage>2741</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Boothby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <article-title>Fast clique minor generation in chimera qubit connectivity graphs</article-title>
          ,
          <source>Quantum Inf. Process</source>
          .
          <volume>15</volume>
          (
          <year>2016</year>
          )
          <fpage>495</fpage>
          -
          <lpage>508</lpage>
          . URL: https://doi.org/10.1007/s11128-015-1150-6. doi:
          <volume>10</volume>
          .1007/S11128-015-1150-6.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sugie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yoshida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mertig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Takemoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Teramoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          , I. Takigawa,
          <string-name>
            <given-names>S.</given-names>
            <surname>Minato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yamaoka</surname>
          </string-name>
          , T. Komatsuzaki,
          <article-title>Minor-embedding heuristics for large-scale annealing processors with sparse hardware graphs of up to 102, 400 nodes</article-title>
          , Soft Comput.
          <volume>25</volume>
          (
          <year>2021</year>
          )
          <fpage>1731</fpage>
          -
          <lpage>1749</lpage>
          . URL: https://doi.org/10.1007/s00500-020-05502-6. doi:
          <volume>10</volume>
          .1007/S00500-020-05502-6.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferrari Dacrema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Moroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nembrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          , G. Faggioli,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <article-title>Towards feature selection for ranking and classification exploiting quantum annealers</article-title>
          , in: E. Amigó,
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Carterette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          , G. Kazai (Eds.),
          <source>SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Madrid, Spain,
          <source>July 11 - 15</source>
          ,
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>2814</fpage>
          -
          <lpage>2824</lpage>
          . URL: https://doi.org/10.1145/3477495.3531755. doi:
          <volume>10</volume>
          .1145/ 3477495.3531755.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Bello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Neural combinatorial optimization with reinforcement learning</article-title>
          ,
          <source>in: 5th International Conference on Learning Representations, ICLR</source>
          <year>2017</year>
          , Toulon, France,
          <source>April 24-26</source>
          ,
          <year>2017</year>
          , Workshop Track Proceedings, OpenReview.net,
          <year>2017</year>
          . URL: https://openreview.net/forum?id=
          <fpage>Bk9mxlSFx</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mazyavkina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sviridov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ivanov</surname>
          </string-name>
          , E. Burnaev,
          <article-title>Reinforcement learning for combinatorial optimization: A survey</article-title>
          ,
          <source>Comput. Oper. Res</source>
          .
          <volume>134</volume>
          (
          <year>2021</year>
          )
          <article-title>105400</article-title>
          . URL: https://doi.org/10.1016/j.cor.
          <year>2021</year>
          .
          <volume>105400</volume>
          . doi:
          <volume>10</volume>
          .1016/J.COR.
          <year>2021</year>
          .
          <volume>105400</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Berto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Son</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          , J. Park,
          <article-title>RL4CO: an extensive reinforcement learning for combinatorial optimization benchmark</article-title>
          ,
          <source>CoRR abs/2306</source>
          .17100 (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2306.17100. doi:
          <volume>10</volume>
          .48550/ARXIV. 2306.17100. arXiv:
          <volume>2306</volume>
          .
          <fpage>17100</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moro</surname>
          </string-name>
          , M. G. A. Paris, M. Restelli, E. Prati,
          <article-title>Quantum compiling by deep reinforcement learning</article-title>
          ,
          <source>Communications Physics 4</source>
          (
          <year>2021</year>
          ). URL: http://dx.doi.org/10.1038/s42005-021-00684-3. doi:
          <volume>10</volume>
          . 1038/s42005-021-00684-3.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z. T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Quantum compiling with reinforcement learning on a superconducting processor</article-title>
          ,
          <source>CoRR abs/2406</source>
          .12195 (
          <year>2024</year>
          ). URL: https: //doi.org/10.48550/arXiv.2406.12195. doi:
          <volume>10</volume>
          .48550/ARXIV.2406.12195. arXiv:
          <volume>2406</volume>
          .
          <fpage>12195</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Foderà</surname>
          </string-name>
          , G. Turati,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nembrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Dacrema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for variational quantum circuit design</article-title>
          ,
          <source>in: Proceedings of the International Workshop on AI for Quantum and Quantum for AI</source>
          (AIQxQIA
          <year>2024</year>
          )
          <article-title>co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence</article-title>
          (AIxIA
          <year>2024</year>
          ), CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Klimov</surname>
          </string-name>
          ,
          <article-title>Proximal policy optimization algorithms</article-title>
          ,
          <source>CoRR abs/1707</source>
          .06347 (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1707.06347. arXiv:
          <volume>1707</volume>
          .
          <fpage>06347</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ontañón</surname>
          </string-name>
          ,
          <article-title>A closer look at invalid action masking in policy gradient algorithms</article-title>
          , in: R.
          <string-name>
            <surname>Barták</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Keshtkar</surname>
          </string-name>
          , M. Franklin (Eds.),
          <source>Proceedings of the Thirty-Fifth International Florida</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>