<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IPGPT: Solving Integer Programming Problems with Sequence to Contrastive Multi-Label Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shufeng Kong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caihua Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carla Gomes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cornell University</institution>
          ,
          <addr-line>Ithaca</addr-line>
          ,
          <country country="US">U.S.A</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Guilin University of Electronic Technology</institution>
          ,
          <addr-line>Guilin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sun Yat-sen University</institution>
          ,
          <addr-line>Zhuhai Campus</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>Integer Programming (IP) is an essential class of combinatorial optimization problems (COPs). Its inherent NP-hardness has fostered considerable efforts towards the development of heuristic strategies. An emerging approach involves leveraging data-driven methods to automatically learn these heuristics. For example, using deep (reinforcement) learning to recurrently reoptimize an initial solution with Large Neighborhood Search (LNS) has demonstrated exceptional performance across numerous applications. A pivotal challenge within LNS lies in identifying an optimal subset of variables for reoptimization at each stage. Existing methods typically learn a policy to select a subset, either by maintaining a fixed cardinality or by decomposing the subset into independent binary decisions for each variable. However, such strategies overlook the modeling of LNS's sequential processes and fail to explore the correlations inherent in variable selection. To overcome these shortcomings, we introduce IPGPT, an innovative model that reimagines policy learning as a sequence-to-multi-label classification (MLC) problem. Our approach uniquely integrates a tailored Decision Transformer encoder, incorporating a causal transformer (GPT2) to capture the sequential nature of LNS. Additionally, we employ an MLC decoder with contrastive learning to exploit the correlations in variable selection. Our extensive experiments confirm that IPGPT significantly surpasses the performance of current state-of-the-art baselines and exhibits excellent generalization to larger instances.</p>
      </abstract>
      <kwd-group>
        <kwd>Integer Programming</kwd>
        <kwd>contrastive learning</kwd>
        <kwd>causal transformer</kwd>
        <kwd>Multi-Label Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>IP has found applications in production planning [1], scheduling [2], scientific discovery [3], and telecommunications networks [4], among many others. It is well-known that IP is NP-complete [5] and many efforts have been devoted to designing effective heuristics to find near-optimal solutions [6]. Historically, such algorithms were designed largely manually, requiring a careful understanding of the underlying structure within specific classes of optimization problems.</p>
      <p>Due to the recent success of deep learning (DL) and reinforcement learning (RL), there has been an increasing interest in automatically learning heuristics for COPs from training data [7]. Existing works often leverage machine learning (ML) to output solutions directly from input instances, configure hyperparameters of COP algorithms, or learn a local decision policy for search frameworks such as branch&amp;bound (B&amp;B), local branching (LB), or LNS. Among them, we are particularly interested in learning to iteratively reoptimize an initial solution with LNS [8, 9, 10, 11]. These approaches are attractive because we can leverage existing state-of-the-art commercial IP solvers such as Gurobi or SCIP as a generic black-box subroutine and thus benefit from the cutting-edge technologies of such commercial IP solvers. In this paper, we focus on boosting the performance of LNS, though our method can also be applied to boost the performance of other local search algorithms such as LB.</p>
      <p>A key challenge of LNS is to select a promising variable subset to reoptimize based on the current solution. Since the selection choice is combinatorial, finding an optimal subset is also computationally hard. Song et al. [9] learn to select fixed, predefined variable subsets with imitation learning and RL. Wu et al. [8] learn to select arbitrary variable subsets with RL by factorizing the selection of a variable subset into elementary selections on each variable separately. Similarly, Sonnerat et al. [10] learn to predict the probability of selecting a variable independently of other variables using imitation learning, and Nair et al. [11] use RL to learn a policy that selects one variable at a time. Nevertheless, all of these works miss modeling the sequential processes of LNS and also do not exploit correlations in variable selection. To address these limitations, we propose to model the policy learning as a sequence to multi-label classification problem, which jointly models the selection of variables as well as the sequential processes of LNS.</p>
      <p>STRL’23: Second International Workshop on Spatio-Temporal Reasoning and Learning, August 2023, Macao, S.A.R.</p>
      <p>Our contributions are threefold: (1) we give a new angle of sequence to multi-label classification for learning an effective local decision policy for LNS; (2) we materialize this idea by providing a novel model to seamlessly integrate a customized decision transformer encoder to model the sequential processes of LNS and an MLC decoder with contrastive learning to exploit correlations of variable selection; (3) we conduct extensive experiments on various benchmarks and the results show that our model significantly outperforms state-of-the-art baselines. We have reserved an in-depth discussion of the related work for the appendix due to space constraints.</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Preliminaries</title>
      <sec id="sec-1b-1">
        <title>2.1. Integer Program</title>
        <p>An integer program (IP) is a problem of optimizing a linear function over points in a polyhedral set: arg min{cᵀx | Ax ≤ b; x ≥ 0; x ∈ Zⁿ}, where x ∈ Zⁿ is a vector of n decision variables; c ∈ Rⁿ denotes the vector of objective coefficients; the incidence matrix A ∈ R^{m×n} and vector b ∈ R^m together define m linear constraints.</p>
      </sec>
      <sec id="sec-1-1">
        <title>2.2. LNS and Its Markov Decision Process Formulation</title>
        <p>Given an initial assignment of values to the decision variables in an IP instance, LNS iteratively refines this assignment by selecting a subset of decision variables, relaxing their values, and solving a subproblem that aims to optimize the objective function while respecting the instance’s constraints. LNS aims to explore a complex solution neighborhood and gradually improve its current solution until a certain termination condition is met [12]. A key challenge of LNS is how to define a good solution neighborhood, namely, one needs to decide which variable subset to reoptimize given the current solution. Obviously, such a decision problem is combinatorial, and many works are devoted to constructing effective heuristics for it [13, 14, 15]. In this work, we are particularly interested in the recent trend of learning-based approaches, where data-driven methods are applied to learn the heuristics automatically [9, 8, 10]. To this end, the LNS framework can be formulated as a Markov Decision Process (MDP) (S, A, P, R):
        • S is a set of states. A state s_t ∈ S describes the current status of the LNS process in step t, which normally includes the static IP instance information (e.g., variables, constraints, and objectives) and the dynamic solving statistics (e.g., the incumbent solution);
        • A is a set of all candidate variable subsets for reoptimization. A variable subset a_t ∈ A is also called an action of an agent that is executed in step t;
        • P(s_t, a_t) is the transition function to return the next state. Let x_t be the solution with state s_t; a smaller sub-IP is first generated by keeping the values of the non-selected variables in x_t and reoptimizing the remainder, and then the next state s_{t+1} is obtained by updating s_t with the new solution to the sub-IP: x_{t+1} = arg min{cᵀx | Ax ≤ b; x ≥ 0; x ∈ Zⁿ; x_i = x_{t,i}, ∀i ∉ a_t};
        • R(s_t, a_t) is the reward function to return the change of objective values, which is defined as r_t = R(s_t, a_t) = cᵀ(x_t − x_{t+1}). Let T be the step limit; the cumulative reward from step t of an episode is defined as G_t = ∑_{t′=t}^{T} γ^{t′−t} r_{t′} with a discount factor γ ∈ [0, 1].</p>
        <p>A policy is a (potentially probabilistic) mapping π : S → A. The goal of RL-based algorithms for solving IPs is to find a policy function to maximize the expected cumulative reward E[G_1] over all episodes, i.e., the expected improvement over initial solutions. However, existing RL-based algorithms for IP solving train a policy by either temporal difference (TD) learning [16], policy gradient [17], or behavior cloning [18], all of which miss modeling the sequential processes of LNS explicitly. Furthermore, RL-based algorithms may suffer from various issues: the need for bootstrapping to propagate returns in TD learning can cause stability problems, discounting future rewards can induce undesirable short-sighted behaviors, policy gradient is known to be sample inefficient, and behavior cloning can suffer from cascading errors [19, 20, 21]. To circumvent these disadvantages, we propose to learn a policy with decision transformers, which seeks to benefit from modeling the sequential processes of LNS and better generalization.</p>
      </sec>
      <sec id="sec-1b-3">
        <title>2.3. Decision Transformer</title>
        <p>Decision transformer (DT) [21] abstracts the decision-making process in RL as a sequence modeling problem and attempts to learn a return-conditioned state-action mapping. The return-conditionality means that given a history of return-state-action tokens, such that the first token represents the desired return at the current state, the DT predicts the action required to achieve this desired return. In this paper, we follow the convention of the original DT and define the return R̂_t as the non-discounted rewards-to-go: R̂_t = ∑_{t′=t}^{T} r_{t′}. DT takes as input a sequence of three-tokens: (⟨R̂_{t−K+1}, s_{t−K+1}, a_{t−K+1}⟩, · · · , ⟨R̂_t, s_t, a_t⟩), where K ≤ T is the context length. Each token is then encoded into an embedding and added by a positional encoding. Let (⟨e^{R̂}_{t−K+1}, e^s_{t−K+1}, e^a_{t−K+1}⟩, · · · , ⟨e^{R̂}_t, e^s_t, e^a_t⟩) be the corresponding sequence of embeddings, and it is fed into a causal transformer to produce another sequence of embeddings (⟨h^{R̂}_{t−K+1}, h^s_{t−K+1}, h^a_{t−K+1}⟩, · · · , ⟨h^{R̂}_t, h^s_t, h^a_t⟩).</p>
        <p>A decoder takes as input h^s_t and outputs â_t. During training, a suitable loss function is applied to penalize the difference between the prediction â_t and the label a_t.</p>
        <p>During inference, after specifying a target return based on desired performance and the environment starting state, DT generates actions autoregressively. The actions are executed and the target return is subtracted by the achieved rewards to obtain the next states. The process of generating actions and applying them to obtain the next return-to-go and state is repeated until episode termination.</p>
      </sec>
    </sec>
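      <p>To make the reward and rewards-to-go definitions above concrete, the following self-contained Python sketch (our illustration, not the authors' code; the helper names are hypothetical) computes the per-step reward r_t = cᵀ(x_t − x_{t+1}) and the non-discounted rewards-to-go the decision transformer conditions on.</p>

```python
# Illustrative sketch of the LNS-MDP reward and the DT rewards-to-go.
# Hypothetical helpers; not the paper's implementation.

def reward(c, x_t, x_next):
    """Objective improvement r_t = c^T (x_t - x_{t+1}) when minimizing c^T x."""
    return sum(ci * (a - b) for ci, a, b in zip(c, x_t, x_next))

def rewards_to_go(rewards):
    """Non-discounted suffix sums: R_t = r_t + r_{t+1} + ... + r_T."""
    rtg = [0.0] * len(rewards)
    acc = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        acc += rewards[t]
        rtg[t] = acc
    return rtg

# Example: a trajectory of solutions x_0, x_1, x_2 that improves c^T x.
c = [3.0, 1.0, 2.0]
xs = [[2, 2, 2], [1, 2, 2], [1, 1, 1]]
rs = [reward(c, xs[t], xs[t + 1]) for t in range(len(xs) - 1)]
print(rs)                 # [3.0, 3.0]
print(rewards_to_go(rs))  # [6.0, 3.0]
```

      <p>Conditioning the first input token on a large desired rewards-to-go is what lets DT, at test time, ask for trajectories better than the ones it was trained on.</p>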
    <sec id="sec-2">
      <title>3. Solving IPs with Sequence to Contrastive Multi-Label Classification</title>
      <p>In this section, we propose a novel model IPGPT for the problem given in equation (2) based on the causal transformer and contrastive learning.</p>
      <sec id="sec-2-2">
        <title>3.1. A Novel Model IPGPT for Solving IPs</title>
        <p>Instead of learning to select fixed, predefined variable subsets [9], Wu et al. [8] factorize the combinatorial action space A into elementary actions on each dimension (i.e., variables), where a_{t,i} ∈ {0, 1} denotes the elementary action of whether to select x_i for reoptimization in step t, and a_{t,i} is 1 if x_i is selected and 0 otherwise. Therefore, any action can be expressed as a_t = ∪_{i=1}^{n} a_{t,i} and the action selection problem can be converted into n separate binary classification problems. The policy for action selection is factorized by
        π(a_t | s_t) = ∏_{i=1}^{n} π_i(a_{t,i} | s_t), (1)
        which expresses the probability of selecting an action as the product of the probabilities of selecting its elements.</p>
        <p>However, such an action space factorization limits the class of policies that can be learned and it also fails to explore the correlations between elementary actions. To address these limitations, we propose to model the policy learning as a sequence to multi-label classification problem, which jointly models the selection of multiple elementary actions as well as the sequential processes of LNS, i.e.,
        π(a_t | s_t) = f(a_t | h(g(t, K))), (2)
        where g(t, K) denotes the function to return the last sequence of return-state-action tokens from steps t − K + 1 to t, i.e., (⟨R̂_{t−K+1}, s_{t−K+1}, a_{t−K+1}⟩, · · · , ⟨R̂_t, s_t, ·⟩); h(·) denotes the sequence encoder (a neural network); and the MLC decoder f (another neural network) takes as input the state embedding h^s_t produced by h and outputs the action distribution f(a_t | h^s_t). Effective implementations of the sequence encoder and the MLC decoder are crucial to this work.</p>
        <p>Figure 1: The architecture of our model IPGPT: it consists of several token encoders to produce latent token embeddings, a causal transformer to capture dependence between token embeddings, and a contrastive MLC decoder to exploit correlations between label categories. Here we use a small IP with 4 variables and 3 constraints to show the full pipeline of our model. The problem is first translated into a factor graph G, and G is associated with dynamic factor-node features that describe the states of the MDP and are encoded into state embeddings in different steps. Similarly, returns R̂ and actions a are also encoded into latent embeddings. We use a GCN as the state encoder and two simple MLPs as the return and action encoders respectively. Each token embedding is further added with its relative positional encoding. The sequence of embeddings is fed into the causal transformer to produce another sequence of embeddings. Finally, the contrastive MLC decoder takes as inputs the state embeddings and outputs action predictions.</p>
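        <p>A minimal sketch of the factorization in eq. (1) (our illustration; the function name is hypothetical): given independent per-variable selection probabilities p_i, the probability of a selection vector is a product of Bernoulli terms.</p>

```python
# Sketch of the factorized policy of eq. (1): independent Bernoulli selections.
# Illustrative code, not the paper's implementation.

def factorized_action_prob(p, a):
    """P(a | s) = prod_i p_i^{a_i} * (1 - p_i)^{1 - a_i}."""
    prob = 1.0
    for p_i, a_i in zip(p, a):
        prob *= p_i if a_i == 1 else (1.0 - p_i)
    return prob

p = [0.9, 0.1, 0.5, 0.5]  # per-variable probabilities pi_i(a_{t,i} = 1 | s_t)
print(factorized_action_prob(p, [1, 0, 0, 1]))  # 0.9 * 0.9 * 0.5 * 0.5 ≈ 0.2025
```

        <p>Because the factors are independent, no correlated selection rule (e.g., "select x_3 exactly when x_4 is selected") can be represented, which is the limitation the joint MLC decoder of eq. (2) is designed to remove.</p>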
        <sec id="sec-2-2-1">
          <title>3.1.1. Factor Graph Representation</title>
          <p>An IP instance can be represented by a factor graph [22], which is a bipartite graph G = (V, U, E) consisting of variable-nodes V = {v_1, · · · , v_n} and factor-nodes U = {u_1, · · · , u_m}. Variable nodes correspond to the variables and factor nodes correspond to the constraints in the IP. An edge e_{ij} ∈ E between v_i and u_j is established only if the j-th constraint contains the i-th variable. The variable nodes are associated with a feature matrix X ∈ Z^{n×d_v}, where d_v is the number of features for each variable node. The features of each variable-node v_i include two parts: (1) static features: a one-hot vector indicating the node type and the objective coefficient c_i of x_i; (2) dynamic features: the current solution of x_i in step t and the incumbent solution of x_i. Note that the dynamic features are used to describe the states of the MDP in different steps. The factor nodes are also associated with a feature matrix F ∈ Z^{m×d_u}, where d_u is the number of features for each factor node. The features of each factor-node u_j only include static features: a one-hot vector indicating the node type and the value b_j at the right-hand side (RHS) of the j-th constraint. Finally, the weight matrix of the edges is exactly the incidence matrix.</p>
        </sec>
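        <p>The bipartite construction above can be sketched as follows (illustrative Python, not the paper's code; node features are omitted for brevity): variable i and constraint j are joined exactly when A has a nonzero coefficient at (j, i).</p>

```python
# Sketch: build the edge set of the variable/factor bipartite graph from the
# incidence matrix A of Ax <= b. Illustration only; features are omitted.

def build_factor_graph(A):
    """Return edges {(var_i, con_j)}: var i touches constraint j iff A[j][i] != 0."""
    m, n = len(A), len(A[0])
    edges = set()
    for j in range(m):          # factor (constraint) nodes u_j
        for i in range(n):      # variable nodes v_i
            if A[j][i] != 0:
                edges.add((i, j))
    return edges

# Small IP with 4 variables and 3 constraints, as in Figure 1.
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
print(sorted(build_factor_graph(A)))
# [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (3, 2)]
```

        <p>The nonzero entries of A then double as the edge weights, matching the statement that the weight matrix of the edges is exactly the incidence matrix.</p>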
        <sec id="sec-2-2-2">
          <title>3.1.2. Model Architecture</title>
          <p>State Encoder: a GCN takes as input the factor graph of an IP instance and learns node embeddings. Each convolution layer updates the variable-node and factor-node embeddings as
          V^{(l+1)} = LN(σ(Aᵀ U^{(l)} W_1^{(l)})), U^{(l+1)} = LN(σ(A V^{(l)} W_2^{(l)})), (3)
          where W_1^{(l)}, W_2^{(l)} ∈ R^{d_h×d_h} are trainable weight matrices in the l-th layer; V^{(l)} ∈ R^{n×d_h} and U^{(l)} ∈ R^{m×d_h} are embeddings for variable-nodes and factor-nodes respectively in the l-th layer; LN and σ(·) denote layer normalization and the Tanh activation function respectively. The initial embeddings V^{(0)} and U^{(0)} are linear projections of the raw feature matrices X and F respectively. In this paper, all MLP encoders only have two layers and the embeddings’ dimension d_h is set to 128; the GCN encoder consists of two convolution layers and a mean pooling layer.</p>
          <p>Causal Transformer: the causal transformer [24] is an architecture to efficiently model sequences that consists of stacked self-attention layers with residual connections. In our model, each layer receives a sequence of N = 3K token embeddings {e_i}_{i=1}^{N} and outputs N embeddings {h_i}_{i=1}^{N}, preserving the input dimensions. Specifically, each token embedding e_i is mapped to a key k_i, a query q_i, and a value v_i via linear functions, and the output h_i is given by
          h_i = ∑_{j=1}^{i} α_{ij} v_j, α_{ij} = exp(q_i · k_j) / ∑_{j′=1}^{i} exp(q_i · k_{j′}). (4)
          In this work, we adopt the causal transformer GPT2 [25] to learn and reason about sequences and we defer the other architecture details to the original paper.</p>
          <p>Contrastive MLC Decoder: recall that an elementary action a_{t,i} ∈ {0, 1} (a.k.a. a label) denotes whether or not to select variable x_i for reoptimization in step t. A label vector a_t = ∪_{i=1}^{n} a_{t,i} ∈ {0, 1}^n denotes the selected action given state s_t. Different from eq. (1), which approximates π(a_t | s_t) with n separate binary classification problems, we propose to approximate π(a_t | s_t) with an MLC decoder that finds a mapping from h^s_t to a_t, where h^s_t is the state embedding generated by the causal transformer and served as an input feature for our MLC decoder. A key aspect of learning a policy with an MLC module is that we can exploit the correlations between elementary actions, which is missing in the existing ML-boosted IP solvers [8, 9, 10].</p>
          <p>We propose to exploit category-level label correlation with contrastive learning based on the MLC model GMVAE [26]. GMVAE assumes that the number of labels is fixed and label embeddings are shared across all samples. This is not applicable to our case since different instances may have a different number of decision variables, i.e., actions from different instances may have different cardinalities. Alternatively, we learn category-level label embeddings for each IP instance with a shared GCN, and the embeddings are only shared across samples within each instance. Fig. 2 gives the architecture of our MLC decoder. We denote by c_i the i-th category of labels {a_{t′,i}} collected from steps [t − K + 1, t]. Our idea is as follows: (1) we learn an embedding z_i for each label category c_i such that labels within the same category share the same embedding. Since the number of label categories is exactly the number of variables in an IP instance, we use the GCN described in equation (3) to take as input an IP instance and output node embeddings, and we use the variable-node embeddings as the label category embeddings respectively; (2) we use the state embedding h^s_{t′} of each LNS step as an input feature whose inner products with the label embeddings correspond to feature-label similarity and can be used for prediction; (3) we use contrastive learning to capture correlations between label categories by pulling similar categories’ embeddings together. Specifically, let C ≡ {c_1, · · · , c_n} and define the positive label set of a_{t′} as P(a_{t′}) ≡ {c_i ∈ C | a_{t′,i} = 1}. Given a sequence of return-state-action tokens (⟨R̂_{t−K+1}, s_{t−K+1}, a_{t−K+1}⟩, · · · , ⟨R̂_t, s_t, a_t⟩), our decoder is designed to optimize the following contrastive loss function:
          L_cl = (1/K) ∑_{t′=t−K+1}^{t} (1/|P(a_{t′})|) ∑_{c_i ∈ P(a_{t′})} − log [ exp(h^s_{t′} · z_i) / ∑_{c_j ∈ C} exp(h^s_{t′} · z_j) ], (5)
          where the state embeddings h^s_{t′} for t′ ∈ [t − K + 1, t] are generated by the causal transformer and are computed as in eq. (4).</p>
          <p>For example, if in most of the actions a_{t′}, labels a_{t′,i} and a_{t′,j} often appear together (i.e., they both equal 1), contrastive learning will implicitly pull their embeddings together. In other words, if two labels do co-appear often, their label embeddings would become similar. On the other hand, if they never co-occur or only co-appear occasionally, their connections are not significant and our decoder will not optimize for their similarity.</p>
          <p>Figure 2: The architecture of our MLC decoder. Firstly, a GCN takes as input an IP instance and learns embeddings for the label categories respectively, and labels within the same category share the same embedding. Secondly, we use the state embedding in step t as an input feature whose inner products with the label embeddings are used to produce the prediction â_t. Lastly, a contrastive loss is designed to pull together the feature embedding and the positive label embeddings, while separating the feature embedding from the negative label embeddings. Note that a label embedding is positive only if its label is 1. An example with label [1, 0, 0, 1] is shown.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.2. Training Algorithm</title>
        <p>Our model is trained with supervised learning. Given a set of training IP instances, we first collect a dataset of sequences of return-state-action tokens that are generated by the MDP with some expert policy: D = {(⟨R̂_{t−K+1}, s_{t−K+1}, a_{t−K+1}⟩, · · · , ⟨R̂_t, s_t, a_t⟩)}, where K ≤ T is the length of each sequence. For each sequence τ ∈ D, our model takes as input τ and generates a set of action predictions {â_{t′}}, and we also collect the set of labels from τ, {a_{t′}}. A supervised cross-entropy loss for each sequence is given by
        L_ce = (1/K) ∑_{t′=t−K+1}^{t} (1/n) ∑_{i=1}^{n} a_{t′,i} log â_{t′,i} + (1 − a_{t′,i}) log(1 − â_{t′,i}). (6)
        The final objective function to minimize is given by
        L = L_ce − λ L_cl, (7)
        where λ is a trade-off weight. The model is trained with Adam [27] and optimized with L. We sample mini-batches of sequence length K from the dataset D and the model is trained with GPUs in parallel. However, similar to DT, our model generates action predictions autoregressively during testing. We refer the reader to section 2.3 for more details.</p>
      </sec>
      <sec id="sec-2-8">
        <title>4. Experiments</title>
        <p>We conduct experiments on four diverse NP-hard COP benchmarks, including minimum vertex cover (MVC), maximum cut (MC), set covering (SC), and combinatorial auction (CA). We follow the experimental settings of [9] and [8].</p>
        <sec id="sec-2-9">
          <title>4.1. Datasets and Experimental Setup</title>
          <p>Datasets. MVC and MC are graph optimization problems. For MVC, we use the Erdős–Rényi (ER) model [28] to generate random graphs with 1000 nodes and an edge probability of 0.15. For MC, we use the Barabasi-Albert (BA) model [29] to generate random graphs with 500 nodes and an average degree of 4. For SC, we generate instances with matrices having 5000 rows and 1000 columns following the procedure in [30], where each entry ∈ {0, 1} represents whether the i-th element in the universe belongs to the j-th set. For CA, we use the Combinatorial Auction Test Suite (CATS) [31] with arbitrary relationships to generate instances with 2000 items and 4000 bids. For each problem type, we generate 100, 20, and 50 instances for training, validation, and testing, respectively. For each training instance, we use LNS with a random policy to run on it for 100 steps and set a time limit of 2s for solving the sub-IP in each step with Gurobi. We run it 30 times and collect the trajectory with the best objective values for training IPGPT, and we randomly sample 20 sequences of return-state-action tokens with length K = 25 from each trajectory. Therefore, our dataset for training IPGPT includes 2000 sequences of three-tokens.</p>
          <p>Initialization. LNS starts with a feasible initial solution. For MVC, MAXCUT, SC, and CATS, we initialize a feasible solution by including all vertices in the cover set, randomly partitioning all vertices into two complementary sets, including all sets in the set cover, and accepting no bids, respectively. The initialization process does not incur additional computational costs in our experiments.</p>
          <p>Implementation. The return, action, and position encoders are simple MLPs with 2 layers and 128 hidden nodes. The state encoder is a GCN with 2 convolution layers and a mean pooling layer. We use GPT2 as our causal transformer. The dimensions of all hidden embeddings are set to 128. We set the batch size and the number of training epochs to 128 and 100, respectively, for all experiments. Our model was implemented with the PyTorch deep learning framework and the whole model was trained using the Adam optimizer [27] with a learning rate of 0.0001 and a weight decay ratio of 0.01 in an end-to-end fashion. All experiments were carried out on a machine with a 4.2 GHz quad-core Intel i7 CPU, 16 GB RAM, and an Nvidia RTX 3090 24GB GPU card. Further implementation details can be found in the appendix.</p>
          <p>Baselines. We compare our method with four baselines: (1) Gurobi (version 9.5) with default settings: a leading state-of-the-art IP solver; (2) FT-LNS: the best-performing LNS version by [9], which applies imitation learning to mimic the best demonstrations; (3) RL-LNS: the current state-of-the-art learning-based LNS method for solving IPs [8], which uses deep RL to learn an LNS policy via action factorization to represent all potential variable subsets; (4) LB-SRMRL: the best-performing LB version by [32], which uses a regression model and RL to learn a hybrid model to predict and adapt the neighborhood size for the LB heuristic. We follow the default settings of these learning-based baselines and further fine-tune them on our datasets to get the best hyperparameters. For more details of the settings of these baselines, we refer the reader to their original papers.</p>
          <p>Evaluation Metrics. The performances of different algorithms are compared in two measures: (1) the objective of solutions returned by different algorithms within a time limit; (2) the gap between solutions, namely, the ratio of the objective difference w.r.t. the best result.</p>
        </sec>
        <sec id="sec-2-9b">
          <title>4.2. Experimental Results</title>
          <p>A comparison of IPGPT and other state-of-the-art baselines on the 4 diverse benchmarks is given in Table 6. On the CA-2000 and MVC-2000 benchmarks, the objectives (Obj. ± Std.%) are: Gurobi −111668 ± 2.0 and 392.5 ± 1.3; FT-LNS −110041 ± 1.6 and 390.5 ± 1.1; RL-LNS −111787 ± 2.6 and 375.8 ± 2.1; LB-SRMRL −110741 ± 3.1 and 395.2 ± 1.9; IPGPT −114535 ± 1.8 and 365.1 ± 1.1. All learning-based algorithms, including our IPGPT, call Gurobi to solve sub-IPs with a time limit of 2s at every step. We can observe that LB-SRMRL is not comparable to the other algorithms. RL-LNS remains the most competitive baseline and consistently outperforms both Gurobi and FT-LNS. Our IPGPT consistently outperforms RL-LNS and Gurobi on all benchmarks by averaged ratios of 2.41% and 3.68%, respectively. Overall, these results suggest that our approach can reliably offer substantial improvements over state-of-the-art solvers.</p>
          <p>We also compare the generalization ability of all algorithms to solve large IPs. To this end, we generate two sets of testing instances following the same settings as in section 4.1 but double and quadruple the number of variables respectively. Note that we only generate 50 testing instances for each problem type without considering training and validation. We test all (trained) models on these new instances and summarize the results in Tables 7 and 3. We can observe that the advantage of our IPGPT is still preserved on larger problem instances compared to the baselines. Specifically, Table 7 shows that IPGPT still consistently outperforms all baselines on the 4 benchmarks when the instance size is doubled; IPGPT outperforms RL-LNS and Gurobi by averaged ratios of 2.06% and 4.56% respectively. On the other hand, Table 3 shows that IPGPT outperforms all baselines on 3 out of 4 benchmarks when the instance size is quadrupled; IPGPT outperforms RL-LNS and Gurobi by averaged ratios of 1.12% and 3.03% respectively. In summary, our IPGPT learned on small instances generalizes well to larger instances, with a persistent advantage over other methods.</p>
        </sec>
      </sec>
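      <p>Evaluation metric (2) can be made concrete with a small sketch (our illustration; `gap` is a hypothetical helper), using the MVC-2000 objectives reported above, where IPGPT's 365.1 is the best result.</p>

```python
# Sketch of the gap metric: relative objective difference w.r.t. the best
# result, in percent (for a minimization objective). Illustration only.

def gap(obj, best_obj):
    """100 * |obj - best| / |best|."""
    return 100.0 * abs(obj - best_obj) / abs(best_obj)

best = 365.1                        # IPGPT's MVC-2000 objective
print(round(gap(392.5, best), 2))   # Gurobi vs. the best result
print(round(gap(375.8, best), 2))   # RL-LNS vs. the best result
```
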
      <sec id="sec-2-10">
        <title>4.3. Anytime Performance</title>
        <p>We further showcase the anytime performance of various algorithms, including random LNS in this experiment, to facilitate an easier comparison between random LNS and IPGPT across the four benchmarks, as illustrated in Figure 3. Our observations indicate that: (1) IPGPT significantly outperforms the other baselines with a noteworthy margin. (2) Even with extended time limits, IPGPT’s advantage persists. (3) Interestingly, even though IPGPT is trained on trajectories derived from random LNS, it can generate remarkably superior trajectories during testing, leading to substantially improved performance compared to random LNS. This mirrors the findings in [21], where the Decision Transformer (DT) can generate optimal trajectories during test time, even when trained on random walk data for the shortest pathfinding problem.</p>
        <sec id="sec-2-11">
          <title>4.4. Additional Experiments with SCIP</title>
          <p>Our framework can integrate any ILP solver to enhance incumbent solutions. We primarily conducted experiments with Gurobi, given its status as a leading ILP solver. Additionally, we also present results utilizing SCIP (v6.0.1) as an alternative ILP solver. By employing the same settings as detailed in Section 4.1 and applying them to the four benchmarks, we display the results in Table 6. These outcomes align with those observed when using Gurobi as the ILP solver, albeit with SCIP exhibiting a notably lower performance compared to Gurobi.</p>
        </sec>
        <sec id="sec-2-12">
          <title>4.5. Testing on Real-World Instances in MIPLIB</title>
          <p>We follow the experimental settings for real-world instances in MIPLIB as described in [8]. We exclude “easy” instances with relatively small sizes, as well as instances where Gurobi cannot find any feasible solutions within a 3600-second time limit. Consequently, we choose 35 representative “hard” or “open” instances containing only integer variables. Within these instances, the number of variables ranges from 150 to 393,800 (averaging 49,563), and the number of constraints varies from 301 to 850,513 (averaging 96,778). We use the datasets in section 4.1 to train our model and evaluate our model (with Gurobi as the repair solver) on this realistic dataset, in the style of active search [33, 8, 34] on each instance. Our findings indicate that, with a 3600-second time limit, IPGPT surpasses both solvers on 20 of the 35 instances and exhibits comparable performance on 10 of the 35 instances. More details are provided in the appendix.</p>
        </sec>
      </sec>
      <sec id="sec-3">
        <title>5. Conclusion</title>
        <p>Addressing large-scale IP problems presents a formidable challenge. An increasingly prevalent approach for the automated design and tuning of IP solvers leverages data-driven methodologies. This paper concentrates on enhancing learning-based LNS approaches, given their ability to conveniently utilize any existing solver as a subroutine. We introduce IPGPT, a novel approach that models policy learning as a sequence to an MLC problem. It seamlessly integrates a customized decision transformer encoder, encompassing a causal transformer, to model the sequential processes of LNS, and an MLC decoder with contrastive learning to exploit correlations in variable selection. Furthermore, we carry out comprehensive experiments on four diverse benchmarks and real-world instances. The results suggest that our IPGPT
        References: [2] W.-Y. Ku, J. C. Beck, Mixed integer programming models for job shop scheduling: A computational analysis, Computers &amp; Operations Research 73 (2016) 165–173. [3] D. Chen, Y. Bai, S. Ament, W. Zhao, D. Guevarra, L. Zhou, B. Selman, R. B. van Dover, J. M. Gregoire, C. P. Gomes, Automating crystal-structure phase mapping by combining deep learning with constraint reasoning, Nature Machine Intelligence 3 (2021) 812–822. [4] S. Gollowitzer, I. Ljubić, MIP models for connected facility location: A theoretical and computational study, Computers &amp; Operations Research 38 (2011) 435–449. [5] R. M. Karp, Reducibility among combinatorial problems, in: Complexity of Computer Computations, Springer, 1972, pp. 85–103. [6] H. A. Taha, Integer programming: theory, applications, and computations, Academic Press, 2014. [7] Y. Bengio, A. Lodi, A. Prouvost, Machine learning for combinatorial optimization: a methodological tour d’horizon, European Journal of Operational Research 290 (2021) 405–421. [8] Y. Wu, W. Song, Z. Cao, J. Zhang, Learning large neighborhood search policy for integer programming, Advances in Neural Information Processing Systems 34 (2021) 30075–30087. [9] J. Song, Y. Yue, B. Dilkina, et al., A general large
approach consistently delivers substantial improvements neighborhood search framework for solving integer
over state-of-the-art solvers and exhibits excellent gener- linear programs, Advances in Neural Information
alization capabilities for larger instances. Processing Systems 33 (2020) 20012–20023.
[10] N. Sonnerat, P. Wang, I. Ktena, S. Bartunov, V. Nair,
Acknowledgments Learning a large neighborhood search algorithm
for mixed integer programs, arXiv preprint
The work of Caihua Liu is supported and funded by arXiv:2107.10201 (2021).
the Humanities and Social Sciences Youth Foundation, [11] V. Nair, M. Alizadeh, et al., Neural large
neighborMinistry of Education of the People’s Republic of China hood search, in: Learning Meets Combinatorial
(Grant No.21YJC870009). Algorithms at NeurIPS2020, 2020.
[12] D. Pisinger, S. Ropke, Large neighborhood search,
in: Handbook of metaheuristics, Springer, 2010, pp.</p>
        <p>References 399–419.
[13] S. Ropke, D. Pisinger, An adaptive large
neighborhood search heuristic for the pickup and delivery
problem with time windows, Transportation
science 40 (2006) 455–472.
[1] J. Mula, R. Poler, J. P. García-Sabater, F. C. Lario,</p>
        <p>Models for production planning under uncertainty:
A review, International journal of production
economics 103 (2006) 271–285.
[14] L. Perron, P. Shaw, V. Furnon, Propagation guided
large neighborhood search, in: International Conference on
Principles and Practice of Constraint Programming,
Springer, 2004, pp. 468–481.</p>
        <p>[15] D. Dumez, F. Lehuédé, O. Péton, A large
neighborhood search approach to the vehicle routing problem
with delivery options, Transportation Research Part B:
Methodological 144 (2021) 103–132.</p>
        <p>[16] R. S. Sutton, A. G. Barto, Reinforcement
learning: An introduction, MIT press, 2018.</p>
        <p>[17] R. J. Williams, Simple statistical
gradient-following algorithms for connectionist
reinforcement learning, Machine learning 8 (1992)
229–256.</p>
        <p>[18] F. Torabi, G. Warnell, P. Stone, Behavioral
cloning from observation, in: Proceedings of the 27th
International Joint Conference on Artificial Intelligence,
2018, pp. 4950–4957.</p>
        <p>[19] L. P. Kaelbling, M. L. Littman, A. W. Moore,
Reinforcement learning: A survey, Journal of artificial
intelligence research 4 (1996) 237–285.</p>
        <p>[20] S. Ross, D. Bagnell, Efficient reductions for
imitation learning, in: Proceedings of the thirteenth
international conference on artificial intelligence and
statistics, JMLR Workshop and Conference Proceedings, 2010,
pp. 661–668.</p>
        <p>[21] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover,
M. Laskin, P. Abbeel, A. Srinivas, I. Mordatch, Decision
transformer: Reinforcement learning via sequence modeling,
Advances in neural information processing systems 34 (2021)
15084–15097.</p>
        <p>[22] M. Gasse, D. Chételat, N. Ferroni, L. Charlin,
A. Lodi, Exact combinatorial optimization with graph
convolutional neural networks, Advances in Neural
Information Processing Systems 32 (2019).</p>
        <p>[23] S. Zhang, H. Tong, J. Xu, R. Maciejewski, Graph
convolutional networks: a comprehensive review,
Computational Social Networks 6 (2019) 1–23.</p>
        <p>[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention
is all you need, Advances in neural information processing
systems 30 (2017).</p>
        <p>[25] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,
I. Sutskever, et al., Language models are unsupervised
multitask learners, OpenAI blog 1 (2019) 9.</p>
        <p>[26] J. Bai, S. Kong, C. P. Gomes, Gaussian mixture
variational autoencoder with contrastive learning for
multi-label classification, in: International Conference on
Machine Learning, PMLR, 2022, pp. 1383–1398.</p>
        <p>[27] D. P. Kingma, J. Ba, Adam: A method for
stochastic optimization, arXiv preprint arXiv:1412.6980
(2014).</p>
        <p>[28] P. Erdős, A. Rényi, et al., On the evolution of
random graphs, Publ. Math. Inst. Hung. Acad. Sci 5 (1960)
17–60.</p>
        <p>[29] R. Albert, A.-L. Barabási, Statistical mechanics
of complex networks, Reviews of modern physics 74 (2002)
47.</p>
        <p>[30] E. Balas, A. Ho, Set covering algorithms using
cutting planes, heuristics, and subgradient optimization: a
computational study, Springer, 1980.</p>
        <p>[31] K. Leyton-Brown, M. Pearson, Y. Shoham, Towards a
universal test suite for combinatorial auction algorithms,
in: Proceedings of the 2nd ACM conference on Electronic
commerce, 2000, pp. 66–76.</p>
        <p>[32] D. Liu, M. Fischetti, A. Lodi, Learning to search
in local branching, in: Proceedings of the AAAI Conference
on Artificial Intelligence, volume 36, 2022,
pp. 3796–3803.</p>
        <p>[33] I. Bello, H. Pham, Q. V. Le, M. Norouzi,
S. Bengio, Neural combinatorial optimization with
reinforcement learning, arXiv preprint arXiv:1611.09940
(2016).</p>
        <p>[34] E. Khalil, H. Dai, Y. Zhang, B. Dilkina, L. Song,
Learning combinatorial optimization algorithms over graphs,
Advances in neural information processing systems 30
(2017).</p>
        <p>[35] E. Larsen, S. Lachapelle, Y. Bengio, E. Frejinger,
S. Lacoste-Julien, A. Lodi, Predicting solution summaries
to integer linear programs under imperfect information with
machine learning, arXiv preprint arXiv:1807.11876
(2018).</p>
        <p>[36] V. Nair, S. Bartunov, F. Gimeno, I. von Glehn,
P. Lichocki, I. Lobov, B. O'Donoghue, N. Sonnerat,
C. Tjandraatmadja, P. Wang, et al., Solving mixed integer
programs using neural networks, arXiv preprint
arXiv:2012.13349 (2020).</p>
        <p>[37] C. K. Joshi, T. Laurent, X. Bresson, An efficient
graph convolutional network technique for the travelling
salesman problem, arXiv preprint arXiv:1906.01227
(2019).</p>
        <p>[38] Y. Deng, S. Kong, C. Liu, B. An, Deep attentive
belief propagation: Integrating reasoning and learning for
solving constraint optimization problems, arXiv preprint
arXiv:2209.12000 (2022).</p>
        <p>[39] Y. Razeghi, K. Kask, Y. Lu, P. Baldi, S. Agarwal,
R. Dechter, Deep bucket elimination, in: IJCAI, 2021,
pp. 4235–4242.</p>
        <p>[40] Y. Deng, S. Kong, B. An, Pretrained cost model for
distributed constraint optimization problems, in:
Proceedings of the AAAI Conference on Artificial
Intelligence, volume 36, 2022, pp. 9331–9340.</p>
        <p>[41] S. Hochreiter, J. Schmidhuber, Long short-term
memory, Neural computation 9 (1997) 1735–1780.</p>
        <p>[42] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to
sequence learning with neural networks, Advances in neural
information processing systems 27 (2014).</p>
        <p>[43] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova,
BERT: Pre-training of deep bidirectional transformers for
language understanding, arXiv preprint</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A. Ablation Study</title>
      <p>[Ablation comparison of IPGPT⊖MLC, IPGPT⊖DT, and the
full IPGPT; table not recovered.]</p>
    </sec>
    <sec id="sec-4">
      <title>D. Other Related Work</title>
      <p>Learning to Optimize. Recently, there has been an
increasing interest in applying ML to learn to solve COPs.
Broadly speaking, there are three categories of
learning-to-optimize algorithms: (1) Learning to predict
solutions from inputs. Larsen et al. [35] train a deep
neural network (DNN) to predict the solution of a
stochastic load planning problem. Nair et al. [36] propose
neural diving, which learns a DNN to generate multiple
partial assignments for the integer variables; the
resulting smaller mixed integer programs (MIPs) over the
unassigned variables are solved with an off-the-shelf MIP
solver to construct high-quality joint assignments. Joshi
et al. [37] learn a DNN by supervision to predict the
probability of an edge being in the traveling salesman
problem (TSP) tour; a feasible tour is then generated by
beam search. (2) Learning to configure COP algorithms. Liu
et al. [32] learn to configure the search neighborhood size
of LB in each step by using RL. Deng et al. [38] integrate
belief propagation (BP), gated recurrent units (GRUs), and
graph attention networks (GATs) within the message-passing
framework to reason about dynamic weights and damping
factors for composing new BP messages. (3) Learning
alongside COP algorithms. Nair et al. [11] learn a DNN to
make variable selection decisions in B&amp;B to bound the
objective value gap with a small tree. Deep Bucket
Elimination (DBE) [39] uses DNNs to approximate the large
bucket functions. Deng et al. [40] propose a pretrained
cost model which predicts the optimal cost of a given
partially instantiated COP; the predicted cost is then used
to construct heuristics for various COP algorithms such as
LNS and B&amp;B. Our work belongs to the third category.</p>
      <p>Sequence Learning. The recent rapid development of
sequence modeling is largely due to the successful
applications of DNNs, from LSTMs [41] and
sequence-to-sequence models [42] to transformer
architectures with self-attention [24]. Sequence learning
aims to capture the temporal dependence of sequential data
(text, speech, video, etc.), and it is widely used in NLP
[43, 44]. There is also a recent attempt to apply sequence
learning in scientific discovery, such as using a causal
transformer to model material properties with sequential
structures [45]. In light of this, it is tempting to
consider how such sequence models can lead to improved
performance in LNS, which is also concerned with sequential
processes. A causal transformer is an architecture for
efficiently modeling sequences, and it is the cornerstone
of the decision transformer in RL [21]. However, little has
been done to model the sequential processes of LNS with
transformers; we address this limitation by adopting a
causal transformer (GPT2).</p>
      <p>Multi-Label Classification. Multi-label classification
(MLC) is a prediction task where each sample can have more
than one label. Unlike the single-label scenario, label
correlations are prevalent in MLC. Early works capture the
correlations through classifier chains [46], Bayesian
inference [47], and dimensionality reduction [48]. Thanks
to the huge capacity of DNNs, one can alleviate the
laborious feature mapping and therefore focus on the loss
function and on modeling feature-label and label-label
correlations [49, 50, 51]. It has been shown that
contrastive learning can exploit label information
effectively in a data-driven manner and learn meaningful
feature and label embeddings that capture the label
correlations and enhance the predictive power [26].
However, current LNS algorithms fail to explore the
correlations of variable selection in each step. In this
work, we formulate the policy learning of variable
selection as an MLC problem and adopt contrastive learning
to model category-level label correlations. To the best of
our knowledge, we are the first to use sequence to
multi-label learning to improve the performance of LNS,
which enables us to benefit from modeling long sequences of
LNS behaviors and from exploiting correlations in variable
selection simultaneously.</p>
      <p>Primal Heuristics. Numerous primal heuristic
algorithms have been proposed to enhance the efficiency of
solving ILPs [52]. Primal heuristics span from simpler
rounding heuristics [53] to more computationally demanding
diving and large neighborhood search (LNS) heuristics, such
as Relaxation Induced Neighborhood Search (RINS) [54]. LNS
heuristics are improvement heuristics that solve auxiliary
problems using the branch-and-bound technique.</p>
      <p>RINS, a prominent LNS meta-heuristic, seeks to improve
a given feasible MIP solution. By comparing the feasible
solution with one obtained from relaxing the integer
variables, it identifies the variables whose values differ
between the two and frees them for reoptimization. The
resulting sub-MIP is then solved using a MIP solver.</p>
      <p>The solution-polishing heuristic [55] employs a
variable-fixing neighborhood similar to RINS, while also
integrating an evolutionary-algorithm approach. Unlike
RINS, this heuristic uses crossover and mutation operations
to combine multiple solutions chosen from a pool of
available feasible solutions. The crossover process fixes
variables that have identical values across all selected
solutions, while mutation is introduced by randomly fixing
additional variables to refine already high-quality
solutions.</p>
      <p>Adaptive LNS [56] capitalizes on an ensemble of LNS
algorithms for MIPs and employs a multi-armed bandit to
adaptively switch among them during a MIP solve. Although
our work does not focus on an ensemble approach, IPGPT
could be incorporated as another ensemble member to enhance
performance.</p>
      <p>In contrast, learning-based LNS approaches, such as
IPGPT, can be regarded as primal heuristics automatically
learned through machine learning. These approaches showcase
significant potential by exploiting data-driven techniques,
which ultimately results in improved performance and
adaptability across a wide range of problem instances. This
work is particularly interested in advancing the
capabilities of learning-based LNS approaches.</p>
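<p>The RINS fixing step described above can be sketched in a few
lines; this is a toy illustration with our own function names
and a tolerance we chose for the example (real implementations
operate on the MIP solver's data structures):</p>

```python
def rins_fixing(incumbent, relaxation, tol=1e-6):
    """Return the variable fixings RINS would apply.

    incumbent:  integer-feasible values, one per integer variable.
    relaxation: values of the same variables in the LP relaxation optimum.
    Variables whose values agree (within tol) are fixed to the incumbent
    value; variables with differing values stay free in the sub-MIP.
    """
    fixed = {}
    free = []
    for i, (x_inc, x_rel) in enumerate(zip(incumbent, relaxation)):
        if abs(x_inc - x_rel) > tol:
            free.append(i)       # differing value: reoptimize this variable
        else:
            fixed[i] = x_inc     # agreeing value: fix to the incumbent

    return fixed, free

# Variables 0 and 3 agree between incumbent and relaxation, so they
# are fixed; the sub-MIP reoptimizes only variables 1 and 2.
fixed, free = rins_fixing([1, 0, 1, 0], [1.0, 0.4, 0.7, 0.0])
```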
    </sec>
    <sec id="sec-5">
      <title>E. Further Implementation Details</title>
      <p>We implemented our GPT-2 using the Huggingface
Transformers library. The hyperparameters we employed are
as follows: (1) Number of layers: 3; (2) Number of
attention heads: 1; (3) Embedding dimension: 128; (4)
Nonlinearity function: ReLU; (5) Batch size: 128; (6) Context
length K: 25; (7) Dropout ratio: 0.1; (8) Learning rate:
1e-4; (9) Gradient norm clipping: 0.25. We maintained
other parameters at their default values. We trained the
model from scratch and did not utilize any pre-trained
weights.</p>
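<p>For reference, the listed hyperparameters can be collected as a
configuration sketch. The mapping to Huggingface GPT2Config
argument names is our assumption (the snippet only builds plain
dictionaries, so it runs without the library installed):</p>

```python
# Model hyperparameters from the text, keyed by the corresponding
# GPT2Config argument names (our mapping, an assumption).
gpt2_kwargs = dict(
    n_layer=3,                    # (1) number of layers
    n_head=1,                     # (2) number of attention heads
    n_embd=128,                   # (3) embedding dimension
    activation_function="relu",   # (4) nonlinearity
    resid_pdrop=0.1,              # (7) dropout ratio, applied to all
    embd_pdrop=0.1,               #     three dropout sites
    attn_pdrop=0.1,
)

# Training-loop settings from the same list.
train_kwargs = dict(
    batch_size=128,      # (5)
    context_length=25,   # (6) context length K
    learning_rate=1e-4,  # (8)
    grad_norm_clip=0.25, # (9)
)

# With a single head, attention operates over the full 128-dim embedding.
head_dim = gpt2_kwargs["n_embd"] // gpt2_kwargs["n_head"]
# The model would be built as GPT2Model(GPT2Config(**gpt2_kwargs));
# all remaining parameters stay at their library defaults, as in the text.
```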
      <p>Grid search is adopted for tuning. We tune the learning
rate from 0.00005 to 0.002 with interval 0.00005, the
dropout ratio from [0.05, 0.1, 0.3, 0.5], the weight decay
from [0, 0.01, 0.0001], the loss coefficient from
[0.1, 0.5, 1, 1.5, 2.0], the token embedding size from
[64, 128, 256, 512], the context length K from
[15, 20, 25, 30], the batch size from [64, 128, 256], and
the gradient norm clipping from
[0.15, 0.2, 0.25, 0.3, 0.35].</p>
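<p>The search space above can be enumerated with a standard grid.
A minimal sketch follows; the key names are our own shorthand
(including "coef" for the coefficient whose symbol was lost in
extraction), and in practice one would sample a subset of the
product rather than evaluate it exhaustively:</p>

```python
import itertools

# Hyperparameter grids as described above (names are our shorthand).
grid = {
    "lr": [round(5e-5 * i, 5) for i in range(1, 41)],  # 0.00005..0.002
    "dropout": [0.05, 0.1, 0.3, 0.5],
    "weight_decay": [0, 0.01, 0.0001],
    "coef": [0.1, 0.5, 1, 1.5, 2.0],
    "embed_dim": [64, 128, 256, 512],
    "context_len": [15, 20, 25, 30],
    "batch_size": [64, 128, 256],
    "grad_clip": [0.15, 0.2, 0.25, 0.3, 0.35],
}

def grid_configs(grid):
    """Yield one config dict per point of the full Cartesian product."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# Size of the full product: 40 * 4 * 3 * 5 * 4 * 4 * 3 * 5 = 576,000
# points, which is why only a fraction of configurations is evaluated.
n_configs = 1
for values in grid.values():
    n_configs *= len(values)
```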
      <p>In the data collection process, we utilize random LNS
with an adaptive neighborhood size. The neighborhood size
is initially set to 10% of the number of variables in the
input problem instance and is then adapted following the
approach described in [10].</p>
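<p>A minimal sketch of such an adaptive scheme is given below. The
growth/shrink factors and bounds here are illustrative
assumptions of ours, not the exact schedule of [10]; only the 10%
initialization comes from the text:</p>

```python
def adapt_neighborhood_size(size, improved, n_vars,
                            grow=1.5, shrink=0.67,
                            min_frac=0.01, max_frac=0.5):
    """Adapt the LNS neighborhood size between steps.

    Illustrative rule (factors are our assumption): grow the
    neighborhood when the last repair step failed to improve the
    incumbent, shrink it after an improvement to keep sub-problems
    cheap, and clamp the result to sane bounds.
    """
    size = int(size * (shrink if improved else grow))
    lo = max(1, int(min_frac * n_vars))
    hi = int(max_frac * n_vars)
    return min(max(size, lo), hi)

n_vars = 1000
size = int(0.10 * n_vars)  # initial neighborhood: 10% of the variables
# One step without improvement enlarges the neighborhood.
size = adapt_neighborhood_size(size, improved=False, n_vars=n_vars)
```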
      <p>During testing, at each step, the model autoregressively
generates an action distribution for each dimension. We
apply a threshold of 0.5 to convert these values into 1 or
0, representing the selection or non-selection of the
corresponding variable at that step.</p>
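<p>As a sketch of this decoding step (the function and variable
names are ours, not taken from the released code), the
per-dimension probabilities are mapped to a binary selection
mask:</p>

```python
def decode_selection(probs, threshold=0.5):
    """Convert per-variable action probabilities into a 0/1 mask.

    probs: floats in [0, 1], one per integer variable, produced
    autoregressively by the model at the current LNS step. A 1 in
    the output marks a variable selected for reoptimization.
    """
    return [1 if p >= threshold else 0 for p in probs]

# Example: four variables, two of which clear the 0.5 threshold and
# are therefore destroyed and reoptimized by the repair solver.
mask = decode_selection([0.9, 0.2, 0.55, 0.1])
```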
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>