<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IPGPT: Solving Integer Programming Problems with Sequence to Contrastive Multi-Label Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shufeng Kong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caihua Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carla Gomes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cornell University</institution>
          ,
          <addr-line>Ithaca</addr-line>
          ,
          <country country="US">U.S.A</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Guilin University of Electronic Technology</institution>
          ,
          <addr-line>Guilin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sun Yat-sen University</institution>
          ,
          <addr-line>Zhuhai Campus</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>Integer Programming (IP) is an essential class of combinatorial optimization problems (COPs). Its inherent NP-hardness has fostered considerable efforts towards the development of heuristic strategies. An emerging approach involves leveraging data-driven methods to automatically learn these heuristics. For example, using deep (reinforcement) learning to recurrently reoptimize an initial solution with Large Neighborhood Search (LNS) has demonstrated exceptional performance across numerous applications. A pivotal challenge within LNS lies in identifying an optimal subset of variables for reoptimization at each stage. Existing methods typically learn a policy to select a subset, either by maintaining a fixed cardinality or by decomposing the subset into independent binary decisions for each variable. However, such strategies overlook the modeling of LNS's sequential processes and fail to explore the correlations inherent in variable selection. To overcome these shortcomings, we introduce IPGPT, an innovative model that reimagines policy learning as a sequence-to-multi-label classification (MLC) problem. Our approach uniquely integrates a tailored Decision Transformer encoder, incorporating a causal transformer (GPT2) to capture the sequential nature of LNS. Additionally, we employ an MLC decoder with contrastive learning to exploit the correlations in variable selection. Our extensive experiments confirm that IPGPT significantly surpasses the performance of current state-of-the-art baselines and exhibits excellent generalization to larger instances.</p>
      </abstract>
      <kwd-group>
        <kwd>Integer Programming</kwd>
        <kwd>contrastive learning</kwd>
        <kwd>causal transformer</kwd>
        <kwd>Multi-Label Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>IP has found applications in production planning [1], scheduling [2], scientific discovery [3], and telecommunications networks [4], among many others. It is well-known that IP is NP-complete [5] and many efforts have been devoted to designing effective heuristics to find near-optimal solutions [6]. Historically, such algorithms were designed largely manually, requiring a careful understanding of the underlying structure within specific classes of optimization problems.</p>
      <p>Due to the recent success of deep learning (DL) and reinforcement learning (RL), there has been an increasing interest in automatically learning heuristics for COPs from training data [7]. Existing works often leverage machine learning (ML) to output solutions directly from input instances, configure hyperparameters of COP algorithms, or learn a local decision policy for search frameworks such as branch&amp;bound (B&amp;B), local branching (LB), or LNS. Among them, we are particularly interested in learning to iteratively reoptimize an initial solution with LNS [8, 9, 10, 11]. These approaches are attractive because we can leverage existing state-of-the-art commercial IP solvers such as Gurobi or SCIP as a generic black-box subroutine and thus benefit from the cutting-edge technologies of such commercial IP solvers. In this paper, we focus on boosting the performance of LNS, though our method can also be applied to boost the performance of other local search algorithms such as LB.</p>
      <p>A key challenge of LNS is to select a promising variable subset to reoptimize based on the current solution. Since the selection choice is combinatorial, finding an optimal subset is also computationally hard. Song et al. [9] learn to select fixed, predefined variable subsets with imitation learning and RL. Wu et al. [8] learn to select arbitrary variable subsets with RL by factorizing the selection of a variable subset into elementary selections on each variable separately. Similarly, Sonnerat et al. [10] learn to predict the probability of selecting a variable independently of other variables using imitation learning, and Nair et al. [11] use RL to learn a policy that selects one variable at a time. Nevertheless, all of these works miss modeling the sequential processes of LNS and also do not exploit correlations in variable selection. To address these limitations, we propose to model the policy learning as a sequence to multi-label classification problem, which jointly models the selection of variables as well as the sequential processes of LNS.</p>
      <p>STRL’23: Second International Workshop on Spatio-Temporal Reasoning and Learning, August 2023, Macao, S.A.R.</p>
      <p>Our contributions are threefold: (1) we give a new angle of sequence to multi-label classification for learning an effective local decision policy for LNS; (2) we materialize this idea by providing a novel model to seamlessly integrate a customized decision transformer encoder to model the sequential processes of LNS and an MLC decoder with contrastive learning to exploit correlations of variable selection; (3) we conduct extensive experiments on various benchmarks and the results show that our model significantly outperforms state-of-the-art baselines. We have reserved an in-depth discussion of the related work for the appendix due to space constraints.</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Preliminaries</title>
      <sec id="sec-1b-1">
        <title>2.1. Integer Program</title>
        <p>An integer program (IP) is a problem of optimizing a linear function over points in a polyhedral set: arg min{cᵀx | Ax ≤ b; x ≥ 0; x ∈ Zⁿ}, where x ∈ Zⁿ is a vector of n decision variables; c ∈ Rⁿ denotes the vector of objective coefficients; the incidence matrix A ∈ R^{m×n} and vector b ∈ R^m together define m linear constraints.</p>
      </sec>
      <sec id="sec-1-1">
        <title>2.2. LNS and Its Markov Decision Process Formulation</title>
        <p>Given an initial assignment of values to the decision variables in an IP instance, LNS iteratively refines this assignment by selecting a subset of decision variables, relaxing their values, and solving a subproblem that aims to optimize the objective function while respecting the instance’s constraints. LNS aims to explore a complex solution neighborhood and gradually improve its current solution until a certain termination condition is met [12]. A key challenge of LNS is how to define a good solution neighborhood, namely, one needs to decide which variable subset to reoptimize given the current solution. Obviously, such a decision problem is combinatorial, and many works are devoted to constructing effective heuristics for it [13, 14, 15]. In this work, we are particularly interested in the recent trend of learning-based approaches, where data-driven methods are applied to learn the heuristics automatically [9, 8, 10]. To this end, the LNS framework can be formulated as a Markov Decision Process (MDP) (S, A, P, R):
        • S is a set of states. A state s_t ∈ S describes the current status of the LNS process in step t, which normally includes the static IP instance information (e.g., variables, constraints, and objectives) and the dynamic solving statistics (e.g., the incumbent solution);
        • A is a set of all candidate variable subsets for reoptimization. A variable subset a_t ∈ A is also called an action of an agent that is executed in step t;
        • P(s_t, a_t) is the transition function to return the next state. Let x_t be the solution with state s_t; a smaller sub-IP is first generated by keeping the values of the non-selected variables in x_t and reoptimizing the remainder, and then the next state s_{t+1} is obtained by updating s_t with the new solution to the sub-IP: x_{t+1} = arg min{cᵀx | Ax ≤ b; x ≥ 0; x ∈ Zⁿ; x_i = x_{t,i}, ∀i ∉ a_t};
        • R(s_t, a_t) is the reward function to return the change of objective values, which is defined as r_t = R(s_t, a_t) = cᵀ(x_t − x_{t+1}). Let T be the step limit; the cumulative reward from step t of an episode is defined as G_t = ∑_{t′=t}^{T} γ^{t′−t} r_{t′} with a discount factor γ ∈ [0, 1].</p>
        <p>A policy is a (potentially probabilistic) mapping π : S → A. The goal of RL-based algorithms for solving IPs is to find a policy function to maximize the expected cumulative reward E[G_1] over all episodes, i.e., the expected improvement over initial solutions. However, existing RL-based algorithms for IP solving train a policy by either temporal difference (TD) learning [16], policy gradient [17], or behavior cloning [18], all of which miss modeling the sequential processes of LNS explicitly. Furthermore, RL-based algorithms may suffer from various issues: the need for bootstrapping to propagate returns in TD learning can cause stability problems, discounting future rewards can induce undesirable short-sighted behaviors, policy gradient is known to be sample inefficient, and behavior cloning can suffer from cascading errors [19, 20, 21]. To circumvent these disadvantages, we propose to learn a policy with decision transformers, which seeks to benefit from modeling the sequential processes of LNS and better generalization.</p>
      </sec>
      <sec id="sec-1b-3">
        <title>2.3. Decision Transformer</title>
        <p>Decision transformer (DT) [21] abstracts the decision-making process in RL as a sequence modeling problem and attempts to learn a return-conditioned state-action mapping. The return-conditionality means that given a history of return-state-action tokens, such that the first token represents the desired return at the current state, the DT predicts the action required to achieve this desired return. In this paper, we follow the convention of the original DT and define the return R̂_t as the non-discounted rewards-to-go: R̂_t = ∑_{t′=t}^{T} r_{t′}. DT takes as input a sequence of three-tokens: (⟨R̂_{t−K+1}, s_{t−K+1}, a_{t−K+1}⟩, · · · , ⟨R̂_t, s_t, a_t⟩), where K ≤ T is the context length. Each token is then encoded into an embedding and added by a positional encoding. Let (⟨e^{R̂}_{t−K+1}, e^s_{t−K+1}, e^a_{t−K+1}⟩, · · · , ⟨e^{R̂}_t, e^s_t, e^a_t⟩) be the corresponding sequence of embeddings, and it is fed into a causal transformer to produce another sequence of embeddings (⟨h^{R̂}_{t−K+1}, h^s_{t−K+1}, h^a_{t−K+1}⟩, · · · , ⟨h^{R̂}_t, h^s_t, h^a_t⟩).</p>
        <p>A decoder takes as input h^s_t and outputs â_t. During training, a suitable loss function is applied to penalize the difference between the prediction â_t and the label a_t.</p>
        <p>During inference, after specifying a target return based on desired performance and the environment starting state, DT generates actions autoregressively. The actions are executed and the target return is subtracted by the achieved rewards to obtain the next states. The process of generating actions and applying them to obtain the next return-to-go and state is repeated until episode termination.</p>
      </sec>
    </sec>
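      <p>To make the reward and rewards-to-go definitions above concrete, the following self-contained Python sketch (our illustration, not the authors' code; the helper names are hypothetical) computes the per-step reward r_t = cᵀ(x_t − x_{t+1}) and the non-discounted rewards-to-go the decision transformer conditions on.</p>

```python
# Illustrative sketch of the LNS-MDP reward and the DT rewards-to-go.
# Hypothetical helpers; not the paper's implementation.

def reward(c, x_t, x_next):
    """Objective improvement r_t = c^T (x_t - x_{t+1}) when minimizing c^T x."""
    return sum(ci * (a - b) for ci, a, b in zip(c, x_t, x_next))

def rewards_to_go(rewards):
    """Non-discounted suffix sums: R_t = r_t + r_{t+1} + ... + r_T."""
    rtg = [0.0] * len(rewards)
    acc = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        acc += rewards[t]
        rtg[t] = acc
    return rtg

# Example: a trajectory of solutions x_0, x_1, x_2 that improves c^T x.
c = [3.0, 1.0, 2.0]
xs = [[2, 2, 2], [1, 2, 2], [1, 1, 1]]
rs = [reward(c, xs[t], xs[t + 1]) for t in range(len(xs) - 1)]
print(rs)                 # [3.0, 3.0]
print(rewards_to_go(rs))  # [6.0, 3.0]
```

      <p>Conditioning the first input token on a large desired rewards-to-go is what lets DT, at test time, ask for trajectories better than the ones it was trained on.</p>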
    <sec id="sec-2">
      <title>3. Solving IPs with Sequence to Contrastive Multi-Label Classification</title>
      <p>In this section, we propose a novel model IPGPT for the problem given in equation (2) based on the causal transformer and contrastive learning.</p>
      <sec id="sec-2-2">
        <title>3.1. A Novel Model IPGPT for Solving IPs</title>
        <p>Instead of learning to select fixed, predefined variable subsets [9], Wu et al. [8] factorize the combinatorial action space A into elementary actions on each dimension (i.e., variables), where a_{t,i} ∈ {0, 1} denotes the elementary action of whether to select x_i for reoptimization in step t, and a_{t,i} is 1 if x_i is selected and 0 otherwise. Therefore, any action can be expressed as a_t = ∪_{i=1}^{n} a_{t,i} and the action selection problem can be converted into n separate binary classification problems. The policy for action selection is factorized by
        π(a_t | s_t) = ∏_{i=1}^{n} π_i(a_{t,i} | s_t), (1)
        which expresses the probability of selecting an action as the product of the probabilities of selecting its elements.</p>
        <p>However, such an action space factorization limits the class of policies that can be learned and it also fails to explore the correlations between elementary actions. To address these limitations, we propose to model the policy learning as a sequence to multi-label classification problem, which jointly models the selection of multiple elementary actions as well as the sequential processes of LNS, i.e.,
        π(a_t | s_t) = f(a_t | h(g(t, K))), (2)
        where g(t, K) denotes the function to return the last sequence of return-state-action tokens from steps t − K + 1 to t, i.e., (⟨R̂_{t−K+1}, s_{t−K+1}, a_{t−K+1}⟩, · · · , ⟨R̂_t, s_t, ·⟩); h(·) denotes the sequence encoder (a neural network); and the MLC decoder f (another neural network) takes as input the state embedding h^s_t produced by h and outputs the action distribution f(a_t | h^s_t). Effective implementations of the sequence encoder and the MLC decoder are crucial to this work.</p>
        <p>Figure 1: The architecture of our model IPGPT: it consists of several token encoders to produce latent token embeddings, a causal transformer to capture dependence between token embeddings, and a contrastive MLC decoder to exploit correlations between label categories. Here we use a small IP with 4 variables and 3 constraints to show the full pipeline of our model. The problem is first translated into a factor graph G, and G is associated with dynamic factor-node features that describe the states of the MDP and are encoded into state embeddings in different steps. Similarly, returns R̂ and actions a are also encoded into latent embeddings. We use a GCN as the state encoder and two simple MLPs as the return and action encoders respectively. Each token embedding is further added with its relative positional encoding. The sequence of embeddings is fed into the causal transformer to produce another sequence of embeddings. Finally, the contrastive MLC decoder takes as inputs the state embeddings and outputs action predictions.</p>
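        <p>A minimal sketch of the factorization in eq. (1) (our illustration; the function name is hypothetical): given independent per-variable selection probabilities p_i, the probability of a selection vector is a product of Bernoulli terms.</p>

```python
# Sketch of the factorized policy of eq. (1): independent Bernoulli selections.
# Illustrative code, not the paper's implementation.

def factorized_action_prob(p, a):
    """P(a | s) = prod_i p_i^{a_i} * (1 - p_i)^{1 - a_i}."""
    prob = 1.0
    for p_i, a_i in zip(p, a):
        prob *= p_i if a_i == 1 else (1.0 - p_i)
    return prob

p = [0.9, 0.1, 0.5, 0.5]  # per-variable probabilities pi_i(a_{t,i} = 1 | s_t)
print(factorized_action_prob(p, [1, 0, 0, 1]))  # 0.9 * 0.9 * 0.5 * 0.5 ≈ 0.2025
```

        <p>Because the factors are independent, no correlated selection rule (e.g., "select x_3 exactly when x_4 is selected") can be represented, which is the limitation the joint MLC decoder of eq. (2) is designed to remove.</p>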
        <sec id="sec-2-2-1">
          <title>3.1.1. Factor Graph Representation</title>
          <p>An IP instance can be represented by a factor graph [22], which is a bipartite graph G = (V, U, E) consisting of variable-nodes V = {v_1, · · · , v_n} and factor-nodes U = {u_1, · · · , u_m}. Variable nodes correspond to the variables and factor nodes correspond to the constraints in the IP. An edge e_{ij} ∈ E between v_i and u_j is established only if the j-th constraint contains the i-th variable. The variable nodes are associated with a feature matrix X ∈ Z^{n×d_v}, where d_v is the number of features for each variable node. The features of each variable-node v_i include two parts: (1) static features: a one-hot vector indicating the node type and the objective coefficient c_i of x_i; (2) dynamic features: the current solution of x_i in step t and the incumbent solution of x_i. Note that the dynamic features are used to describe the states of the MDP in different steps. The factor nodes are also associated with a feature matrix F ∈ Z^{m×d_u}, where d_u is the number of features for each factor node. The features of each factor-node u_j only include static features: a one-hot vector indicating the node type and the value b_j at the right-hand side (RHS) of the j-th constraint. Finally, the weight matrix of the edges is exactly the incidence matrix.</p>
        </sec>
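        <p>The bipartite construction above can be sketched as follows (illustrative Python, not the paper's code; node features are omitted for brevity): variable i and constraint j are joined exactly when A has a nonzero coefficient at (j, i).</p>

```python
# Sketch: build the edge set of the variable/factor bipartite graph from the
# incidence matrix A of Ax <= b. Illustration only; features are omitted.

def build_factor_graph(A):
    """Return edges {(var_i, con_j)}: var i touches constraint j iff A[j][i] != 0."""
    m, n = len(A), len(A[0])
    edges = set()
    for j in range(m):          # factor (constraint) nodes u_j
        for i in range(n):      # variable nodes v_i
            if A[j][i] != 0:
                edges.add((i, j))
    return edges

# Small IP with 4 variables and 3 constraints, as in Figure 1.
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
print(sorted(build_factor_graph(A)))
# [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (3, 2)]
```

        <p>The nonzero entries of A then double as the edge weights, matching the statement that the weight matrix of the edges is exactly the incidence matrix.</p>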
        <sec id="sec-2-2-2">
          <title>3.1.2. Model Architecture</title>
          <p>State Encoder: a GCN takes as input the factor graph of an IP instance and learns node embeddings. Each convolution layer updates the variable-node and factor-node embeddings as
          V^{(l+1)} = LN(σ(Aᵀ U^{(l)} W_1^{(l)})), U^{(l+1)} = LN(σ(A V^{(l)} W_2^{(l)})), (3)
          where W_1^{(l)}, W_2^{(l)} ∈ R^{d_h×d_h} are trainable weight matrices in the l-th layer; V^{(l)} ∈ R^{n×d_h} and U^{(l)} ∈ R^{m×d_h} are embeddings for variable-nodes and factor-nodes respectively in the l-th layer; LN and σ(·) denote layer normalization and the Tanh activation function respectively. The initial embeddings V^{(0)} and U^{(0)} are linear projections of the raw feature matrices X and F respectively. In this paper, all MLP encoders only have two layers and the embeddings’ dimension d_h is set to 128; the GCN encoder consists of two convolution layers and a mean pooling layer.</p>
          <p>Causal Transformer: the causal transformer [24] is an architecture to efficiently model sequences that consists of stacked self-attention layers with residual connections. In our model, each layer receives a sequence of N = 3K token embeddings {e_i}_{i=1}^{N} and outputs N embeddings {h_i}_{i=1}^{N}, preserving the input dimensions. Specifically, each token embedding e_i is mapped to a key k_i, a query q_i, and a value v_i via linear functions, and the output h_i is given by
          h_i = ∑_{j=1}^{i} α_{ij} v_j, α_{ij} = exp(q_i · k_j) / ∑_{j′=1}^{i} exp(q_i · k_{j′}). (4)
          In this work, we adopt the causal transformer GPT2 [25] to learn and reason about sequences and we defer the other architecture details to the original paper.</p>
          <p>Contrastive MLC Decoder: recall that an elementary action a_{t,i} ∈ {0, 1} (a.k.a. a label) denotes whether or not to select variable x_i for reoptimization in step t. A label vector a_t = ∪_{i=1}^{n} a_{t,i} ∈ {0, 1}^n denotes the selected action given state s_t. Different from eq. (1), which approximates π(a_t | s_t) with n separate binary classification problems, we propose to approximate π(a_t | s_t) with an MLC decoder that finds a mapping from h^s_t to a_t, where h^s_t is the state embedding generated by the causal transformer and served as an input feature for our MLC decoder. A key aspect of learning a policy with an MLC module is that we can exploit the correlations between elementary actions, which is missing in the existing ML-boosted IP solvers [8, 9, 10].</p>
          <p>We propose to exploit category-level label correlation with contrastive learning based on the MLC model GMVAE [26]. GMVAE assumes that the number of labels is fixed and label embeddings are shared across all samples. This is not applicable to our case since different instances may have a different number of decision variables, i.e., actions from different instances may have different cardinalities. Alternatively, we learn category-level label embeddings for each IP instance with a shared GCN, and the embeddings are only shared across samples within each instance. Fig. 2 gives the architecture of our MLC decoder. We denote by c_i the i-th category of labels {a_{t′,i}} collected from steps [t − K + 1, t]. Our idea is as follows: (1) we learn an embedding z_i for each label category c_i such that labels within the same category share the same embedding. Since the number of label categories is exactly the number of variables in an IP instance, we use the GCN described in equation (3) to take as input an IP instance and output node embeddings, and we use the variable-node embeddings as the label category embeddings respectively; (2) we use the state embedding h^s_{t′} of each LNS step as an input feature whose inner products with the label embeddings correspond to feature-label similarity and can be used for prediction; (3) we use contrastive learning to capture correlations between label categories by pulling similar categories’ embeddings together. Specifically, let C ≡ {c_1, · · · , c_n} and define the positive label set of a_{t′} as P(a_{t′}) ≡ {c_i ∈ C | a_{t′,i} = 1}. Given a sequence of return-state-action tokens (⟨R̂_{t−K+1}, s_{t−K+1}, a_{t−K+1}⟩, · · · , ⟨R̂_t, s_t, a_t⟩), our decoder is designed to optimize the following contrastive loss function:
          L_cl = (1/K) ∑_{t′=t−K+1}^{t} (1/|P(a_{t′})|) ∑_{c_i ∈ P(a_{t′})} − log [ exp(h^s_{t′} · z_i) / ∑_{c_j ∈ C} exp(h^s_{t′} · z_j) ], (5)
          where the state embeddings h^s_{t′} for t′ ∈ [t − K + 1, t] are generated by the causal transformer and are computed as in eq. (4).</p>
          <p>For example, if in most of the actions a_{t′}, labels a_{t′,i} and a_{t′,j} often appear together (i.e., they both equal 1), contrastive learning will implicitly pull their embeddings together. In other words, if two labels do co-appear often, their label embeddings would become similar. On the other hand, if they never co-occur or only co-appear occasionally, their connections are not significant and our decoder will not optimize for their similarity.</p>
          <p>Figure 2: The architecture of our MLC decoder. Firstly, a GCN takes as input an IP instance and learns embeddings for the label categories respectively, and labels within the same category share the same embedding. Secondly, we use the state embedding in step t as an input feature whose inner products with the label embeddings are used to produce the prediction â_t. Lastly, a contrastive loss is designed to pull together the feature embedding and the positive label embeddings, while separating the feature embedding from the negative label embeddings. Note that a label embedding is positive only if its label is 1. An example with label [1, 0, 0, 1] is shown.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.2. Training Algorithm</title>
        <p>Our model is trained with supervised learning. Given a set of training IP instances, we first collect a dataset of sequences of return-state-action tokens that are generated by the MDP with some expert policy: D = {(⟨R̂_{t−K+1}, s_{t−K+1}, a_{t−K+1}⟩, · · · , ⟨R̂_t, s_t, a_t⟩)}, where K ≤ T is the length of each sequence. For each sequence τ ∈ D, our model takes as input τ and generates a set of action predictions {â_{t′}}, and we also collect the set of labels from τ, {a_{t′}}. A supervised cross-entropy loss for each sequence is given by
        L_ce = (1/K) ∑_{t′=t−K+1}^{t} (1/n) ∑_{i=1}^{n} a_{t′,i} log â_{t′,i} + (1 − a_{t′,i}) log(1 − â_{t′,i}). (6)
        The final objective function to minimize is given by
        L = L_ce − λ L_cl, (7)
        where λ is a trade-off weight. The model is trained with Adam [27] and optimized with L. We sample mini-batches of sequence length K from the dataset D and the model is trained with GPUs in parallel. However, similar to DT, our model generates action predictions autoregressively during testing. We refer the reader to section 2.3 for more details.</p>
      </sec>
      <sec id="sec-2-8">
        <title>4. Experiments</title>
        <p>We conduct experiments on four diverse NP-hard COP benchmarks, including minimum vertex cover (MVC), maximum cut (MC), set covering (SC), and combinatorial auction (CA). We follow the experimental settings of [9] and [8].</p>
        <sec id="sec-2-9">
          <title>4.1. Datasets and Experimental Setup</title>
          <p>Datasets. MVC and MC are graph optimization problems. For MVC, we use the Erdős–Rényi (ER) model [28] to generate random graphs with 1000 nodes and an edge probability of 0.15. For MC, we use the Barabasi-Albert (BA) model [29] to generate random graphs with 500 nodes and an average degree of 4. For SC, we generate instances with matrices having 5000 rows and 1000 columns following the procedure in [30], where each entry ∈ {0, 1} represents whether the i-th element in the universe belongs to the j-th set. For CA, we use the Combinatorial Auction Test Suite (CATS) [31] with arbitrary relationships to generate instances with 2000 items and 4000 bids. For each problem type, we generate 100, 20, and 50 instances for training, validation, and testing, respectively. For each training instance, we use LNS with a random policy to run on it for 100 steps and set a time limit of 2s for solving the sub-IP in each step with Gurobi. We run it 30 times and collect the trajectory with the best objective values for training IPGPT, and we randomly sample 20 sequences of return-state-action tokens with length K = 25 from each trajectory. Therefore, our dataset for training IPGPT includes 2000 sequences of three-tokens.</p>
          <p>Initialization. LNS starts with a feasible initial solution. For MVC, MAXCUT, SC, and CATS, we initialize a feasible solution by including all vertices in the cover set, randomly partitioning all vertices into two complementary sets, including all sets in the set cover, and accepting no bids, respectively. The initialization process does not incur additional computational costs in our experiments.</p>
          <p>Implementation. The return, action, and position encoders are simple MLPs with 2 layers and 128 hidden nodes. The state encoder is a GCN with 2 convolution layers and a mean pooling layer. We use GPT2 as our causal transformer. The dimensions of all hidden embeddings are set to 128. We set the batch size and the number of training epochs to 128 and 100, respectively, for all experiments. Our model was implemented with the PyTorch deep learning framework and the whole model was trained using the Adam optimizer [27] with a learning rate of 0.0001 and a weight decay ratio of 0.01 in an end-to-end fashion. All experiments were carried out on a machine with a 4.2 GHz quad-core Intel i7 CPU, 16 GB RAM, and an Nvidia RTX 3090 24GB GPU card. Further implementation details can be found in the appendix.</p>
          <p>Baselines. We compare our method with four baselines: (1) Gurobi (version 9.5) with default settings: a leading state-of-the-art IP solver; (2) FT-LNS: the best-performing LNS version by [9], which applies imitation learning to mimic the best demonstrations; (3) RL-LNS: the current state-of-the-art learning-based LNS method for solving IPs [8], which uses deep RL to learn an LNS policy via action factorization to represent all potential variable subsets; (4) LB-SRMRL: the best-performing LB version by [32], which uses a regression model and RL to learn a hybrid model to predict and adapt the neighborhood size for the LB heuristic. We follow the default settings of these learning-based baselines and further fine-tune them on our datasets to get the best hyperparameters. For more details of the settings of these baselines, we refer the reader to their original papers.</p>
          <p>Evaluation Metrics. The performances of different algorithms are compared in two measures: (1) the objective of solutions returned by different algorithms within a time limit; (2) the gap between solutions, namely, the ratio of the objective difference w.r.t. the best result.</p>
        </sec>
        <sec id="sec-2-9b">
          <title>4.2. Experimental Results</title>
          <p>A comparison of IPGPT and other state-of-the-art baselines on the 4 diverse benchmarks is given in Table 6. On the CA-2000 and MVC-2000 benchmarks, the objectives (Obj. ± Std.%) are: Gurobi −111668 ± 2.0 and 392.5 ± 1.3; FT-LNS −110041 ± 1.6 and 390.5 ± 1.1; RL-LNS −111787 ± 2.6 and 375.8 ± 2.1; LB-SRMRL −110741 ± 3.1 and 395.2 ± 1.9; IPGPT −114535 ± 1.8 and 365.1 ± 1.1. All learning-based algorithms, including our IPGPT, call Gurobi to solve sub-IPs with a time limit of 2s at every step. We can observe that LB-SRMRL is not comparable to the other algorithms. RL-LNS remains the most competitive baseline and consistently outperforms both Gurobi and FT-LNS. Our IPGPT consistently outperforms RL-LNS and Gurobi on all benchmarks by averaged ratios of 2.41% and 3.68%, respectively. Overall, these results suggest that our approach can reliably offer substantial improvements over state-of-the-art solvers.</p>
          <p>We also compare the generalization ability of all algorithms to solve large IPs. To this end, we generate two sets of testing instances following the same settings as in section 4.1 but double and quadruple the number of variables respectively. Note that we only generate 50 testing instances for each problem type without considering training and validation. We test all (trained) models on these new instances and summarize the results in Tables 7 and 3. We can observe that the advantage of our IPGPT is still preserved on larger problem instances compared to the baselines. Specifically, Table 7 shows that IPGPT still consistently outperforms all baselines on the 4 benchmarks when the instance size is doubled; IPGPT outperforms RL-LNS and Gurobi by averaged ratios of 2.06% and 4.56% respectively. On the other hand, Table 3 shows that IPGPT outperforms all baselines on 3 out of 4 benchmarks when the instance size is quadrupled; IPGPT outperforms RL-LNS and Gurobi by averaged ratios of 1.12% and 3.03% respectively. In summary, our IPGPT learned on small instances generalizes well to larger instances, with a persistent advantage over other methods.</p>
        </sec>
      </sec>
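      <p>Evaluation metric (2) can be made concrete with a small sketch (our illustration; `gap` is a hypothetical helper), using the MVC-2000 objectives reported above, where IPGPT's 365.1 is the best result.</p>

```python
# Sketch of the gap metric: relative objective difference w.r.t. the best
# result, in percent (for a minimization objective). Illustration only.

def gap(obj, best_obj):
    """100 * |obj - best| / |best|."""
    return 100.0 * abs(obj - best_obj) / abs(best_obj)

best = 365.1                        # IPGPT's MVC-2000 objective
print(round(gap(392.5, best), 2))   # Gurobi vs. the best result
print(round(gap(375.8, best), 2))   # RL-LNS vs. the best result
```
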
      <sec id="sec-2-10">
        <title>4.3. Anytime Performance</title>
        <p>We further showcase the anytime performance of various algorithms, including random LNS in this experiment, to facilitate an easier comparison between random LNS and IPGPT across the four benchmarks, as illustrated in Figure 3. Our observations indicate that: (1) IPGPT significantly outperforms the other baselines with a noteworthy margin. (2) Even with extended time limits, IPGPT’s advantage persists. (3) Interestingly, even though IPGPT is trained on trajectories derived from random LNS, it can generate remarkably superior trajectories during testing, leading to substantially improved performance compared to random LNS. This mirrors the findings in [21], where the Decision Transformer (DT) can generate optimal trajectories during test time, even when trained on random walk data for the shortest pathfinding problem.</p>
        <sec id="sec-2-11">
          <title>4.4. Additional Experiments with SCIP</title>
          <p>Our framework can integrate any ILP solver to enhance incumbent solutions. We primarily conducted experiments with Gurobi, given its status as a leading ILP solver. Additionally, we also present results utilizing SCIP (v6.0.1) as an alternative ILP solver. By employing the same settings as detailed in Section 4.1 and applying them to the four benchmarks, we display the results in Table 6. These outcomes align with those observed when using Gurobi as the ILP solver, albeit with SCIP exhibiting a notably lower performance compared to Gurobi.</p>
        </sec>
        <sec id="sec-2-12">
          <title>4.5. Testing on Real-World Instances in MIPLIB</title>
          <p>We follow the experimental settings for real-world instances in MIPLIB as described in [8]. We exclude “easy” instances with relatively small sizes, as well as instances where Gurobi cannot find any feasible solutions within a 3600-second time limit. Consequently, we choose 35 representative “hard” or “open” instances containing only integer variables. Within these instances, the number of variables ranges from 150 to 393,800 (averaging 49,563), and the number of constraints varies from 301 to 850,513 (averaging 96,778). We use the datasets in section 4.1 to train our model and evaluate our model (with Gurobi as the repair solver) on this realistic dataset, in the style of active search [33, 8, 34] on each instance. Our findings indicate that, with a 3600-second time limit, IPGPT surpasses both solvers on 20 of the 35 instances and exhibits comparable performance on 10 of the 35 instances. More details are provided in the appendix.</p>
        </sec>
      </sec>
      <sec id="sec-3">
        <title>5. Conclusion</title>
        <p>Addressing large-scale IP problems presents a formidable challenge. An increasingly prevalent approach for the automated design and tuning of IP solvers leverages data-driven methodologies. This paper concentrates on enhancing learning-based LNS approaches, given their ability to conveniently utilize any existing solver as a subroutine. We introduce IPGPT, a novel approach that models policy learning as a sequence to an MLC problem. It seamlessly integrates a customized decision transformer encoder, encompassing a causal transformer, to model the sequential processes of LNS, and an MLC decoder with contrastive learning to exploit correlations in variable selection. Furthermore, we carry out comprehensive experiments on four diverse benchmarks and real-world instances. The results suggest that our IPGPT
        References: [2] W.-Y. Ku, J. C. Beck, Mixed integer programming models for job shop scheduling: A computational analysis, Computers &amp; Operations Research 73 (2016) 165–173. [3] D. Chen, Y. Bai, S. Ament, W. Zhao, D. Guevarra, L. Zhou, B. Selman, R. B. van Dover, J. M. Gregoire, C. P. Gomes, Automating crystal-structure phase mapping by combining deep learning with constraint reasoning, Nature Machine Intelligence 3 (2021) 812–822. [4] S. Gollowitzer, I. Ljubić, MIP models for connected facility location: A theoretical and computational study, Computers &amp; Operations Research 38 (2011) 435–449. [5] R. M. Karp, Reducibility among combinatorial problems, in: Complexity of Computer Computations, Springer, 1972, pp. 85–103. [6] H. A. Taha, Integer programming: theory, applications, and computations, Academic Press, 2014. [7] Y. Bengio, A. Lodi, A. Prouvost, Machine learning for combinatorial optimization: a methodological tour d’horizon, European Journal of Operational Research 290 (2021) 405–421. [8] Y. Wu, W. Song, Z. Cao, J. Zhang, Learning large neighborhood search policy for integer programming, Advances in Neural Information Processing Systems 34 (2021) 30075–30087. [9] J. Song, Y. Yue, B. Dilkina, et al., A general large
approach consistently delivers substantial improvements neighborhood search framework for solving integer
over state-of-the-art solvers and exhibits excellent gener- linear programs, Advances in Neural Information
alization capabilities for larger instances. Processing Systems 33 (2020) 20012–20023.
[10] N. Sonnerat, P. Wang, I. Ktena, S. Bartunov, V. Nair,
Acknowledgments Learning a large neighborhood search algorithm
for mixed integer programs, arXiv preprint
The work of Caihua Liu is supported and funded by arXiv:2107.10201 (2021).
the Humanities and Social Sciences Youth Foundation, [11] V. Nair, M. Alizadeh, et al., Neural large
neighborMinistry of Education of the People’s Republic of China hood search, in: Learning Meets Combinatorial
(Grant No.21YJC870009). Algorithms at NeurIPS2020, 2020.
[12] D. Pisinger, S. Ropke, Large neighborhood search,
in: Handbook of metaheuristics, Springer, 2010, pp.</p>
        <p>References 399–419.
[13] S. Ropke, D. Pisinger, An adaptive large
neighborhood search heuristic for the pickup and delivery
problem with time windows, Transportation
science 40 (2006) 455–472.
[1] J. Mula, R. Poler, J. P. García-Sabater, F. C. Lario,</p>
        <p>Models for production planning under uncertainty:
A review, International journal of production
economics 103 (2006) 271–285.
[14] L. Perron, P. Shaw, V. Furnon, Propagation guided
large neighborhood search, in: International Conference on
Principles and Practice of Constraint Programming,
Springer, 2004, pp. 468–481.</p>
        <p>[15] D. Dumez, F. Lehuédé, O. Péton, A large
neighborhood search approach to the vehicle routing problem
with delivery options, Transportation Research Part B:
Methodological 144 (2021) 103–132.</p>
        <p>[16] R. S. Sutton, A. G. Barto, Reinforcement
learning: An introduction, MIT press, 2018.</p>
        <p>[17] R. J. Williams, Simple statistical
gradient-following algorithms for connectionist
reinforcement learning, Machine learning 8 (1992)
229–256.</p>
        <p>[18] F. Torabi, G. Warnell, P. Stone, Behavioral
cloning from observation, in: Proceedings of the 27th
International Joint Conference on Artificial Intelligence,
2018, pp. 4950–4957.</p>
        <p>[19] L. P. Kaelbling, M. L. Littman, A. W. Moore,
Reinforcement learning: A survey, Journal of artificial
intelligence research 4 (1996) 237–285.</p>
        <p>[20] S. Ross, D. Bagnell, Efficient reductions for
imitation learning, in: Proceedings of the thirteenth
international conference on artificial intelligence and
statistics, JMLR Workshop and Conference Proceedings, 2010,
pp. 661–668.</p>
        <p>[21] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover,
M. Laskin, P. Abbeel, A. Srinivas, I. Mordatch, Decision
transformer: Reinforcement learning via sequence modeling,
Advances in neural information processing systems 34 (2021)
15084–15097.</p>
        <p>[22] M. Gasse, D. Chételat, N. Ferroni, L. Charlin,
A. Lodi, Exact combinatorial optimization with graph
convolutional neural networks, Advances in Neural
Information Processing Systems 32 (2019).</p>
        <p>[23] S. Zhang, H. Tong, J. Xu, R. Maciejewski, Graph
convolutional networks: a comprehensive review,
Computational Social Networks 6 (2019) 1–23.</p>
        <p>[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention
is all you need, Advances in neural information processing
systems 30 (2017).</p>
        <p>[25] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,
I. Sutskever, et al., Language models are unsupervised
multitask learners, OpenAI blog 1 (2019) 9.</p>
        <p>[26] J. Bai, S. Kong, C. P. Gomes, Gaussian mixture
variational autoencoder with contrastive learning for
multi-label classification, in: International Conference on
Machine Learning, PMLR, 2022, pp. 1383–1398.</p>
        <p>[27] D. P. Kingma, J. Ba, Adam: A method for
stochastic optimization, arXiv preprint arXiv:1412.6980
(2014).</p>
        <p>[28] P. Erdős, A. Rényi, et al., On the evolution of
random graphs, Publ. Math. Inst. Hung. Acad. Sci 5 (1960)
17–60.</p>
        <p>[29] R. Albert, A.-L. Barabási, Statistical mechanics
of complex networks, Reviews of modern physics 74 (2002)
47.</p>
        <p>[30] E. Balas, A. Ho, Set covering algorithms using
cutting planes, heuristics, and subgradient optimization: a
computational study, Springer, 1980.</p>
        <p>[31] K. Leyton-Brown, M. Pearson, Y. Shoham, Towards a
universal test suite for combinatorial auction algorithms,
in: Proceedings of the 2nd ACM conference on Electronic
commerce, 2000, pp. 66–76.</p>
        <p>[32] D. Liu, M. Fischetti, A. Lodi, Learning to search
in local branching, in: Proceedings of the AAAI Conference
on Artificial Intelligence, volume 36, 2022,
pp. 3796–3803.</p>
        <p>[33] I. Bello, H. Pham, Q. V. Le, M. Norouzi,
S. Bengio, Neural combinatorial optimization with
reinforcement learning, arXiv preprint arXiv:1611.09940
(2016).</p>
        <p>[34] E. Khalil, H. Dai, Y. Zhang, B. Dilkina, L. Song,
Learning combinatorial optimization algorithms over graphs,
Advances in neural information processing systems 30
(2017).</p>
        <p>[35] E. Larsen, S. Lachapelle, Y. Bengio, E. Frejinger,
S. Lacoste-Julien, A. Lodi, Predicting solution summaries
to integer linear programs under imperfect information with
machine learning, arXiv preprint arXiv:1807.11876
(2018).</p>
        <p>[36] V. Nair, S. Bartunov, F. Gimeno, I. von Glehn,
P. Lichocki, I. Lobov, B. O'Donoghue, N. Sonnerat,
C. Tjandraatmadja, P. Wang, et al., Solving mixed integer
programs using neural networks, arXiv preprint
arXiv:2012.13349 (2020).</p>
        <p>[37] C. K. Joshi, T. Laurent, X. Bresson, An efficient
graph convolutional network technique for the travelling
salesman problem, arXiv preprint arXiv:1906.01227
(2019).</p>
        <p>[38] Y. Deng, S. Kong, C. Liu, B. An, Deep attentive
belief propagation: Integrating reasoning and learning for
solving constraint optimization problems, arXiv preprint
arXiv:2209.12000 (2022).</p>
        <p>[39] Y. Razeghi, K. Kask, Y. Lu, P. Baldi, S. Agarwal,
R. Dechter, Deep bucket elimination, in: IJCAI, 2021,
pp. 4235–4242.</p>
        <p>[40] Y. Deng, S. Kong, B. An, Pretrained cost model for
distributed constraint optimization problems, in:
Proceedings of the AAAI Conference on Artificial
Intelligence, volume 36, 2022, pp. 9331–9340.</p>
        <p>[41] S. Hochreiter, J. Schmidhuber, Long short-term
memory, Neural computation 9 (1997) 1735–1780.</p>
        <p>[42] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to
sequence learning with neural networks, Advances in neural
information processing systems 27 (2014).</p>
        <p>[43] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova,
BERT: Pre-training of deep bidirectional transformers for
language understanding, arXiv preprint</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A. Ablation Study</title>
      <p>[Ablation comparison of IPGPT⊖MLC, IPGPT⊖DT, and the
full IPGPT; table not recovered.]</p>
    </sec>
    <sec id="sec-4">
      <title>D. Other Related Work</title>
      <p>Learning to Optimize. Recently, there has been an
increasing interest in applying ML to learn to solve COPs.
Broadly speaking, there are three categories of
learning-to-optimize algorithms: (1) Learning to predict
solutions from inputs. Larsen et al. [35] train a deep
neural network (DNN) to predict the solution of a
stochastic load planning problem. Nair et al. [36] propose
neural diving, which learns a DNN to generate multiple
partial assignments for the integer variables; the
resulting smaller mixed integer programs (MIPs) over the
unassigned variables are solved with an off-the-shelf MIP
solver to construct high-quality joint assignments. Joshi
et al. [37] learn a DNN by supervision to predict the
probability of an edge being in the traveling salesman
problem (TSP) tour; a feasible tour is then generated by
beam search. (2) Learning to configure COP algorithms. Liu
et al. [32] learn to configure the search neighborhood size
of LB in each step by using RL. Deng et al. [38] integrate
belief propagation (BP), gated recurrent units (GRUs), and
graph attention networks (GATs) within the message-passing
framework to reason about dynamic weights and damping
factors for composing new BP messages. (3) Learning
alongside COP algorithms. Nair et al. [11] learn a DNN to
make variable selection decisions in B&amp;B to bound the
objective value gap with a small tree. Deep Bucket
Elimination (DBE) [39] uses DNNs to approximate the large
bucket functions. Deng et al. [40] propose a pretrained
cost model which predicts the optimal cost of a given
partially instantiated COP; the predicted cost is then used
to construct heuristics for various COP algorithms such as
LNS and B&amp;B. Our work belongs to the third category.</p>
      <p>Sequence Learning. The recent rapid development of
sequence modeling is largely due to the successful
applications of DNNs, from LSTMs [41] and
sequence-to-sequence models [42] to transformer
architectures with self-attention [24]. Sequence learning
aims to capture the temporal dependence of sequential data
(text, speech, video, etc.), and it is widely used in NLP
[43, 44]. There is also a recent attempt to apply sequence
learning in scientific discovery, such as using a causal
transformer to model material properties with sequential
structures [45]. In light of this, it is tempting to
consider how such sequence models can lead to improved
performance in LNS, which is also concerned with sequential
processes. A causal transformer is an architecture for
efficiently modeling sequences, and it is the cornerstone
of the decision transformer in RL [21]. However, little has
been done to model the sequential processes of LNS with
transformers; we address this limitation by adopting a
causal transformer (GPT2).</p>
      <p>Multi-Label Classification. Multi-label classification
(MLC) is a prediction task where each sample can have more
than one label. Unlike the single-label scenario, label
correlations are prevalent in MLC. Early works capture the
correlations through classifier chains [46], Bayesian
inference [47], and dimensionality reduction [48]. Thanks
to the huge capacity of DNNs, one can alleviate the
laborious feature mapping and therefore focus on the loss
function and on modeling feature-label and label-label
correlations [49, 50, 51]. It has been shown that
contrastive learning can exploit label information
effectively in a data-driven manner and learn meaningful
feature and label embeddings that capture the label
correlations and enhance the predictive power [26].
However, current LNS algorithms fail to explore the
correlations of variable selection in each step. In this
work, we formulate the policy learning of variable
selection as an MLC problem and adopt contrastive learning
to model category-level label correlations. To the best of
our knowledge, we are the first to use sequence to
multi-label learning to improve the performance of LNS,
which enables us to benefit from modeling long sequences of
LNS behaviors and from exploiting correlations in variable
selection simultaneously.</p>
      <p>Primal Heuristics. Numerous primal heuristic
algorithms have been proposed to enhance the efficiency of
solving ILPs [52]. Primal heuristics span from simpler
rounding heuristics [53] to more computationally demanding
diving and large neighborhood search (LNS) heuristics, such
as Relaxation Induced Neighborhood Search (RINS) [54]. LNS
heuristics are improvement heuristics that solve auxiliary
problems using the branch-and-bound technique.</p>
      <p>RINS, a prominent LNS meta-heuristic, seeks to improve
a given feasible MIP solution. By comparing the feasible
solution with one obtained from relaxing the integer
variables, it identifies the variables whose values differ
between the two and frees them for reoptimization. The
resulting sub-MIP is then solved using a MIP solver.</p>
      <p>The solution-polishing heuristic [55] employs a
variable-fixing neighborhood similar to RINS, while also
integrating an evolutionary-algorithm approach. Unlike
RINS, this heuristic uses crossover and mutation operations
to combine multiple solutions chosen from a pool of
available feasible solutions. The crossover process fixes
variables that have identical values across all selected
solutions, while mutation is introduced by randomly fixing
additional variables to refine already high-quality
solutions.</p>
      <p>Adaptive LNS [56] capitalizes on an ensemble of LNS
algorithms for MIPs and employs a multi-armed bandit to
adaptively switch among them during a MIP solve. Although
our work does not focus on an ensemble approach, IPGPT
could be incorporated as another ensemble member to enhance
performance.</p>
      <p>In contrast, learning-based LNS approaches, such as
IPGPT, can be regarded as primal heuristics automatically
learned through machine learning. These approaches showcase
significant potential by exploiting data-driven techniques,
which ultimately results in improved performance and
adaptability across a wide range of problem instances. This
work is particularly interested in advancing the
capabilities of learning-based LNS approaches.</p>
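<p>The RINS fixing step described above can be sketched in a few
lines; this is a toy illustration with our own function names
and a tolerance we chose for the example (real implementations
operate on the MIP solver's data structures):</p>

```python
def rins_fixing(incumbent, relaxation, tol=1e-6):
    """Return the variable fixings RINS would apply.

    incumbent:  integer-feasible values, one per integer variable.
    relaxation: values of the same variables in the LP relaxation optimum.
    Variables whose values agree (within tol) are fixed to the incumbent
    value; variables with differing values stay free in the sub-MIP.
    """
    fixed = {}
    free = []
    for i, (x_inc, x_rel) in enumerate(zip(incumbent, relaxation)):
        if abs(x_inc - x_rel) > tol:
            free.append(i)       # differing value: reoptimize this variable
        else:
            fixed[i] = x_inc     # agreeing value: fix to the incumbent

    return fixed, free

# Variables 0 and 3 agree between incumbent and relaxation, so they
# are fixed; the sub-MIP reoptimizes only variables 1 and 2.
fixed, free = rins_fixing([1, 0, 1, 0], [1.0, 0.4, 0.7, 0.0])
```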
    </sec>
    <sec id="sec-5">
      <title>E. Further Implementation Details</title>
      <p>We implemented our GPT-2 using the Huggingface
Transformers library. The hyperparameters we employed are
as follows: (1) Number of layers: 3; (2) Number of
attention heads: 1; (3) Embedding dimension: 128; (4)
Nonlinearity function: ReLU; (5) Batch size: 128; (6) Context
length K: 25; (7) Dropout ratio: 0.1; (8) Learning rate:
1e-4; (9) Gradient norm clipping: 0.25. We maintained
other parameters at their default values. We trained the
model from scratch and did not utilize any pre-trained
weights.</p>
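<p>For reference, the listed hyperparameters can be collected as a
configuration sketch. The mapping to Huggingface GPT2Config
argument names is our assumption (the snippet only builds plain
dictionaries, so it runs without the library installed):</p>

```python
# Model hyperparameters from the text, keyed by the corresponding
# GPT2Config argument names (our mapping, an assumption).
gpt2_kwargs = dict(
    n_layer=3,                    # (1) number of layers
    n_head=1,                     # (2) number of attention heads
    n_embd=128,                   # (3) embedding dimension
    activation_function="relu",   # (4) nonlinearity
    resid_pdrop=0.1,              # (7) dropout ratio, applied to all
    embd_pdrop=0.1,               #     three dropout sites
    attn_pdrop=0.1,
)

# Training-loop settings from the same list.
train_kwargs = dict(
    batch_size=128,      # (5)
    context_length=25,   # (6) context length K
    learning_rate=1e-4,  # (8)
    grad_norm_clip=0.25, # (9)
)

# With a single head, attention operates over the full 128-dim embedding.
head_dim = gpt2_kwargs["n_embd"] // gpt2_kwargs["n_head"]
# The model would be built as GPT2Model(GPT2Config(**gpt2_kwargs));
# all remaining parameters stay at their library defaults, as in the text.
```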
      <p>Grid search is adopted for tuning. We tune the learning
rate from 0.00005 to 0.002 with interval 0.00005, the
dropout ratio from [0.05, 0.1, 0.3, 0.5], the weight decay
from [0, 0.01, 0.0001], the loss coefficient from
[0.1, 0.5, 1, 1.5, 2.0], the token embedding size from
[64, 128, 256, 512], the context length K from
[15, 20, 25, 30], the batch size from [64, 128, 256], and
the gradient norm clipping from
[0.15, 0.2, 0.25, 0.3, 0.35].</p>
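<p>The search space above can be enumerated with a standard grid.
A minimal sketch follows; the key names are our own shorthand
(including "coef" for the coefficient whose symbol was lost in
extraction), and in practice one would sample a subset of the
product rather than evaluate it exhaustively:</p>

```python
import itertools

# Hyperparameter grids as described above (names are our shorthand).
grid = {
    "lr": [round(5e-5 * i, 5) for i in range(1, 41)],  # 0.00005..0.002
    "dropout": [0.05, 0.1, 0.3, 0.5],
    "weight_decay": [0, 0.01, 0.0001],
    "coef": [0.1, 0.5, 1, 1.5, 2.0],
    "embed_dim": [64, 128, 256, 512],
    "context_len": [15, 20, 25, 30],
    "batch_size": [64, 128, 256],
    "grad_clip": [0.15, 0.2, 0.25, 0.3, 0.35],
}

def grid_configs(grid):
    """Yield one config dict per point of the full Cartesian product."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# Size of the full product: 40 * 4 * 3 * 5 * 4 * 4 * 3 * 5 = 576,000
# points, which is why only a fraction of configurations is evaluated.
n_configs = 1
for values in grid.values():
    n_configs *= len(values)
```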
      <p>In the data collection process, we utilize random LNS
with an adaptive neighborhood size. The neighborhood size
is initially set to 10% of the number of variables in the
input problem instance and is then adapted following the
approach described in [10].</p>
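<p>A minimal sketch of such an adaptive scheme is given below. The
growth/shrink factors and bounds here are illustrative
assumptions of ours, not the exact schedule of [10]; only the 10%
initialization comes from the text:</p>

```python
def adapt_neighborhood_size(size, improved, n_vars,
                            grow=1.5, shrink=0.67,
                            min_frac=0.01, max_frac=0.5):
    """Adapt the LNS neighborhood size between steps.

    Illustrative rule (factors are our assumption): grow the
    neighborhood when the last repair step failed to improve the
    incumbent, shrink it after an improvement to keep sub-problems
    cheap, and clamp the result to sane bounds.
    """
    size = int(size * (shrink if improved else grow))
    lo = max(1, int(min_frac * n_vars))
    hi = int(max_frac * n_vars)
    return min(max(size, lo), hi)

n_vars = 1000
size = int(0.10 * n_vars)  # initial neighborhood: 10% of the variables
# One step without improvement enlarges the neighborhood.
size = adapt_neighborhood_size(size, improved=False, n_vars=n_vars)
```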
      <p>During testing, at each step, the model autoregressively
generates an action distribution for each dimension. We
apply a threshold of 0.5 to convert these values into 1 or
0, representing the selection or non-selection of the
corresponding variable at that step.</p>
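<p>As a sketch of this decoding step (the function and variable
names are ours, not taken from the released code), the
per-dimension probabilities are mapped to a binary selection
mask:</p>

```python
def decode_selection(probs, threshold=0.5):
    """Convert per-variable action probabilities into a 0/1 mask.

    probs: floats in [0, 1], one per integer variable, produced
    autoregressively by the model at the current LNS step. A 1 in
    the output marks a variable selected for reoptimization.
    """
    return [1 if p >= threshold else 0 for p in probs]

# Example: four variables, two of which clear the 0.5 threshold and
# are therefore destroyed and reoptimized by the repair solver.
mask = decode_selection([0.9, 0.2, 0.55, 0.1])
```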
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>