<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CILC</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>GPU-Accelerated Propagation for the Stable Marriage Constraint⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Travasci</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Tardivo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Formisano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Gruppo Nazionale per il Calcolo Scientifico - INdAM</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>New Mexico State University, Dept. of Computer Science</institution>
          ,
          <addr-line>Las Cruces, NM</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Udine</institution>
          ,
          <addr-line>DMIF</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>40</volume>
      <fpage>25</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>The Stable Marriage Problem (SMP) consists of finding a stable matching between two equally sized disjoint sets, typically referred to as men and women, based on individual preference lists. A matching is stable if no man and woman prefer each other over their assigned partners. The Gale-Shapley algorithm is the classical polynomial-time solution to this problem. In Constraint Programming (CP), the Stable Marriage Constraint (SMC) encapsulates this problem as a global constraint, with a consistency-enforcing propagator derived from an extended version of the Gale-Shapley algorithm. In this paper, we present a GPU-accelerated propagator for the SMC and its integration into a CP solver. Experimental results against the sequential version demonstrate the potential of GPU acceleration in handling large instances of the stable marriage constraint.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Stable Marriage</kwd>
        <kwd>Global Constraint</kwd>
        <kwd>Constraint Programming</kwd>
        <kwd>Parallel Programming</kwd>
        <kwd>GPU Computing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Several variations of SMP exist. In the SMP with indiference , the preference lists are not strictly
ordered, allowing ties between elements [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The preference lists may be incomplete [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], meaning that
some partners are deemed unacceptable and that some couples are not admissible. Variations also
exist in which each element of one set may be matched with more members of the other set, i.e., the
solution sought for is a many-to-many relation. A similar problem is the stable roommates problem,
in which there is only one set and the preferences are referred to the other members of the such [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Generalizations have been proposed that involve three sets of elements [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. We refer the reader to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
0
1
2
3
4
3
0
0
2
0
4
1
4
1
1
1
3
3
4
4
2
2
2
3
2
      </p>
      <p>
        SMP and its variants find wide application in various fields such as computer science, mathematics,
biology and physics [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. SMP variants have been applied also in economics to study markets and
monetary exchange models [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or to relate social networks and job markets [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Systems based on
SMP have been used to various solve real-word problems, for example, to assign medical students
to hospitals by the National Resident Matching Program in the US [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], to match students to high
schools [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], to solve the problems of college admissions, medical resident allocation, and optimal
assignment of sailors to billets, to mention some among many (cf., for instance, [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
        ] and the
references therein).
      </p>
      <p>
        The bursting advent of GPGPU (General Purpose computing on Graphics Processing Units) has
recently driven the growth of GPU-computing and, more broadly, heterogeneous computing across
numerous application domains. In the field of Computational Logic, research has shown that the massive
parallelism and extreme computational power provided by GPUs can be used to accelerate reasoning and
problem-solving tasks more efectively than traditional CPU-oriented solvers. For instance, approaches
to SAT- and ASP-solving have been put forth in [
        <xref ref-type="bibr" rid="ref16 ref17 ref18 ref19">16, 17, 18, 19</xref>
        ], while [
        <xref ref-type="bibr" rid="ref20 ref21 ref22">20, 21, 22</xref>
        ] describe encouraging
outcomes when using GPUs for CSP-solving. In keeping with this research, in this paper we introduce
a GPU-accelerated propagator for the Stable Marriage constraint. As far as we are aware, this is the
ifrst study to use GPUs to speed up the propagation of such a global constraint.
      </p>
      <p>The paper is organized as follows. In Section 2 we briefly review the Gale–Shapley algorithm which
is the basis for the serial propagator described in Section 3. A GPU-accelerated implementation of the
propagator is detailed in Section 4. Experimental comparison with a serial version of the propagator is
reported on in Section 5. Finally, we draw our conclusions in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The Extended Gale–Shapley Algorithm</title>
      <p>
        The Gale–Shapley algorithm [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], also known as deferred acceptance algorithm, can be seen as a sequence
of proposals from the men to the women. A man who is not already engaged proposes to his most
preferred woman to whom he has not proposed yet. The woman accepts the proposal if she is single,
or if it comes from a man she prefers more than her current partner. Starting with all individuals
unmatched, the algorithm iteratively proceeds until every man is engaged. This process creates a
so-called man-optimal stable matching, in which each man is engaged to his most preferred woman that
he can have. Note that a man-optimal stable matching is woman-pessimal, meaning that each woman is
matched with her least preferred man that she can have [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. By switching the roles of men and women
so that the women propose, the algorithm yields a woman-optimal and man-pessimal solution [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>
        Algorithm 1: The extended Gale–Shapley algorithm (from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ])
1 mark each person as free
2 while there is a free man  do
3 let  be the first woman in ’s list
4 if  is engaged to some man  then mark  as free
5 mark  and  as engaged to each other
6 for each successor ′ of  in ’s list do
7 delete ′ from the preference list of 
8 delete  from the preference list of ′
      </p>
      <p>
        Several algorithms have been proposed for finding a balanced solution, based on diferent optimization
criteria [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Consider the procedure outlined above. We can observe that if woman  receives a proposal from
man , then ’s partner in the final stable matching (which is man-optimal/woman-pessimal) cannot
be less preferred than . Hence, each ′ successor of  in ’s list cannot be paired with  in any
solution. Consequently, we can safely remove ′ from ’s preference list and  from that of ′.
This observation leads to an improved algorithm called extended Gale-Shapley algorithm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], shown in
Algorithm 1.
      </p>
      <p>Note that in this algorithm proposals are always accepted. Moreover, when the algorithm ends, each
man is engaged with the first remaining woman in his list, and each woman with her last remaining
man. These final lists are known as the man-oriented Gale-Shapley lists (MGS-lists). Intersecting these
lists with those obtained by executing the algorithm with men and women switched, produces the
GS-lists. Every partner a person  can have in any stable matching occurs in ’s GS-list.</p>
    </sec>
    <sec id="sec-3">
      <title>3. CSP and the Stable Marriage Constraint</title>
      <p>A constraint satisfaction problem (CSP) is a triple  = ⟨, , ⟩, where  = {1, . . . , } is a finite
set of  variables,  is the set of (finite) domains for the  variables, and  is the set of constraints
over . Let dom() denote the domain of the variable . A constraint  on a set of variables var() =
{1 , . . . , ℎ } ⊆  describes a subset rel() ⊆ dom(1 ) × · · · × dom(ℎ ).</p>
      <p>
        A solution of  is an assignment  = [1/1, . . . , /] of values to the variables in  that satisfies
each  ∈ , i.e., assuming var() = {1 , . . . , ℎ }, it holds that ⟨1 , . . . , ℎ ⟩ ∈ rel(). An assignment
is locally consistent w.r.t. a set  ⊆  if it satisfies each  ∈  such that var() ⊆  . A CSP on  is
k-consistent [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] if, given any set  = {1 , . . . , − 1 } ⊆  of  − 1 variables, for each assignment
[1 /1 , . . . , − 1 /− 1 ] which is locally consistent w.r.t.  , for any -th variable  there exists a
value  such that [1 /1 , . . . ,  / ] is locally consistent w.r.t.  ∪ {}.
      </p>
      <p>
        The Stable Marriage Constraint (SMC) modelling a given SMP  was introduced in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Enforcing
2-consistency of such constraint corresponds to the application of the extended Gale–Shapley algorithm
on  . That is, each solution for the SMC describes a stable matching of  .
      </p>
      <p>
        Let us briefly describe the data structures and the basic functions introduced by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to deal with a
SMC involving  men and  women.
      </p>
      <p>As we will see, neither engaged couples nor the single/engaged status of men/women are explicitly
stored. Instead, the propagator works by removing from the domains the same values that Algorithm 1
would remove from the preference lists. This is achieved by examining the changes in the domains
(i.e., removals of elements) and removing the pairs that can be safely eliminated as a result of these
modifications, following the same logic as the extended Gale-Shapley algorithm. Note that the removal
of elements from domains can also occur as a result of propagations of other constraints (present in the
CSP model at hand), i.e. by actions “external” to the SMC propagator.</p>
      <p>The preference lists are stored into two 2-dimensional arrays (i.e. matrices),  and . Hence, the
array [] stores the preference list for the the -th man , so that [][] is the -th preference
Algorithm 2: Stable marriage propagation
1 Function deltaMin (i)
2 ℓ ← [][getMin([])]
3 setMax([ℓ], wPm[ℓ][])
4 for  ← [] to getMin([]) − 1 do
5  ← [][]
6 setMax([], wPm[][] − 1)
7</p>
      <p>[] ← getMin([])
8 Function deltaMax (j)
9 for  ← getMax([]) + 1 to [] do
10  ← [][]
11 remVal([], mPw [][])
12</p>
      <p>[] ← getMax([])
13 Function removeValue (i,a,isMan)
14 if isMan then
15  ← [][]
16 remVal([], wPm[][])
17 else
18  ← [][]
19 remVal([], mPw [][])
20 Function init ()
21 for  ← 0 to  − 1 do deltaMin()
22 Function inst (i,isMan)
23 if isMan then
24 for  ← [] to getVal([]) − 1 do
25  ← [][]
26 setMax([], wPm[][] − 1)
of the -th man . Analogously, women’s preference lists are stored in . The constraint makes
use of two arrays of variables  and . The domains of these variables are initially set to [0,  − 1] and
the value of each variable [] (resp., []) represents the partner of the man  (resp., the woman ).
More specifically, a value ℓ for a variable [] represents the ℓ-th preference in ’s preference list, i.e.,
an assignment of ℓ to [] describes the match (,[][ℓ]), coupling  and [][ℓ]. To easily switch
between preferences in the domains and their corresponding people, two additional matrices mPw and
wPm are used to store the inverse preference lists. Then, mPw [] is ’s inverse preference list and the
value mPw [][] is the index of  in ’s preference list, and similarly for wPm. Two additional arrays
of integers,  and , are also used. The former stores the previous upper bounds of the domains of
the variables in , that is the upper bounds they had when the previous propagation ended. Similarly,
 contains the previous lower bounds of the variables in . The values in  are initialized to  − 1,
the ones in  to 0.</p>
      <p>
        The original propagation algorithm is specified in terms of a set of functions to call when some variable
changes [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In our implementation (cf., Algorithm 2) we extended the code of removeValue() and inst()
so that both men and women are processed, contrary to what was stated in their original descriptions,
where only men are treated. While the modification of inst() does not afect the correctness of the
algorithm, the change in removeValue() is indeed necessary. The functions are:
deltaMin(). Called when the minimum of dom([]) increases (i.e., either  has been rejected by
its preferred partner or the minimum has been changed externally to the propagator). The new
favourite partner is ℓ (line 2 in Algorithm 2), and the function updates the maximum of dom(ℓ)
to remove all men ℓ likes less that  (line 3). Moreover, in lines 4–6, for each value  that has
been removed from dom([]) the corresponding woman  is determined and the maximum of
dom([]) is updated to exclude men less preferred that []. Finally, in line 7, the lower bound
[] for  is updated.
deltaMax(). Called when the maximum of dom( ) decreases. The woman  is removed from the
domain of each man  that has been removed from dom( ) (lines 9–11 of Algorithm 2). The
upper bound [] for  is updated accordingly (line 12).
removeValue(, ,  ). Called when a value  is removed from dom([]) (resp., dom([])) and
 is not the minimum (resp., maximum) of the domain of the man  (resp., woman ). Then, 
(resp., ) is removed from the domain of the person  (resp.,  ) corresponding to the value .
inst(,  ). This function is called when a variable is bound to a specific value, namely, a match
between two persons  and  is created. It handles the two symmetric cases in which  represents
either a man or a woman. Whenever  is engaged with a partner , the function propagates the
decision imposing that  is engaged to  (lines 28 and 37).
init(). This function is called once, when the propagator is initialized. It consists in a call to deltaMin()
for each man . It is somewhat analogous to setting all the men free so that they will make
proposals at the start of the extended Gale-Shapley algorithm.
      </p>
      <p>
        These functions make use of some auxiliary simple operations on domains. Namely, those to get the
upper/lower value of a dom() (i.e., getMin() and getMax()); to get/set the value of  (i.e., getVal()
and setVal(, )); to decrease the maximum of dom() to  if  &lt; getMax() (i.e., setMax(, )); and
to remove a value  from dom() (i.e., remVal(, )). We omit their description (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for details).
      </p>
      <p>
        The authors of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] suggest how the functions in Algorithm 2 can be integrated into a hypothetical
CSP-solver to implement the propagation procedure. We will return to this point in Section 4.1, while
describing the integration of the GPU-accelerated propagator for SMC into MiniCPP.
      </p>
      <sec id="sec-3-1">
        <title>3.1. The MiniCP Solver and its Extension Mechanism</title>
        <p>
          Among several eficient CSP-solvers, for the purpose of this paper we have chosen MiniCP [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], an
open source solver designed to ease the development of extensions. In particular, our work is based on
MiniCPP [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], a C++ re-engineering of MiniCP supporting CUDA code integration. Previous successful
attempts [
          <xref ref-type="bibr" rid="ref20 ref21 ref22">20, 21, 22</xref>
          ] to extend MiniCPP with GPU-based modules strongly motivated this choice.
        </p>
        <p>The main engine of MiniCPP performs the constraint-based search in the usual manner, by
alternating search and propagation phases, possibly involving backtracking steps. Whenever a
consistencyenforcing procedure is triggered, the corresponding method is responsible for starting the propagation
and reporting the results to the main engine.</p>
        <p>The possibility of extending MiniCPP with a new propagator is transparent to the way it is
implemented, it sufices that it complies with the solver interface. Indeed, new constraint propagators
can be easily integrated by extending the Constraint class of MiniCPP. To handle a new constraint,
the extension must provide a constructor (to allocate and initialize the main data structures) and two
procedures, post() and propagate(). The former is called by MiniCPP whenever a constraint is added
to the model to set the conditions under which the propagation algorithm, i.e., the propagate() method,
is called. This latter method is responsible to transparently ofload the computation on the GPU and to
suitably manage the related data transfers.</p>
        <p>Note that MiniCPP provides backtrackable data structures, meaning that when the solver backtracks
it reverts them to their values at the choice point to which the search backtracks, undoing any change
made in the meantime. This can be achieved with an apposite class provided by the solver. As we will
see, we exploit this feature in implementing our parallel propagator.</p>
        <p>
          As concerns the extension of the input syntax for the CSP-solver, we first introduced the definition
stable_matching(⟨⟩, ⟨⟩, ⟨menPrefLists ⟩, ⟨womenPrefLists ⟩)
to represent a SMC in the input MiniZinc models. Then, we relied on the MiniZinc annotation mechanism
to select the propagator to be used [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. In detail, when a constraint is annotated with ::gpu, it is
propagated using the GPU-accelerated implementation. Figure 1 shows a fragment of a MiniZinc
encoding for the SMP in Table 1, using the syntax extension just described.
        </p>
        <p>int: n = 5;
array [1..5] of var 0..n-1: men;
array [1..5] of var 0..n-1: women;
array [int, int] of 0..n-1: pm = [|3,4,1,2,0|0,1,3,2,4|0,4,3,2,1|2,1,4,3,0|0,1,4,2,3|];
array [int, int] of 0..n-1: pw = [|1,3,4,0,2|2,3,0,4,1|0,2,1,4,3|0,3,2,1,4|1,4,0,3,2|];
constraint stable_matching(men, women, pm, pw) ::gpu;</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. The Parallel Propagator: SMP on GPU</title>
      <p>
        As mentioned, to extend MiniCPP with a new constraint, it sufices to specialize the class Constraint.
Our serial and parallel implementations are available at https://clp.dimi.uniud.it/sw, and fully
documented [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. In what follows, we focus on the GPU-accelerated version, in particular on the propagate()
method. For the convenience of the reader, a brief introduction to GPU computing and CUDA [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] is
provided in appendix A.
      </p>
      <p>Unlike the extended Gale-Shapley algorithm, the parallel implementation of the propagator for SMC
proceeds by processing the proposals of all free men simultaneously. This is done by launching a
thread for every free man. During the first execution of the propagator, all men are considered free. In
subsequent iterations, the free men are those whose domain minimum has been modified, corresponding
to the situation where their most preferred woman is no longer available. Each thread, independently,
makes the proposal for the man it was assigned. Such proposal may cause a previously engaged man to
break up and become free. The thread that handled the proposal will then take responsibility for the
newly freed man and begin making proposals on his behalf. If a proposal does not free any man, the
thread terminates. Note that each proposal can free at most one man, since a woman can be engaged to
at most one man. Therefore, in this operation, the total number of free men cannot increase, since each
proposal either reduces that number or leaves it unchanged. Consequently, at each propagator run, it is
suficient to start a thread for each currently free man.</p>
      <p>If a missing value  is found during the processing of a free man, the SMC propagator must pretend
that the proposal was made anyway, accepted, and the engagement was immediately broken. This is
necessary because a pair removed externally (i.e., by another constraint or the solver) could still be a
blocking pair (cf., Section 1), so the pairs it blocks must be removed from the domains. In presence of
such a value , not only the man who made the proposal remains free, but also another man may be
freed. In this situation the total number of free men could increase.</p>
      <p>For values in the men’s domains removed by the propagator that are beyond the minimum, the
corresponding values in the women’s domains are all greater than the current woman’s maximum.
When, instead, a value is removed from outside the propagator, the corresponding value in the woman’s
domain may be smaller than the domain maximum. In this case, the woman would have accepted the
proposal, forming a pair (, ) that would be a blocking pair for all matchings in which  is engaged
to a woman he likes less than , and  is engaged to a man she likes less than . The missing value in
the domain of  was found because the new minimum is now greater than it, so  will necessarily be
engaged to a woman he prefers less than . It follows that  cannot be engaged with a man she likes
less than , otherwise the resulting matching would be unstable. The values of the men that  likes
less than  must be removed from her domain. Since this removal changes the woman’s maximum, it
could potentially free up another man. To deal with this problem, the additional men freed this way are
stored in an array, and the propagator is launched again to process them.</p>
      <p>Note that, by applying the strategy just described, the parallel implementation runs only the necessary
number of threads and does not require any synchronization between them.</p>
      <p>Data structures: The propagator uses the same data structures described in Section 3 for the
preference and inverse-preference lists (stored in simple matrices) and the domains (represented as
bitmaps). Some additional data structures are also used. In particular:
• Four backtrackable arrays old _min_men, old _max _men, old _min_women and
old _max _women, store the old bounds the variables had after the latest propagation.
(These arrays play the role of xlb and yub seen in Section 3.)
• The arrays stack _mod _men and stack _mod _women store the lists of the men and the women
whose domains have been modified in any way since the latest propagation. The variables
length_men_stack and length_women_stack store their lengths. Note that values that are
missing from the domains (for example, as a consequence of previous propagation of other constraints,
or because they were not in the domain from the beginning) are treated as modifications of the
domains, suitably initializing these arrays.
• The array stack _mod _min_men collects the men whose minimum was changed. The integer
length_min_men_stack stores their number.
• As mentioned, when handling an externally removed value, a thread might end up freeing an
additional man. This man is added to an array new _stack _mod _min_men, whose current length
is new _length_min_men_stack .</p>
      <p>The propagator kernels: The method propagate() essentially corresponds to the execution, managed
by the host, of the following three phases:
1. Enforcing domains coherency: If a value was removed from a person’s domain, that person is
also removed from the domain of the person represented by that value.
2. The actual enforcing of the stable marriage constraint: This is where the strategy previously
described is implemented. As previously mentioned, this step may be repeated several times
3. Update of auxiliary data structures (see below).</p>
      <p>It is possible that an empty domain is detected in phase 2. Then, phase 3 will act accordingly to pass the
information to the solver (which will either backtrack or declare unsatisfiability).</p>
      <p>More specifically:
Phase 1. This step aims to restore domain consistency of the variables for men and women. As
mentioned, when domain changes occur outside the SMC propagator, it is possible for a partner  to
be removed from the domain of a person , but not vice versa. This kernel make_domains_coherent
restores consistency, if necessary, by removing  from the domain of . This removal could modify the
minimum of a man’s domain. In this case, such man is added to stack _mod _min_men. The kernel
is run with length_men_stack +length_women_stack threads, one for each person whose domain
was modified (and listed in stack _mod _men and stack _mod _women). Each thread simply scans the
domain associated to its person looking for missing values and acts as explained.</p>
      <p>Phase 2. This step performs the core part of the propagation by enforcing the constraint logic. After
an initialization step performed on the host, a grid of length_min_men_stack threads executing the
kernel apply_sm_constraint() is launched (see Algorithm 3). In this execution each thread is assigned
a man retrieved from the array stack _mod _min_men (line 2). Each thread starts a loop to process
elements of the man’s domain, starting from the value old _min_men[]. The array old _min_men
keeps track of the latest values of the men’s domains that were processed. In line 4 each thread retrieves
the first element to be checked. It may be the case that the current domain is empty (this is discovered by
checking whether w _index is beyond the domain boundary), then the thread exits (line 5). Otherwise
two scenarios are possible. Either w _index is in the domain and then it will be processed (lines 6–17),
or it has been removed and the missing value must be handled (lines 18–25) as explained earlier.</p>
      <p>In the first case, the thread/man makes a proposal to the woman  corresponding to w _index (lines 7–
9). In line 9 an atomic access is performed to update the partner of  (i.e., to set max _women[] to the
index m_val of  in ’s preference list). There may be diferent threads/men concurrently trying to
make this update. Therefore, to ensure that only the proposer who is preferred over the current partner
and all other proposers succeeds, we use an atomic instruction that coherently updates max _women[]
and returns its old value p_val (i.e., the index of ’s previous partner in her preference list).</p>
      <p>Algorithm 3: The kernel function apply_sm_constraint() (simplified)
1 get thread’s unique 
2  ← ___[] /* get the (first) man assigned to this thread */
3 loop forever
4 _ ← __[] /* get the next value/woman to process */
5 if dom() = ∅ then return
6 else if _ ∈ dom() then /* the value/woman is still in ’s domain */
7  ← [][_] /* get  corresponding to _ */
8 _ ← wPm[][] /* get ’s value _ in ’s domain */
9 _ ← atomicMin(_[], _) /* the best proposal succedes */
10 if _ &lt; _ then /*  prefers another man and rejects the proposal */
11 __[] ← _ + 1
12 continue /* proceed to the next iteration of the loop */
13 else if _ = _ then /*  is already engaged to  */
14 return
15 else
16
17</p>
      <p>/* _ &gt; _: the proposal has been accepted */
remove  from the domains of all the man following  in her list
 ← [][_] /* set  to be the man previously engaged with  */
__[] ← _ + 1
 ← [][_] /* get  corresponding to _ */
_ ← wPm[][] /* get ’s value _ in ’s domain */
atomicMin(_[], _− 1) /* remove men following  in ’s list */
remove  from the domains of the men who were just removed from her list
if a man  was freed and is not in ____ then /* prepare for the */
add  to ____ /* next call to apply_sm_constraint() */
The proposal may be rejected (line 10). This may happen because another concurrent thread removed
w _index from ’s domain (because, such thread made a proposal to  on behalf of a man ′ that 
prefers to ). It follows that a value that was thought to be in the domain of  has to be considered
as missing. The thread simply moves on to the next value in ’s domain (after having updated
old _min_men[] appropriately in line 11 to trigger the next iteration in the loop).</p>
      <p>If  is already engaged with  (line 13), no man has been freed, so the thread terminates.</p>
      <p>If the proposal is accepted (that is, m_val is smaller than ’s previous maximum and it becomes the
new maximum), then  is removed from the domains of all the men following  in ’s preference list,
up to and including ’s old maximum (line 16). If a man has been freed, the thread will treat such man
as its newly assigned man (the update of  in line 17 will trigger another iteration of the loop).</p>
      <p>In the case in which w _index is a missing value (the test in line 6 fails because it has been removed
from dom(), as mentioned above), the thread must remove from the domain of the woman associated
with w _index all men following  in her list/domain (line 22). If this step actually removes some
men from her domain, the woman must be deleted from their domains too (for brevity this is shown
succinctly in line 23 of Algorithm 3). The value of old _min_men[] is updated accordingly (line 19)
to avoid processing again the same woman. If ’s maximum was actually modified (line 22), the old
maximum must be added to new _stack _mod _min_men (lines 24–25), so that apply_sm_constraint()
can be launched again and handle him appropriately. Note that this step is simplified in Algorithm 3.
The real implementation avoids multiple additions of the same elements to new _stack _mod _min_men
by exploiting some auxiliary data structures and using atomic accesses to such data.</p>
      <p>When apply_sm_constraint() ends the host retrieves the updated data structures. In particular, the
array new _stack _mod _min_men is inspected. If some men occur in it, the host prepares for another
call to the kernel to process them. This process is repeated until no free men exist.
Phase 3. The aim of this phase is to coherently update in parallel all elements of the arrays
old _max _women, old _max _men, and old _min_women. This task is performed by the kernel
function finalize_changes(). The grid is composed of  threads. The -th thread processes the -th man
and the -th woman. The -th thread updates the -th woman’s values in old _max _women and in
old _min_women, and the -th man’s value in old _max _men. When looking for the current bounds
to put in old _max _men (resp., old _min_women), the thread scans the domain starting from the
corresponding bound stored in max _men (resp., min_women). For lack of space, we omit the complete
code of finalize_changes().</p>
      <sec id="sec-4-1">
        <title>4.1. Integration into MiniCPP</title>
        <p>
          In order to enable the integration of the functions of Algorithm 2 into a generic CSP-solver, the authors
of [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] assume that constraint propagation is managed with a stack where the constraints to propagate
are collected. Such constraints are extracted and propagated one by one, possibly causing some variable
domains to shrink. This, in turn, causes all the constraints the variable is involved in to be added to
the stack. The propagation process ends when there are no more constraints to propagate or a domain
becomes empty.
        </p>
        <p>MiniCPP has such design, but embodies trade-ofs that prioritize simplicity over extensive feature
support. One example is that it does not track which event (i.e., bound change or element removal)
caused the insertion into the stack. Because of this, when the propagator for SMC is invoked, it checks
all men’s and women’s variables for changes. When a change is found, the method to be called is added
to an internal queue. Then, the queue is processed, executing all the needed methods. These, in turn,
may cause further additions to the internal queue.</p>
        <p>Another trade-of concerns how domain changes are tracked. MiniCPP ofers various methods to
check whether a variable has been modified (e.g., changed() and changedMin()), but these refer to the
last time the variable was backed up (i.e., at the end of the previous propagation phase), not to the last
time the variable was actually modified. Whenever a propagator is run twice between two of these time
points, variables marked as modified before the first run will continue to be considered as modified
before the second run. Additionally, when a propagator changes one of the variables it monitors,
it is re-scheduled and will see that variable as modified. Consequently, using the built-in methods
would result in redundant calls to the propagator. To avoid this, we exploited backtrackable arrays
whose implementation is built in MiniCPP. In particular, the arrays __ and __ store
the dimensions of the men’s and women’s domains after the previous propagation of the SMC. By
comparing their current sizes with their old ones, it is possible to detect whether a variable has been
modified since the latest execution of the propagator.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Evaluation</title>
      <p>To compare the performance of the serial and parallel implementations, we randomly generated 200
instances of size  = 2400. Each instance includes two SMCs: One modelling the standard SMP, and
another with the roles of men and women swapped. Our experiments focus on propagation time and
aim to identify the conditions under which each propagator performs more efectively.</p>
      <p>The machine used in these tests is equipped with an Intel Core i7-13700KF CPU (with 16 cores and
24 threads in Hyper-Threading, CPU clock rate 5.40GHz, 30MB cache), 128GB of RAM, and an NVIDIA
GeForce RTX 4090 GPU (compute capability 8.9, 24GB VRAM, 16384 CUDA cores, clock rate 2.57GH).</p>
      <p>The first parameter of interest is the number of free men at the beginning of phase 2 of the method
propagate(). The design of the instances is such that all men are free in the first propagation, while
in the following propagations either two or no men are free. Figure 2 shows the distribution of the
propagation times in these three diferent situations. When there are no or only a few free men, the
serial propagator outperforms the parallel one, with median execution times of 0.44 ms and 0.61 ms,
respectively. In contrast, the parallel version yields median times of 1.36 ms and 8.07 ms under the
same conditions. The significant diference between these two parallel timings is due to an optimization
that avoids launching the kernel for the second phase when there are no free men. When all men are
free, the parallel propagator tends to be faster, with a median propagation time of 72.34 ms, compared
to 162.34 ms for the serial version.</p>
      <p>Figure 3 compares the execution times of the first 100 propagations for each of the three cases, for
both serial and parallel versions. When there are exactly two free men, the serial propagator consistently
outperforms the parallel one. Conversely, when all men are free, the parallel propagator shows a clear
performance advantage. Interestingly, when no free men remain, the results show that the parallel
version performs better in many instances, but there is no clear winner.</p>
      <p>The total number of modified variables is another useful parameter to consider in order to characterize
the class of instances in which the parallel version performs better. The instances are designed such that
this number varies only when there are no free men. Figure 4 shows how the propagation time varies
with respect to the number of modified variables. For illustrative purposes, the plot displays only a
subset of data points, selected to uniformly cover the full range of the number of modified variables. The
serial propagator is faster when the number of modified variables is relatively small, while the parallel
propagator becomes more eficient as the number of modified variables increases. Note that the parallel
propagation time remains nearly constant across all instances, demonstrating strong scalability. This
behavior is mainly because of the optimization that skips the launch of the kernel for the second phase.
More specifically, in the case of few free men, the limited parallelization is not enough to compensate
the overhead of launching the kernel, and may even prove detrimental due to the small number of
threads launched. Having skipped the second phase, the eficiency of the propagator depends only
on the performances of the other phases. Moreover, a larger number of modified variables allows for
greater parallelization of the first phase, resulting in a corresponding increase in performance.</p>
      <p>A final parameter we considered is the total domain size, computed as the sum of the sizes of all
domains. Specifically, we are interested in the old sizes, the domain sizes after the previous propagation
of the constraint. Each modified domain is scanned from its old lower bound to its old upper bound, so
the old size provides an estimate of the number of values the propagators may need to examine.</p>
      <p>Figure 5 shows the results, with points selected to uniformly cover the x-axis. The case for  = 2400
free men is omitted, as the total domain size at the first propagation step is always  *  * 2, resulting in
a plot that resembles the rightmost one in Figure 2. When there are no free men, the parallel propagator
scales significantly better than the sequential. This is because larger total domains involve more work,
which can be efectively parallelized, helping to ofset the various overheads. In contrast, when there
are two free men, the sequential propagator performs better.</p>
      <p>Overall, the performance of the parallel propagator aligns with expectations. The more free men
there are, the more advantageous it becomes. A notable exception is the case with no free men, where
the parallel propagator can skip launching the kernel that executes the second phase. In this scenario,
the parallel propagator demonstrates good scalability with respect to both the number of modified
variables and the total domain size.</p>
      <p>It is worth noting that the second phase is the most complex step of the parallel propagator.
Consequently, ineficiencies in this phase have a greater impact on the overall propagation time than any
performance gains achieved in the other phases.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>This paper revisits the Stable Marriage constraint from a parallel perspective and demonstrates that
CSP-solvers can efectively leverage the computational power of modern GPUs.</p>
      <p>It presents a GPU-accelerated implementation of the constraint and compares it against a sequential
version. The results demonstrate that the GPU-accelerated propagator ofers better scalability, enabling
solvers to handle larger instances more eficiently.</p>
      <p>As future work, an interesting extension would be a hybrid propagator capable of automatically
switching between sequential and GPU-accelerated propagators based on performance considerations.
This mechanism could be implemented using insights from our analysis and determine threshold
parameters specific to the machine on which the hybrid propagator runs.</p>
      <p>Another valuable direction would be to extend the propagator to its symmetrical version. In this case,
proposals are made by both men and women, and enforcing the SMC produces the GS-Lists instead of
the MGS-Lists. This enables more values to be removed during propagation, at the cost of an increased
computational workload. The use of a GPU in this context is especially beneficial, as it allows for both
faster and stronger propagation.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>A. CUDA and GPU Computing, in a Nutshell</title>
      <p>
        GPU-computing can be seen as the use of a graphics processing unit (GPU) to accelerate computations
traditionally handled by the central processing unit (CPU). NVIDIA introduced the Compute Unified
Device Architecture (CUDA), a general-purpose programming library that allows programmers to ignore
the underlying graphics concepts in favor of high-performance computing concepts [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. It has been
successfully used to accelerate computations in a variety of fields, such as physics, bioinformatics, and
machine learning, to mention some among many.
      </p>
      <p>The architecture of a GPU (cf., Fig. 6) includes a main memory (DRAM), typically external to the
chip and used as global memory, an L2 cache, and an array of streaming multiprocessors. Each streaming
multiprocessor contains a small amount of fast memory integrated into the chip, used as an L1 cache
or shared memory, and several CUDA cores, responsible for the actual execution of instructions. The
architecture is designed to allow fast context switching of lightweight threads.</p>
      <p>The conceptual model underlying the parallelism supported by CUDA is the Single-Instruction
Multiple-Thread (SIMT), in which the same instruction is executed by diferent threads, while data and
operands can vary from thread to thread.</p>
      <p>A CUDA program includes parts intended to execute on the CPU (also referred to as host) and
parts intended to execute concurrently with the GPU (usually referred to as device). In the simplest
possible interaction between host and device, the host code transfers data to the device global memory,
launches a sequence of parallel computations on the device, and retrieves the results from device
memory. Commands are sent to the device from the host in a stream and are executed in-order. Multiple
streams can be used to run concurrent parallel computations on the device. The CUDA library supports
interaction, synchronization, and communication between the host and the device. Each computation
on the device is described as a set of concurrent threads (a grid, in CUDA terminology), each of which
performs the same function (called a kernel). Each thread is executed by a CUDA core; these threads are
organized hierarchically into warps (groups of 32 threads that execute in lock-step mode) and thread
blocks. Each block is assigned for execution to a streaming multiprocessor. Threads in a warp/block are
intended to execute the same instruction on diferent data. If the control flow between threads within a
warp diverges, their execution is serialized. The global memory of the device is accessible to all threads,
while threads in the same block can access shared memory at high speed.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gusfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Irving</surname>
          </string-name>
          ,
          <article-title>The Stable marriage problem - structure and algorithms</article-title>
          , Foundations of computing series, MIT Press, Cambridge, MA, USA,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Unsworth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prosser</surname>
          </string-name>
          ,
          <article-title>An n-ary constraint for the stable marriage problem</article-title>
          ,
          <source>in: Proc. of the Fifth Int. Workshop on Modelling and Solving Problems with Constraints</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Irving</surname>
          </string-name>
          ,
          <article-title>Stable marriage and indiference</article-title>
          , Discret. Appl. Math.
          <volume>48</volume>
          (
          <year>1994</year>
          )
          <fpage>261</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Irving</surname>
          </string-name>
          ,
          <article-title>An eficient algorithm for the “stable roommates” problem</article-title>
          ,
          <source>J. Alg</source>
          .
          <volume>6</volume>
          (
          <year>1985</year>
          )
          <fpage>577</fpage>
          -
          <lpage>595</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Knuth</surname>
          </string-name>
          ,
          <article-title>Mariages Stables et leurs relations avec d'autres problèmes combinatoires</article-title>
          , Les Presses de l'Université de Montréal,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Hirschberg</surname>
          </string-name>
          ,
          <article-title>Three-dimensional stable matching problems</article-title>
          ,
          <source>SIAM J. Discret. Math. 4</source>
          (
          <year>1991</year>
          )
          <fpage>245</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Iwama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Miyazaki</surname>
          </string-name>
          ,
          <article-title>A survey of the stable marriage problem and its variants</article-title>
          ,
          <source>in: Proc. of ICKS'08</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Iwama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Manlove</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Miyazaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Morita</surname>
          </string-name>
          ,
          <article-title>Stable marriage with incomplete lists and ties</article-title>
          , in: J.
          <string-name>
            <surname>Wiedermann</surname>
            ,
            <given-names>P. van Emde</given-names>
          </string-name>
          <string-name>
            <surname>Boas</surname>
          </string-name>
          , M. Nielsen (Eds.),
          <source>Proc. of ICALP'99</source>
          , volume
          <volume>1644</volume>
          <source>of LNCS</source>
          , Springer,
          <year>1999</year>
          , pp.
          <fpage>443</fpage>
          -
          <lpage>452</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Fenoaltea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. B.</given-names>
            <surname>Baybusinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.-C. Zhang,</surname>
          </string-name>
          <article-title>The stable marriage problem: An interdisciplinary review from the physicist's perspective</article-title>
          ,
          <source>Physics Reports</source>
          <volume>917</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Biró</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klijn</surname>
          </string-name>
          ,
          <article-title>Matching with couples: a multidisciplinary survey</article-title>
          ,
          <source>Int. Game Theory Review</source>
          <volume>15</volume>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Arcaute</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vassilvitskii</surname>
          </string-name>
          ,
          <article-title>Social networks and stable matchings in the job market</article-title>
          , in: S. Leonardi (Ed.),
          <source>Proc. of WINE'09</source>
          , volume
          <volume>5929</volume>
          <source>of LNCS</source>
          , Springer,
          <year>2009</year>
          , pp.
          <fpage>220</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Austin</surname>
          </string-name>
          ,
          <article-title>The stable marriage problem and school choice</article-title>
          ,
          <source>AMS Feature Column</source>
          (
          <year>2015</year>
          ). Online; accessed 4-March-
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Biró</surname>
          </string-name>
          ,
          <article-title>Applications of matching models under preferences</article-title>
          , in: U. Endriss (Ed.), Trends in Computational Social Choice,
          <source>AI Access</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>345</fpage>
          -
          <lpage>373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Manlove</surname>
          </string-name>
          , Algorithmics of Matching Under Preferences, volume
          <volume>2</volume>
          of Series on Theoretical Computer Science, World Scientific,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A. O.</given-names>
            <surname>Sotomayor</surname>
          </string-name>
          ,
          <string-name>
            <surname>Two-Sided Matching</surname>
          </string-name>
          :
          <article-title>A Study in Game-Theoretic Modeling and Analysis</article-title>
          ,
          <source>Econometric Society Monographs</source>
          , Cambridge University Press,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>A. D. Palù</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Dovier</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Formisano</surname>
          </string-name>
          , E. Pontelli, CUD@SAT:
          <article-title>SAT solving on GPUs</article-title>
          ,
          <source>J. Exp. Theor. Artif. Intell</source>
          .
          <volume>27</volume>
          (
          <year>2015</year>
          )
          <fpage>293</fpage>
          -
          <lpage>316</lpage>
          . doi:
          <volume>10</volume>
          .1080/0952813X.
          <year>2014</year>
          .
          <volume>954274</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dovier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Formisano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pontelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vella</surname>
          </string-name>
          ,
          <article-title>Parallel execution of the ASP computation - an investigation on GPUs</article-title>
          , in: M. D.
          <string-name>
            <surname>Vos</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Eiter</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Lierler</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Toni</surname>
          </string-name>
          (Eds.),
          <source>Proc. of ICLP</source>
          <year>2015</year>
          ,
          <article-title>Technical Communications</article-title>
          , volume
          <volume>1433</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dovier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Formisano</surname>
          </string-name>
          , E. Pontelli,
          <article-title>Parallel answer set programming</article-title>
          , in: Y.
          <string-name>
            <surname>Hamadi</surname>
          </string-name>
          , L. Sais (Eds.),
          <source>Handbook of Parallel Constraint Reasoning</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>282</lpage>
          . doi:
          <volume>10</volume>
          .1007/ 978-3-
          <fpage>319</fpage>
          -63516-
          <issue>3</issue>
          _
          <fpage>7</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dovier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Formisano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vella</surname>
          </string-name>
          ,
          <article-title>GPU-based parallelism for ASP-solving</article-title>
          , in: P. Hofstedt,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abreu</surname>
          </string-name>
          , U. John,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kuchen</surname>
          </string-name>
          , D. Seipel (Eds.),
          <source>Proc. of DECLARE'19</source>
          , volume
          <volume>12057</volume>
          <source>of LNCS</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>23</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -46714-
          <issue>2</issue>
          _
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tardivo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dovier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Formisano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Michel</surname>
          </string-name>
          , E. Pontelli,
          <article-title>Constraint propagation on GPU: a case study for the cumulative constraint, Constraints An Int</article-title>
          . J.
          <volume>29</volume>
          (
          <year>2024</year>
          )
          <fpage>192</fpage>
          -
          <lpage>214</lpage>
          . doi:
          <volume>10</volume>
          .1007/ S10601-024-09371-W.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tardivo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dovier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Formisano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Michel</surname>
          </string-name>
          , E. Pontelli,
          <article-title>Constraint propagation on GPU: a case study for the alldiferent constraint</article-title>
          ,
          <source>J. Log. Comput</source>
          .
          <volume>33</volume>
          (
          <year>2023</year>
          )
          <fpage>1734</fpage>
          -
          <lpage>1752</lpage>
          . doi:
          <volume>10</volume>
          .1093/LOGCOM/ EXAD033.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tardivo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Michel</surname>
          </string-name>
          , E. Pontelli,
          <article-title>CP for bin packing with multi-core and GPUs</article-title>
          , in: P. Shaw (Ed.),
          <source>Proc. of CP'24</source>
          , volume
          <volume>307</volume>
          of LIPIcs,
          <source>Schloss Dagstuhl - Leibniz-Zentrum für Informatik</source>
          ,
          <year>2024</year>
          , pp.
          <volume>28</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          :
          <fpage>19</fpage>
          . doi:
          <volume>10</volume>
          .4230/LIPICS.CP.
          <year>2024</year>
          .
          <volume>28</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Shapley</surname>
          </string-name>
          ,
          <article-title>College admissions and the stability of marriage</article-title>
          ,
          <source>The American Mathematical Monthly</source>
          <volume>69</volume>
          (
          <year>1962</year>
          )
          <fpage>9</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rossi</surname>
          </string-name>
          , P. van Beek, T. Walsh (Eds.),
          <source>Handbook of Constraint Programming, volume 2 of Foundations of Artificial Intelligence , Elsevier</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schaus</surname>
          </string-name>
          ,
          <string-name>
            <surname>P. Van Hentenryck</surname>
          </string-name>
          ,
          <article-title>MiniCP: a lightweight solver for constraint programming</article-title>
          ,
          <source>Mathematical Programming Computation</source>
          <volume>13</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gentzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <surname>W. J. van Hoeve</surname>
          </string-name>
          ,
          <article-title>HADDOCK: a language and architecture for decision diagram compilation</article-title>
          , in: H.
          <string-name>
            <surname>Simonis</surname>
          </string-name>
          (Ed.),
          <source>Proc. of CP'20</source>
          , volume
          <volume>12333</volume>
          <source>of LNCS</source>
          , Springer,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Travasci</surname>
          </string-name>
          ,
          <article-title>A GPU-based Parallel Propagator for the Stable Marriage Constraint</article-title>
          ,
          <source>Master's thesis</source>
          , University of Udine, Italy,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>NVIDIA</given-names>
            <surname>Corporation</surname>
          </string-name>
          ,
          <source>NVIDIA CUDA Zone</source>
          ,
          <year>2025</year>
          . https://developer.nvidia.com/cuda-zone.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>NVIDIA</surname>
          </string-name>
          ,
          <string-name>
            <surname>CUDA C+</surname>
          </string-name>
          <article-title>+: Programming Guide</article-title>
          , NVIDIA Press, Santa Clara, CA,
          <year>2025</year>
          . Version 12.9.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>