=Paper=
{{Paper
|id=Vol-2454/paper_51
|storemode=property
|title=Learning Bit by Bit: Extracting the Essence of Machine Learning
|pdfUrl=https://ceur-ws.org/Vol-2454/paper_51.pdf
|volume=Vol-2454
|authors=Sascha Mücke,Nico Piatkowski,Katharina Morik
|dblpUrl=https://dblp.org/rec/conf/lwa/MuckePM19
}}
==Learning Bit by Bit: Extracting the Essence of Machine Learning==
Sascha Mücke, Nico Piatkowski, and Katharina Morik
TU Dortmund, AI Group, Dortmund, Germany
http://www-ai.cs.tu-dortmund.de

Abstract. Data mining and machine learning research has led to a wide variety of training methods and algorithms for different types of models. Many of these methods solve or approximate NP-hard optimization problems at their core, using vastly different approaches, some algebraic, others heuristic. This paper demonstrates another way of solving these problems by reducing them to quadratic polynomial optimization problems on binary variables. This class of parametric optimization problems is well researched and powerful, and offers a unifying framework for many relevant ML problems that can all be tackled with one efficient solver. Because of the friendly domain of binary values, such a solver lends itself particularly well to hardware acceleration, as we further demonstrate in this paper by evaluating our problem reductions on FPGAs.

Keywords: Machine Learning · Optimization · Hardware Acceleration

1 Introduction

Hardware acceleration for machine learning usually involves GPU implementations that perform fast linear algebra to speed up numerical computations. Our approach differs from GPU programming in that we make use of a fixed class C of parametric optimization problems that we solve directly on efficient specialized hardware. Solving a problem instance in C thus amounts to finding the correct parameters β and feeding them to the solver.

The underlying idea of using a non-universal compute architecture for machine learning is not new: state-of-the-art quantum annealers rely on the very same principle of parametric problem classes. There, optimization problems are encoded as potential energy between qubits; the global minimum of a loss function can be interpreted as the quantum state of lowest energy [4]. The fundamentally non-deterministic nature of quantum methods makes the daunting task of traversing an exponentially large solution space feasible. However, their practical implementation is a persisting challenge, and the development of actual quantum hardware is still in its infancy. The latest flagship, the D-Wave 2000Q, can handle problems with 64 fully connected bits (see https://www.dwavesys.com/sites/default/files/mwj_dwave_qubits2018.pdf), which is by far not sufficient for realistic problem sizes. Nevertheless, the particular class of optimization problems that quantum annealers can solve is well understood, which motivates its use for hardware accelerators outside of the quantum world.

2 Boolean Optimization

A pseudo-Boolean function (PBF) is any function f : B^n → R that assigns a real value to a fixed-length binary vector. Every PBF on n binary variables can be uniquely expressed as a polynomial of some degree d ≤ n with real-valued coefficients [2]. Quadratic Unconstrained Binary Optimization (QUBO) is the problem of finding an assignment x* of n binary variables that is minimal with respect to a Boolean polynomial of degree 2:

  x^* = \arg\min_{x \in \mathbb{B}^n} \sum_{i=1}^{n} \beta_i x_i + \sum_{i=1}^{n} \sum_{j=1}^{i} \beta'_{ij} x_i x_j ,    (1)

where β_i and β'_ij are real-valued parameters that are fixed for a given problem instance.
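To make the QUBO formulation concrete, the following minimal Python sketch evaluates the objective of eq. (1) and finds x* by exhaustive enumeration. It is purely illustrative: the helper names are our own, it is unrelated to the hardware solver described later, and exhaustive search is only feasible for very small n.

```python
# Illustrative only: evaluate the QUBO objective of eq. (1) and find the
# minimizer by brute force over B^n (feasible only for tiny n).
import itertools
import numpy as np

def qubo_objective(x, beta_lin, beta_quad):
    """sum_i beta_i x_i + sum_i sum_{j<=i} beta'_ij x_i x_j (beta_quad lower-triangular)."""
    return float(beta_lin @ x + x @ beta_quad @ x)

def brute_force_qubo(beta_lin, beta_quad):
    """Exact minimizer over all 2^n binary vectors."""
    n = len(beta_lin)
    best_x, best_val = None, np.inf
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits)
        val = qubo_objective(x, beta_lin, beta_quad)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# tiny random instance with n = 3
rng = np.random.default_rng(0)
beta_lin = rng.integers(-5, 5, size=3).astype(float)
beta_quad = np.tril(rng.integers(-5, 5, size=(3, 3))).astype(float)
print(brute_force_qubo(beta_lin, beta_quad))
```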
As x · x = x for all x ∈ B, the linear coefficients β_i can be integrated into the quadratic parameters β'_ii by setting β_ii = β'_ii + β_i, which makes for a simpler formula and leaves us with a lower triangular matrix β:

  \sum_{i=1}^{n} \sum_{j=1}^{i} \beta_{ij} x_i x_j    (2)

It has been shown that all higher-degree pseudo-Boolean optimization problems can be reduced to quadratic problems [2]. For this reason, a variety of well-known optimization problems like (Max-)3SAT and prime factorization, but also ML-related problems like clustering, maximum-a-posteriori (MAP) estimation in Markov random fields, and binary constrained SVM learning can be reduced to QUBO or its Ising variant (where x ∈ {−1, +1}^n).

3 Evolutionary QUBO Solver

If no specialized algorithm is known for a particular hard combinatorial optimization problem, randomized search heuristics like simulated annealing or evolutionary algorithms (EAs) provide a generic way to generate good solutions. Inspired by biological evolution, EAs apply recombination or mutation to a set of "parent" solutions to produce a set of "offspring" solutions. A loss function, also called fitness function in the EA context, is used to select those solutions which constitute the next parent generation. This process is repeated until convergence or until a pre-specified time budget is exhausted [8].

Motivated by the inherently parallel nature of digital circuits, we developed a highly customizable (µ + λ)-EA FPGA hardware architecture implemented in VHDL. Here, customizable means that different types and sizes of FPGA hardware can be used. Moreover, our hardware synthesizer allows the end user to choose the maximal problem dimension n, the number of parent solutions µ, the number of offspring solutions λ, and the number of bits per coefficient β_ij. In the case of a low-budget FPGA, this allows us to allocate more FPGA resources either for parallel computation (µ and λ) or for the problem size (n and β). We used this hardware optimizer to evaluate the QUBO and Ising model embeddings presented in the following sections.
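For readers who want to experiment in software, the following sketch mimics a (µ + λ)-EA for QUBO minimization as described above, operating on the folded coefficient matrix of eq. (2). All names and implementation details (mu, lam, kappa, budget, the mutation and selection scheme) are our own simplifications; the actual VHDL architecture differs.

```python
# A minimal software analogue of a (mu + lambda)-EA for QUBO minimization.
# Defaults mu = 2, lam = 30 mirror the EA parameters reported in Table 1.
import numpy as np

def qubo_value(x, beta):
    # beta is the folded lower-triangular matrix of eq. (2)
    return float(x @ beta @ x)

def mu_plus_lambda_ea(beta, mu=2, lam=30, kappa=None, budget=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = beta.shape[0]
    kappa = kappa if kappa is not None else 1.0 / n        # mutation rate
    parents = rng.integers(0, 2, size=(mu, n))
    for _ in range(budget):
        # each offspring copies a random parent and flips each bit with prob. kappa
        idx = rng.integers(0, mu, size=lam)
        flips = (rng.random((lam, n)) < kappa).astype(parents.dtype)
        offspring = parents[idx] ^ flips
        # (mu + lambda) selection: keep the mu best of parents and offspring
        pool = np.vstack([parents, offspring])
        fitness = np.array([qubo_value(x, beta) for x in pool])
        parents = pool[np.argsort(fitness)[:mu]]
    return parents[0]
```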
4 Exemplary Learning Tasks

With an efficient solver for QUBO and the Ising model at hand, solving a specific optimization problem reduces to finding an efficient embedding into the domain of n-dimensional binary vectors and devising a method for calculating β such that the minimizing vector x* corresponds to an optimal solution of the original problem. The actual optimization step then amounts to uploading β to the hardware solver, which approximates a global optimum; decoded back into the original domain, this yields an approximately optimal solution to the original problem. As we will show, multiple valid embeddings may exist for the same problem class, but their effectiveness differs from instance to instance.

4.1 Clustering

A prototypical data mining problem is k-means clustering, which is already NP-hard for k = 2. To derive the coefficients β for an Ising model solving the 2-means clustering problem, we can use the method devised in [1], where each binary variable σ_i ∈ {+1, −1} indicates whether the corresponding data point x^i ∈ D belongs to cluster +1 or cluster −1; the problem dimension is thus n = |D|. First, we assume that D has zero mean, such that \sum_i x^i = 0. Next, we compute the Gram matrix G, where every entry is an inner product of two data points:

  G_{ij} = \langle x^i, x^j \rangle

As shown in [1], minimizing an Ising model with G as coupling matrix corresponds to maximizing the between-cluster scatter under the assumption that the clusters are approximately of equal size. Optionally, a kernel function can be used instead of the inner product, and the resulting kernel matrix can be centered to maintain the zero-mean property in the resulting feature space. For simplicity we stick to the linear kernel and assume that G is already centered.

As our hardware optimizer encodes β as signed b-bit integers (with b ≤ 32), we have to round the entries of G, which are real-valued. To minimize the loss of precision, we scale the parameters to use the full range of b bits before rounding. As the overall objective function is a linear combination of β, scaling all coefficients by α is the same as multiplying the objective function by α, so the position of the optimum is unaffected. The final formula for a single coefficient is

  \beta_{ij} = \lfloor \alpha G_{ij} + 0.5 \rfloor \quad \text{with} \quad \alpha = \frac{2^{b-1} - 1}{\max_{i,j} |G_{ij}|} .

Notice that there is no practical difference in precision between integer and fixed-point representation, as the latter is nothing more than a scaled integer representation to begin with. Exemplary optimization results using this method on the UCI datasets Iris and Sonar are shown in Fig. 1.

Fig. 1. QUBO loss value over time with different mutation rates, each averaged over 10 runs. Uncertainty indicated by transparent areas. Top-left: 2-means on Iris (n = 150, d = 4). Top-right: 2-means on Sonar (n = 208, d = 61). Bottom-left: MRF-MAP with edge encoding. Bottom-right: MRF-MAP with vertex encoding.
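The coefficient construction for 2-means can be written down in a few lines. The sketch below is our reading of the formulas above (center the data, build the linear-kernel Gram matrix, quantize to signed b-bit integers); the function name and the default b are our own choices, not the authors' code.

```python
# Sketch: Ising/QUBO couplings for 2-means clustering (Section 4.1).
import numpy as np

def two_means_coefficients(X, b=16):
    X = X - X.mean(axis=0)                      # enforce the zero-mean assumption
    G = X @ X.T                                 # linear-kernel Gram matrix
    alpha = (2 ** (b - 1) - 1) / np.abs(G).max()
    return np.floor(alpha * G + 0.5).astype(np.int64)   # beta_ij = floor(alpha * G_ij + 0.5)

# usage: feed the returned matrix to an Ising solver over {-1, +1}^n;
# the sign of the i-th spin in the minimizer assigns data point i to a cluster.
```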
4.2 Support Vector Machine

A linear support vector machine (SVM) is a classifier that, in the simplest case, takes a labeled dataset D = {(x^i, y_i) | 1 ≤ i ≤ n} ⊆ X × Y of two classes (Y = {+1, −1}) and tries to separate them with a hyperplane [3]. As there may be infinitely many such hyperplanes, an additional objective is to maximize the margin, i.e. the area around the hyperplane containing no data points, in an attempt to obtain the best generalization. The hyperplane is represented by a normal vector w and an offset b. To ensure correctness, the optimization is subject to the requirement that every data point be classified correctly, i.e. every data point lies on the correct side of the plane:

  (\langle w, x^i \rangle - b) \cdot y_i \geq 1

As perfect linear separability is unlikely for real-world data, each data point is assigned a slack variable ξ_i such that ξ_i > 0 indicates that x^i violates the correctness condition. This yields the second objective, which is to minimize the total slack, so that the entire minimization problem becomes

  \text{minimize} \quad \frac{1}{2} \|w\|_2^2 + C \sum_i \xi_i \qquad \text{s.t.} \quad \forall i.\ (\langle w, x^i \rangle - b) \cdot y_i \geq 1 - \xi_i

The optimization is done by adjusting w, b and ξ_i, while C is a free parameter controlling the impact of misclassification on the total loss value. In fact, C is nothing more than an inverse regularization factor, as the loss function can be rewritten as \sum_i \xi_i + \lambda \|w\|_2^2, which illustrates that SVM learning is merely regularized hinge loss minimization. Typically this problem is solved via its Lagrangian dual:

  \text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x^i, x^j \rangle \qquad \text{s.t.} \quad \forall i.\ 0 \leq \alpha_i \leq C \ \ \text{and} \ \ \sum_i \alpha_i y_i = 0    (3)

Note that eq. (3) bears a striking similarity to eq. (1), the QUBO objective function, except that a) it is a maximization instead of a minimization problem, b) index j runs from 1 to n instead of to i, which makes for a full matrix instead of a lower triangular matrix, and c) α_i is continuous between 0 and C instead of Boolean. Points a) and b) are easy to deal with by flipping the sign and multiplying all entries except the main diagonal of the triangular matrix by 2. Point c) requires a radical simplification: instead of a continuous α_i ∈ [0, C], we assume that α_i is either 0 or C, reducing it to a Boolean indicator of whether x^i is a support vector or not. Thus we can substitute α_i by C · z_i with z_i ∈ B and end up with the following QUBO formulation:

  \sum_{i=1}^{n} -C z_i + \sum_{i=1}^{n} \sum_{j=1}^{i-1} C^2 t_{ij} z_i z_j + \sum_{i=1}^{n} \frac{1}{2} C^2 t_{ii} z_i \qquad \text{where} \quad t_{ij} = y_i y_j \langle x^i, x^j \rangle

In terms of eq. (2) we can express β as

  \beta_{ij} = \begin{cases} \frac{1}{2} C^2 t_{ii} - C & \text{if } i = j \\ C^2 t_{ij} & \text{otherwise} \end{cases}
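This β construction can be transcribed directly. The sketch below (our own helper, assuming a linear kernel, labels y_i ∈ {−1, +1} and a data matrix X with one row per point) builds the lower triangular QUBO matrix from (X, y) and C.

```python
# Sketch: QUBO matrix for the binarized SVM of Section 4.2.
import numpy as np

def svm_qubo_matrix(X, y, C=1.0):
    t = np.outer(y, y) * (X @ X.T)                 # t_ij = y_i y_j <x^i, x^j>
    beta = np.tril(C ** 2 * t, k=-1)               # off-diagonal (j < i): C^2 t_ij
    np.fill_diagonal(beta, 0.5 * C ** 2 * np.diag(t) - C)   # diagonal: C^2 t_ii / 2 - C
    return beta

# A QUBO solver minimizing z @ beta @ z over z in {0,1}^n then marks the
# support vectors, with alpha_i = C * z_i.
```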
4.3 Markov Random Field

Another typical NP-hard ML problem is to determine the most likely configuration of variables in a Markov random field (MRF), known as the MAP prediction problem [9]. Given the weight parameters θ_{u=x,v=y} ∈ Z of an MRF with graphical structure G = (V, E) and variables (X_v)_{v∈V} with finite state spaces (𝒳_v)_{v∈V}, we demonstrate two different QUBO encodings to maximize the joint probability over all possible variable assignments. These two encodings ultimately have the same objective, but lead to significantly different runtimes on our optimizer hardware, depending on the given MRF's structure. We assume θ to be integer-valued to fit our optimizer's architecture, and also because integer MRFs have been shown to be generally well-performing alternatives to MRFs with real-valued weights, especially when resources are constrained [7, 6].

For the first QUBO embedding of the MAP problem we encode the assignments of all X_v as a concatenation of one-hot vectors,

  (X_1 = x_1^{i_1}, \dots, X_m = x_m^{i_m}) \mapsto \underbrace{0 \dots 0\,1\,0 \dots 0}_{|\mathcal{X}_1|} \dots \underbrace{0 \dots 0\,1\,0 \dots 0}_{|\mathcal{X}_m|},

where m = |V| and x_k^i is the i-th value in 𝒳_k. The QUBO problem dimension is equal to the total number of variable states, n = \sum_{v \in V} |𝒳_v|. The weights are encoded into the quadratic coefficients: let ι : V × 𝒳 → N be a helper function that, given a vertex and a variable state, returns the corresponding bit index within the concatenated one-hot vector. Given two indices i = ι(u, x) and j = ι(v, y) with u ≠ v, the corresponding coefficient is β_ij = −θ_{u=x,v=y}; the negative sign turns MAP into a minimization problem. If u = v and i ≠ j, the indices belong to the one-hot encoding of the same variable; as a variable can only be in one state at a time, these two bits cannot both be 1, as this would lead to an invalid one-hot encoding. Therefore the corresponding coefficient must be set to a positive penalty value P, large enough to ensure that even the worst valid encoding has a lower loss value than P. On the other hand, if i = j then β_ij = 0, because the variable assignments have no weights on their own. In summary,

  \beta_{ij} = \begin{cases} 0 & \text{if } i = j \\ P & \text{if } i \neq j \text{ and } u = v \\ -\theta_{u=x,v=y} & \text{otherwise} \end{cases} \qquad \text{where } i = \iota(u, x),\ j = \iota(v, y)

The dimension can be reduced by removing bits with index i for which all β_ij are either 0 or P for all j ≤ i, as those bits can never be part of an optimal solution.

For a different QUBO embedding, we may assign a bit z_{u=x,v=y} to every non-zero weight θ_{u=x,v=y} between specific values x and y of two variables X_u and X_v, indicating that X_u = x and X_v = y (see Fig. 2). Again, to avoid multiple or conflicting variable assignments, we introduce penalty weights between pairs of edges that share a variable but disagree on its specific value (see Fig. 3).

Fig. 2. Schemes for embedding an MRF into QUBO: variables (X_v)_{v∈V} are represented as dotted circles containing their possible states, weights θ as blue edges between individual variable states (top). For the first embedding (bottom left), variable states become binary variables z with their respective quadratic coefficients β_ij = −θ_{u=x,v=y} if i = ι(u, x) and j = ι(v, y). For the second embedding (bottom right), all non-zero parameters θ become binary variables with −θ as their respective linear coefficient. For both embeddings, invalid variable combinations receive a penalty weight P, represented as red edges.

Fig. 3. Possible edge configurations. Top: valid configurations, where two edges are either disjoint or, if they share a variable, agree on its state. Bottom: invalid configurations, where, if both edges were "active", at least one variable would simultaneously be in two states. These pairs of edges require a penalty weight between their corresponding binary variables within the second embedding.

The result z* of the optimization is a vector of indicators for "active" edges, indexed by the set of all non-zero weights θ, from which we can derive an assignment for every variable, provided it has at least one incident edge of non-zero weight:

  X_v^* = \bigcup \{ \tilde{y} \mid u \in V \setminus \{v\},\ \tilde{x} \in \mathcal{X}_u,\ \tilde{y} \in \mathcal{X}_v,\ z^*_{u=\tilde{x},v=\tilde{y}} = 1 \}

For a valid encoding, the above set is always a singleton, so its union yields the contained element.
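As an illustration of the first (one-hot/state) embedding, the sketch below assembles the lower triangular QUBO matrix from a dictionary of state spaces and a dictionary of non-zero weights. The data layout, the helper names and the concrete choice of the penalty P are our own assumptions, and the dimension-reduction step described above is omitted for brevity.

```python
# Sketch: state (one-hot) QUBO embedding of MRF-MAP (Section 4.3, first embedding).
import numpy as np

def mrf_state_qubo(states, thetas, penalty):
    """states : dict vertex -> list of that variable's states
       thetas : dict ((u, x), (v, y)) -> non-zero integer weight theta_{u=x, v=y}
       penalty: positive P, large enough to dominate every valid assignment"""
    # iota: (vertex, state) -> bit index within the concatenated one-hot vector
    iota, offset = {}, 0
    for v, xs in states.items():
        for x in xs:
            iota[(v, x)] = offset
            offset += 1
    beta = np.zeros((offset, offset))              # lower triangular coefficients
    # penalty P between two different states of the same variable
    for v, xs in states.items():
        for a in range(len(xs)):
            for c in range(a):
                i, j = iota[(v, xs[a])], iota[(v, xs[c])]
                beta[max(i, j), min(i, j)] = penalty
    # -theta between states of two different variables
    for ((u, x), (v, y)), w in thetas.items():
        if u != v:
            i, j = iota[(u, x)], iota[(v, y)]
            beta[max(i, j), min(i, j)] = -w
    return beta, iota

# The 1-bits of the minimizer, read through iota, give the MAP state of each
# variable, provided the penalty is chosen large enough.
```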
5 Evaluation

We evaluated our embeddings using our optimizer hardware on multiple UCI datasets.

For 2-means clustering we took five datasets and compared the resulting k-means loss values of the Ising model embedding to Lloyd's algorithm [5] as implemented by R's kmeans method (https://stat.ethz.ch/R-manual/R-patched/library/stats/html/kmeans.html). For calculating β we chose the simplest case of a linear kernel. The results are listed in Table 1; the resulting clusterings are similar to those obtained through Lloyd's algorithm, though our loss values were mostly slightly higher. Our values tend to improve with higher data dimension d, surpassing Lloyd's algorithm on Sonar, which contains 60 numerical variables per data point. We assume that the generally slightly higher loss values are due to the simplifying assumption that the clusters are about equal in size, not due to insufficient convergence of our optimizer, as we reached identical optima in every run, leading to a standard deviation of 0 for every loss value. Convergence plots of our hardware optimizer for two exemplary datasets are shown in Fig. 4.

Table 1. 2-means loss values (sums of cluster variances) on different datasets; each optimization run on the Ising model was repeated 10 times, the EA parameters were µ = 2 and λ = 30. The columns "variables" and "data points" contain the data dimension d and the number of points n used for each experiment.

dataset      variables  data points  Lloyd        Ising model
Iris         4          150          152.3480     163.1659 ± 0.0
Abalone      8          500          1947.141     2121.1438 ± 0.0
Sonar        60         208          280.5821     280.5696 ± 0.0
Wilt         5          500          14663360     14734828 ± 0.0
Ionosphere   32         351          2387.2917    2388.3108 ± 0.0

Fig. 4. Convergence plots for the 2-means Ising model optimization performed on our hardware optimizer using different mutation rates κ, averaged over 10 runs. Left: Iris (d = 4, n = 150). Right: Sonar (d = 60, n = 208).

The SVM embedding was tested extensively on the Sonar dataset, where we created ten random splits of 2/3 training data and 1/3 test data. Using our hardware optimizer, we then trained the binary SVM on the training set of each split with C = 10^k for k ∈ {−3, −2, −1, 0, 1, 2} and calculated the accuracy on the respective test set; see Table 2 for the full results. For C = 10^{−3}, our optimizer found the optimum z* = 1, which means that every data point was a support vector. For larger C the number of support vectors decreased, until for C = 10^2 the optimum was z* = 0, which is why 10^2 is missing from the table: w and b cannot be calculated without at least one support vector. The best accuracy was achieved with C = 1 and came out to about 75%. On the same splits we trained a conventional SVM (libsvm via the svm method from R's e1071 package, https://www.rdocumentation.org/packages/e1071/versions/1.7-1/topics/svm) and found that the test accuracies were in fact very similar, the difference being less than one percent on average (see Table 3), which is a promising result considering the radical simplification steps taken for this model.
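For completeness, here is how a solution z* can be decoded back into a linear classifier in order to compute test accuracies. The weight vector w = Σ_i α_i y_i x^i with α_i = C z_i follows from the standard SVM dual; the bias estimate below (averaging ⟨w, x^i⟩ − y_i over the support vectors) is a common heuristic and our own assumption rather than a step spelled out above.

```python
# Sketch: decode the binary SVM solution z* into (w, b) and predict labels.
import numpy as np

def decode_svm(X, y, z, C):
    alpha = C * z                                  # alpha_i in {0, C}
    w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x^i
    sv = z.astype(bool)
    if not sv.any():
        raise ValueError("no support vectors: w and b cannot be recovered")
    # on the margin, (<w, x^i> - b) y_i = 1, i.e. b = <w, x^i> - y_i; average as a heuristic
    b = float(np.mean(X[sv] @ w - y[sv]))
    return w, b

def predict(X, w, b):
    return np.sign(X @ w - b)
```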
Table 2. Test accuracies for varying C values on ten random splits (2/3 train, 1/3 test) of the Sonar dataset. Values of the individual splits are ×10^{-4}.

C \ split   1     2     3     4     5     6     7     8     9     10    mean ± sd
0.001       6115  6619  6547  5899  6619  6619  6259  6331  6547  5899  0.6345 ± 0.0291
0.01        6547  6619  6619  5899  6835  6619  6763  6187  6547  6043  0.6468 ± 0.0313
0.1         7410  6403  6403  6165  7324  6835  7331  6547  6619  7770  0.6881 ± 0.0540
1           7712  7748  6691  7957  6986  7554  7540  7942  7432  7180  0.7474 ± 0.0413
10          6129  6209  6151  5676  5849  6129  6187  6777  6849  6511  0.6247 ± 0.0371

Table 3. Test accuracies for C = 1 on the same ten random splits achieved by libsvm. Values of the individual splits are ×10^{-4}.

split       1     2     3     4     5     6     7     8     9     10    mean ± sd
accuracy    7101  7246  7826  7826  7536  7681  7681  6667  8551  7536  0.7565 ± 0.0501

The two MRF embeddings were tested on the Sonar and Mushroom datasets, using weights θ from pre-trained integer MRFs with Chow-Liu tree structures. Let |θ| denote the number of non-zero weights of a trained MRF: while the MRF on Sonar had only few non-zero weights (|θ| = 102), Mushroom yielded a much more densely connected MRF (|θ| = 679). Convergence plots for Mushroom are shown in Fig. 5. The state embedding (where each bit encodes one variable state within a one-hot vector) is clearly superior on Mushroom, as it converges much more quickly to a very good optimum, whereas the edge embedding (where each bit encodes one value from θ) takes much longer and does not converge to nearly as good an optimum, even after several minutes. On the contrary, the edge embedding on Sonar is much more efficient than the state embedding (see Fig. 6). The different convergence rates are due to the different QUBO dimensions of the embeddings: the state embedding on Mushroom has dimension n_s = \sum_{v \in V} |𝒳_v| = 127, which can be further reduced to 107 by removing unnecessary bits as described in Section 4.3. The edge embedding, on the other hand, leads to a dimension n_e = |θ| = 679. As our hardware optimizer needs time on the order of n^2 to perform the optimization, the latter encoding led to runtimes about 40 times slower. The Sonar MRF has n_s = 662, which the reduction step improves significantly to 117. As |θ| = 102 (and consequently n_e = 102) for the Sonar MRF, the edge embedding is even better, though.

Fig. 5. Convergence plots for both QUBO embeddings of MRFs on the Mushroom dataset. Left: reduced state embedding. Right: edge embedding. The state embedding clearly converges much faster on this MRF.

As a general rule we observe that for relatively dense MRFs, the (reduced) state encoding leads to faster convergence, while for sparse MRFs with |θ| < \sum_{v \in V} |𝒳_v^θ| (where 𝒳_v^θ is the set of states of X_v that have at least one non-zero θ associated with them) the edge encoding is preferable.

6 Conclusion and Future Work

QUBO and Ising models constitute simple but versatile optimization problems that capture the essence of many problems highly relevant to machine learning, and in this work we have summarized some interesting embeddings of NP-hard machine learning problems into both models. A general hardware-based solver for either of these models is thus a powerful tool for machine learning, and the domain of bit vectors makes for an efficient problem representation even when resources are restricted, e.g. in embedded systems. Further, we showed that multiple embeddings of the same problem can exist that perform differently depending on properties of the problem instance. It would be interesting to uncover more embeddings of ML problems in the future, both with conventional hardware acceleration as presented here, but also with the advent of quantum annealing in mind.
Fig. 6. Convergence plot for the edge embedding on Sonar; unlike on Mushroom, this embedding leads to quick convergence to a good optimum.

Acknowledgement

This research has been funded by the Federal Ministry of Education and Research of Germany as part of the competence center for machine learning ML2R (01IS18038A).

References

1. Bauckhage, C., Ojeda, C., Sifa, R., Wrobel, S.: Adiabatic quantum computing for kernel k=2 means clustering. In: Proceedings of the LWDA 2018, pp. 21–32 (2018)
2. Boros, E., Hammer, P.L.: Pseudo-Boolean optimization. Discrete Applied Mathematics 123(1–3), 155–225 (2002)
3. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
4. Kadowaki, T., Nishimori, H.: Quantum annealing in the transverse Ising model. Physical Review E 58(5), 5355 (1998)
5. Lloyd, S.: Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137 (1982)
6. Piatkowski, N.: Exponential families on resource-constrained systems. Ph.D. thesis, Technical University of Dortmund, Germany (2018)
7. Piatkowski, N., Lee, S., Morik, K.: Integer undirected graphical models for resource-constrained systems. Neurocomputing 173, 9–23 (2016)
8. Schwefel, H.P., Rudolph, G.: Contemporary evolution strategies. In: European Conference on Artificial Life, pp. 891–907. Springer (1995)
9. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1(1–2), 1–305 (2008)