<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From Inductive Databases to a Modeling Language for Pattern Mining, to a Next-Generation Library for Constraint Solving</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tias Guns</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KU Leuven</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>At the occasion of the 20th anniversary of the KDID workshop, I share the tale of how the inductive databases concept has inspired our work on MiningZinc, a constraint-based language for pattern mining; and how even today it influences how we are building a next-generation modeling library for constraint solving. The inductive databases idea [1] called for systems to 1) support both traditional (SQL) database queries as well as knowledge discovery (KDD) queries, 2) that the output of both queries should be basic objects that can then be used in further queries (the closure principle), and 3) that in such nested cases, conditions from one query could flow into the processing of the next. At the time, there were multiple query-language extensions proposed that aimed at putting KDD queries in database management systems. They were often thin layers that expose the parameters of a KDD system, an itemset mining or rule learning system, as attributes that could be constrained in the WHERE part of the KDD query. We felt that it did not support concept 3) suficiently, that it does not allow for more complex WHERE conditions which the system would be able to take into account. During my PhD I developed the use of Constraint Programming for solving Constraint-based Itemset Mining problems [2]. In the constraint programming literature, there is a rich study of modeling languages. A modeling language is a declarative algebraic language that allows to specify constraints over (integer or Boolean) decision variables, and to compute one solution, enumerate all or optimize an objective function. MiningZinc To better support concept 3), instead of extending SQL with complex KDD queries, we decided to investigate the use of algebraic modeling languages to express complex</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>KDD queries. Examples of complex KDD queries include aggregation and weigted sum
constraints over the items or transactions. Figure 1 shows an example model in this language, called
MiningZinc and based on the popular MiniZinc constraint solving language.</p>
      <p>This clearly is a declarative language, though not a query language. Still, it allows to analyse
the specified model and determine how it should be evaluated. In MiningZinc, our model
analyzer generates multiple execution plans, just like a database management system, which
can be grouped into 3 categories: a) directly translating the specified problem to a specialized
itemset mining that supports all specified constraints; b) identifying that the core of the problem
is supported by specialized itemset mining algorithms but that some constraints are not, where
the system then uses the algorithm to enumerate patterns and filters that output using the
additional constraints; and c) translating the specified model to a generic constraint solver.
These execution plans were then ranked using a heuristic estimating the eficiency of the plan,
and the top one was executed.</p>
      <p>While it supported concept 3), nesting queries and their processing, really well, it left concept
1) and 2) open, namely: where does the data come from, and where does it go to.</p>
      <p>The text-based language was great for query optimisations at the KDD query level (what
the possible execution plans are, and which one is expected to be most eficient). However it
did not satisfy the closure principle, as data loading and post-processing had to be written in a
procedural programming language. Actually, looking back, most of our efort in developing the
system was in developing glue code to connect diferent systems together. And even more glue
code would have been needed to support generic data loading and post-processing.
2. From MiningZinc to a constraint modeling language
Today, the MiniZinc language that we had built upon is still the most popular modeling language
for modeling Constraint Programming problems. However, the questions: where does the data
come from (concept 1) and what to do with the results afterwards (concept 2), which were
unanswered in our MiningZinc extension, are still unanswered in the MiniZinc language.</p>
      <p>Meanwhile my research interests have shifted from using Constraint Programming (CP)
for data mining/machine learning (ML) to the opposite direction: using ML to learn part of
the CP problem specification, such as learning coeficients of an objective function, learning
preferences of operators and generating (optimal) explanations of UNSAT and SAT problems.</p>
      <p>For these use cases, integration with a) machine learning libraries, 2) multiple solver
technologies 3) data loading libraries, and 4) visualisation libraries prompted us to develop a new
Python-based library for CP.</p>
      <p>Constraint solving systems from the viewpoint of inductive databases Our needs for
this system fit exactly the concepts that were brought forward for inductive database systems:
1) support data manipulation (data queries) as well as computing the solutions to combinatorial
problems (solve queries); 2) that the output of solve queries should be basic objects that can then
be used in further queries (the closure principle); and 3) that in such nested cases, conditions
from one type of operation flow into the processing of the next.</p>
      <p>
        Notably, the CPMpy system [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]1 we are developing as part of my ERC Consolidator Grant
is a library that builds symbolic expression trees that can be reasoned on, rather than a
standalone language like SQL. We remark that in the machine learning field in general, the use of
programmatic manipulation of data, numpy arrays more specifically, is much more prevalent
then query languages. However, the functions ofered by numpy and other scientific libraries
like pandas are high level and often close the algebraic manipulations underlying SQL systems.
      </p>
      <p>So a first import design decision in CPMpy is to adapt the numpy array as basic object that
will enable the closure principle, meaning: numpy array’s in and numpy array’s out (note that
they can be matrices, but also vectors or higher-order tensors; in contrast to tables all elements
are typically of the same type though).</p>
      <p>The key part of our modeling library is hence to have decision variables be numpy arrays,
thereby allow all operations that you do on such arrays, to do them with the same syntax and
meaning on decision variables. If the operations are done on data, the output is data; if the
operations are done on decision variables, the output are expression trees of relations between
variables, for which a solver can compute satisfying or optimal assignments. After a solver
is called on them, the decision variables can be asked for their value with the var.value()
function, which returns a numpy array with the values as data.</p>
      <p>
        In this framework, the circle is hence round, and the three design goals are achieved.
Here are some examples of what this enables, in a single framework with a single syntax:
1. Diverse solutions[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] We read data from an excel file, use that to construct a packing
problem, find the optimal solution; then feed that solution back into the packing problem
ensuring that the next solution is suficiently diferent, repeating this  times. (Equally
applicable to diverse pattern mining with a constraint programming formulation of
itemset mining for example.))
Data is from an external source, used to create constraints, and solutions are used as data
to create new constraints in turn.
1https://github.com/CPMpy/cpmpy
      </p>
      <p>There is no other constraint programming system that supports these operations in a single
system with a single syntax, that is, in which data queries and solve queries are equal first-class
citizens.</p>
      <p>Who knows, perhaps the reason is that other constraint programming researchers had not
read the seminal paper of Imielinksi and Mannila.</p>
      <p>
        Acknowledgments I would like to wholeheartedly thank Siegfried Nijssen, Luc De Raedt, Anton
Dries and all partners of the EU ICON project [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for the many interesting exchanges we had that shaped
these ideas.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Imielinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          ,
          <article-title>A database perspective on knowledge discovery, Commun</article-title>
          . ACM
          <volume>39</volume>
          (
          <year>1996</year>
          )
          <fpage>58</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Guns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nijssen</surname>
          </string-name>
          , L. De Raedt,
          <article-title>Itemset mining: A constraint programming perspective</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>175</volume>
          (
          <year>2011</year>
          )
          <fpage>1951</fpage>
          -
          <lpage>1983</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Guns</surname>
          </string-name>
          ,
          <article-title>Increasing modeling language convenience with a universal n-dimensional array, cppy as python-embedded example</article-title>
          .,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hebrard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hnich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. O</given-names>
            <surname>'Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Walsh</surname>
          </string-name>
          ,
          <article-title>Finding diverse and similar solutions in constraint programming</article-title>
          ,
          <source>in: AAAI</source>
          , volume
          <volume>5</volume>
          ,
          <year>2005</year>
          , pp.
          <fpage>372</fpage>
          -
          <lpage>377</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mulamba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Canoy</surname>
          </string-name>
          , T. Guns,
          <article-title>Hybrid classification and reasoning for image-based constraint solving</article-title>
          ,
          <source>in: International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>364</fpage>
          -
          <lpage>380</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mulamba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diligenti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lombardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bucarey</surname>
          </string-name>
          , T. Guns,
          <article-title>Contrastive losses and solution caching for predict-and-optimize</article-title>
          , in: Z.
          <string-name>
            <surname>Zhou</surname>
          </string-name>
          (Ed.),
          <source>Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI</source>
          <year>2021</year>
          , Virtual Event / Montreal, Canada,
          <fpage>19</fpage>
          -27
          <source>August</source>
          <year>2021</year>
          ,
          <article-title>ijcai</article-title>
          .org,
          <year>2021</year>
          , pp.
          <fpage>2833</fpage>
          -
          <lpage>2840</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bessiere</surname>
          </string-name>
          , L. De Raedt,
          <string-name>
            <given-names>T.</given-names>
            <surname>Guns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kotthof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nijssen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. O</given-names>
            <surname>'Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paparrizou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Simonis</surname>
          </string-name>
          ,
          <article-title>The inductive constraint programming loop</article-title>
          ,
          <source>IEEE Intelligent Systems</source>
          <volume>32</volume>
          (
          <year>2017</year>
          )
          <fpage>44</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>