<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>MID: A New Strategy for Learning Optimal Decision Trees on Continuous Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Dal Maso</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harold Kiossou</string-name>
          <email>harold.kiossou@uclouvain.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siegfried Nijssen</string-name>
          <email>siegfried.nijssen@uclouvain.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ICTEAM</institution>
          ,
          <addr-line>UCLouvain, Place Sainte-Barbe 2, bte L5.02.01, B-1348 Louvain-la-Neuve</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Politecnico di Torino</institution>
          ,
          <addr-line>Corso Duca degli Abruzzi 24, 10129 Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Supervised Discretization</institution>
          ,
          <addr-line>Impurity Metrics</addr-line>
          ,
          <country>Optimal Decision Trees</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>of this method. Existing Optimal Decision Tree (ODT) algorithms, which are designed to find the best decision tree on training data, predominantly work on binary features and require the use of discretization algorithms to preprocess continuous ones. In most cases, these techniques make it unfeasible in practice to find an ODT. We propose a new approach for learning decision trees on continuous data by combining a new discretization algorithm, MID, with ODT algorithms. Given infinite time, this allows the ODT algorithm to find an ODT, but if interrupted early, the approach still produces a good decision tree. Experiments on 14 datasets provide evidence of the efectiveness Discovery Science - Late Breaking Contributions 2024 ∗Corresponding author.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Extended Abstract</title>
      <p>
        Decision tree learning algorithms are among the most widely used machine learning methods today,
thanks to the predictive power and interpretability of the learned models. Since finding an optimal
decision tree (that minimizes the training error) under constraints is NP-hard, greedy algorithms like
CART [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and C4.5 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have long been the preferred methods for this task. These algorithms are fast and
can handle all types of data, from binary to continuous, without the need for preprocessing. However,
they often find suboptimal trees that do not minimize error across the entire dataset. In recent years,
advancements in optimization solvers and novel algorithmic concepts have led to the development of
new methods for inferring decision trees that are optimal on training data (ODTs). Various approaches
have been proposed, including mixed integer programming [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
        ], constraint programming [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and
SAT solvers [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Among these, dynamic programming methods like DL8, DL8.5, and MurTree [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ]
have emerged as particularly efective, ofering both accuracy and eficiency in depth-constrained
settings. These methods leverage advanced optimization techniques to explore the search space more
thoroughly than traditional greedy algorithms. As these techniques are typically designed for binary
datasets, they often require continuous features to be discretized before the learning process – a step
that can significantly impact the quality and eficiency of the resulting decision tree. As a consequence,
there is no assurance that such a tree is optimal with respect to the training data.
      </p>
      <p>
        A number of discretization techniques have been proposed in the literature. Straightforward
techniques, such as equal-width and equal-frequency binning, often lead to suboptimal trees because they
fail to capture the underlying structure of the data efectively. More sophisticated methods, like the
MDLP discretizer [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or ChiMerge [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], use heuristics to select the most informative cut-points, ofering
better results in practice.
      </p>
      <p>All these discretization techniques share the following characteristics: (1) they operate on each
continuous feature individually; (2) they ofer parameters to determine the number of discretized
features generated per continuous feature. In general, depending on the choice of parameters, these
(S. Nijssen)</p>
      <p>CEUR</p>
      <p>ceur-ws.org
discretizers can either produce too many binary features, increasing the number of trees that optimal
algorithms have to consider, or oversimplify the data, leading to a loss of accuracy and learning trees
that are far from optimal.</p>
      <p>In this paper, we introduce the Minimum Impurity Discretizer (MID) as a new technique to address
these weaknesses. MID distinguishes itself by combining several characteristics: (1) it generates a set of
binary features using an iterative approach that starts with an empty set and adds new features to it at
each iteration; (2) it considers all input continuous features jointly when creating the new binary ones,
allowing some features to be discretized more finely than others; (3) it can be used with ODT algorithms
in an anytime manner, where the learner is run iteratively over data of increasing dimensionality, and
the search process stops when resources are no longer available; (4) it allows one to find good decision
trees even when a small set of binary features is used for training.</p>
      <p>
        Algorithm 1 MID training process
Require: Dataset  , target variable  , number of output binary features 
1:   _ _ _   ← empty list [ ]
2: for  in columns of  do
3:  and  are sorted based on the values of 
4:  ← compute_and_rank_feature_candidates(,  )
5:  is appended to   _ _ _  
6: ℎ ℎ ← get_best_thresholds(  _ _ _  
7:
8: Procedure get_best_thresholds( _ _   ,  )
9: ℎ ℎ ← empty list [ ]
10: while length(ℎ ℎ ) &lt;  do
11: ,  _  _ ← −1, −∞
12:  ← 0
13: for (, ) in enumerate( _ _   ) do
14: if length( ) &gt; 0 then
15: if   &lt; [0][
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] then
16:  ← 
17:  _  _ ← [0][
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
18:  _ ← [0][0]
,  )
19:
20:
21:
22:
23:
24:
25:
else
      </p>
      <p>←  + 1
if  ==</p>
      <p>break
the tuple (, 
the head of 
return ℎ ℎ
length(
_ _</p>
      <p>) then
_)</p>
      <p>is appended to ℎ ℎ
_ _  [] is popped</p>
      <p>Algorithm 1 outlines how MID computes thresholds for discretizing a continuous dataset. First, each
feature is considered individually: for each of them, potential thresholds are ranked using an impurity
metric, such as entropy or the Gini index (line 4). Each threshold in this list is the one that, if applied to
further split the feature’s range, would minimize the impurity of the class labels, under the assumption
that all higher-ranked thresholds have been applied. The list  , introduced at line 4, consists of
tuples in the form ( _ℎ ℎ ,   _) , where   _ quantifies the reduction in
impurity achieved by applying  _ℎ ℎ .</p>
      <p>The lists associated with the individual features are then merged together, ensuring that the thresholds
retain their relative order from the original rankings. This is done by get_best_thresholds (lines 8-25),
which also extracts the top  thresholds from the resulting global ranking. The procedure iteratively
compares the top thresholds from the lists of all features, selecting and removing the best threshold
at each step. This process continues until  binary features are retrieved (line 10) or until no more
thresholds are available (lines 21-22). Each threshold in the global ranking is the one that would reduce
the impurity of the class labels the most if applied to further discretize the corresponding continuous
feature (lines 15-18). We consider this candidate as the optimal choice to continue the discretization.</p>
      <p>At this point, a set of  binary features representing the dataset can be obtained by selecting the
top  thresholds from the global ranking. Let  1 and  2 be two sets of thresholds containing  1 and
 2 elements, respectively. By construction, if  1 &lt;  2, then  1 ⊆  2, meaning that the latter set of
binary features is a superset of the former one. If an optimal decision tree learner is trained on both
datasets, its performance on  2 will be better than or equal to that on  1. MID makes it straightforward
to represent the observations in a dataset using a small number of binary features. Moreover, additional
features can easily be added if the user has more availability in terms of time and resources. Intuitively,
whenever a new binary feature is needed, MID selects both the best feature to split and the optimal cut
point according to the impurity minimization heuristic.</p>
      <p>As anticipated before, MID allows an ODT algorithm to be run repeatedly on input data with growing
dimensionality. If the process is run without a time limit, an optimal decision tree can be found, but the
search process can also be interrupted at any time for a smaller number of features. It is important to
notice that the way in which MID operates is particularly suitable for this approach, as producing more
than one set of binary features starting from the same continuous dataset only requires to train MID a
single time. Moreover, the diference between the binary datasets used in two consecutive iterations
consists in only one feature. Thus, repeating the discretization process multiple times only adds a small
overhead in terms of runtime. An extensive analysis of the pseudocode of MID and its characteristics is
available at https://github.com/antoniodalmaso/MID, together with the Python implementation used in
the experiments and all associated results.</p>
      <p>
        We now discuss the experiments. MID was evaluated using 14 datasets from the UCI Machine
Learning Repository [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], comparing its performance with that of other discretization methods and
greedy algorithms. Our experiments revealed that MID, when combined with DL8.5, consistently
achieves higher accuracy than both CART (trained on both binary and continuous data) and DL8.5
paired with other discretizers, such as MDLP1 or equal-frequency binning. For example, Figure 1 shows
1We used the implementation available at https://github.com/navicto/Discretization-MDLPC.
how the test accuracies obtained by DL8.5 and CART on the banknote dataset with a maximum depth
of 3 vary when diferent numbers of binary features are used for the training. The x-axis represents
the number of binary features fed to the learners. Please note that this axis hence does not indicate
how many features the resulting decision tree model uses. The accuracies achieved by CART on
continuous data are added as dots using the number of unique thresholds tested in the resulting trees as
x-coordinates. The results show that, initially, the test accuracies of CART and DL8.5 sharply increase
with the number of binary features produced by MID. Both then reach a plateau as the number of
features increases. This behavior can be observed in most of the experiments across all datasets, and it
suggests that one could determine how many binary features to use by iteratively adding them until
there is no significant change in the test accuracy (i.e., when the accuracy stops improving or decreases).
Furthermore, when comparing the number of features actually used in the final trees, as reported in
Figure 2 for the banknote dataset, it appears that MID and DL8.5 use less features (at most 6) than CART
on the continuous data. Thus, they manage to achieve higher accuracies while producing smaller trees.
Experiments on the remaining datasets give results consistent with those described above. Moreover,
in terms of run time, the proposed approach is performant: DL8.5’s performance is very good when its
number of input features is small, as is the case here.
      </p>
      <p>In conclusion, we introduced MID, a heuristic-based, supervised, multivariate discretizer designed to
enhance the performance of optimal decision tree algorithms like DL8.5 on continuous datasets. As a
future work, it could be interesting to develop an MDL criterion similar to the one used by MDLP to
make the finetuning of MID easier for the user, and to explore the efect of other impurity metrics on
its performance.</p>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgments</title>
      <p>Computational resources have been provided by the supercomputing facilities of the Université
catholique de Louvain (CISM/UCL) and the Consortium des Équipements de Calcul Intensif en
Fédération Wallonie Bruxelles (CÉCI) funded by the Fond de la Recherche Scientifique de Belgique (F.R.S.-FNRS)
under convention 2.5020.11 and by the Walloon Region.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          ,
          <article-title>Classification and regression trees</article-title>
          ,
          <source>Routledge</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          ,
          <year>C4</year>
          .
          <article-title>5: programs for machine learning</article-title>
          ,
          <source>Elsevier</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Aghaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vayanos</surname>
          </string-name>
          ,
          <article-title>Strong optimal classification trees</article-title>
          ,
          <source>Operations Research</source>
          (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .1287/opre.
          <year>2021</year>
          .
          <volume>0034</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bertsimas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dunn</surname>
          </string-name>
          ,
          <article-title>Optimal classification trees</article-title>
          ,
          <source>Machine Learning</source>
          <volume>106</volume>
          (
          <year>2017</year>
          )
          <fpage>1039</fpage>
          -
          <lpage>1082</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Verwer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Learning optimal classification trees using a binary linear program formulation</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>1625</fpage>
          -
          <lpage>1632</lpage>
          . doi:
          <volume>10</volume>
          .1609/aaai.v33i01.
          <fpage>33011624</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Boutilier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Michini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Shattering inequalities for learning optimal decision trees</article-title>
          ,
          <source>in: International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>90</lpage>
          . doi:
          <volume>10</volume>
          .1007/978- 3-
          <fpage>031</fpage>
          - 08011-
          <issue>1</issue>
          _
          <fpage>7</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Verhaeghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nijssen</surname>
          </string-name>
          , G. Pesant, C.-G. Quimper,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schaus</surname>
          </string-name>
          ,
          <article-title>Learning optimal decision trees using constraint programming</article-title>
          ,
          <source>Constraints</source>
          <volume>25</volume>
          (
          <year>2020</year>
          )
          <fpage>226</fpage>
          -
          <lpage>250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Narodytska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ignatiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Marques-Silva</surname>
          </string-name>
          ,
          <article-title>Learning optimal decision trees with sat</article-title>
          ,
          <source>in: International Joint Conference on Artificial Intelligence</source>
          <year>2018</year>
          ,
          <article-title>Association for the</article-title>
          <source>Advancement of Artificial Intelligence (AAAI)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1362</fpage>
          -
          <lpage>1368</lpage>
          . doi:
          <volume>10</volume>
          .24963/ijcai.
          <year>2018</year>
          /189.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Aglin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nijssen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schaus</surname>
          </string-name>
          ,
          <article-title>Learning optimal decision trees using caching branch-and-bound search</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>3146</fpage>
          -
          <lpage>3153</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Demirović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lukina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hebrard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bailey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leckie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramamohanarao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Stuckey</surname>
          </string-name>
          , Murtree:
          <article-title>Optimal decision trees via dynamic programming and search</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>23</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nijssen</surname>
          </string-name>
          , E. Fromont,
          <article-title>Mining optimal decision trees from itemset lattices</article-title>
          ,
          <source>in: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>530</fpage>
          -
          <lpage>539</lpage>
          . doi:
          <volume>10</volume>
          .1145/1281192.1281250.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>U. M. Fayyad</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>B</article-title>
          .
          <article-title>Irani, Multi-interval discretization of continuous-valued attributes for classification learning</article-title>
          ,
          <source>in: Ijcai</source>
          , volume
          <volume>93</volume>
          ,
          <string-name>
            <surname>Citeseer</surname>
          </string-name>
          ,
          <year>1993</year>
          , pp.
          <fpage>1022</fpage>
          -
          <lpage>1029</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kerber</surname>
          </string-name>
          ,
          <article-title>Chimerge: Discretization of numeric attributes</article-title>
          ,
          <source>in: Proceedings of the tenth national conference on Artificial intelligence</source>
          ,
          <year>1992</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Longjohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nottingham</surname>
          </string-name>
          ,
          <article-title>The uci machine learning repository</article-title>
          ,
          <source>last accessed: May</source>
          <volume>28</volume>
          ,
          <year>2024</year>
          . URL: https://archive.ics.uci.edu.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>