<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Mapped supervised learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guillermo Hernández</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angélica González Arrieta</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Chamoso</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan M. Corchado</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Air Institute, IoT Digital Innovation Hub</institution>
          ,
          <addr-line>37188 Salamanca</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Electronics, Information and Communication, Osaka Institute of Technology</institution>
          ,
          <addr-line>535-8585 Osaka</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Grupo de Investigación BISITE, Departamento de Informática y Automática, Facultad de Ciencias, Universidad de Salamanca</institution>
          ,
          <addr-line>Pl. Caídos, s/n, 37008 Salamanca</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>ion capacity. The MSL paradigm represents a valuable addition to the ensemble methods toolkit and holds promise for improving the accuracy and interpretability of machine learning models in a variety of applications.</p>
      </abstract>
      <kwd-group>
        <kwd>machine learning</kwd>
        <kwd>supervised learning</kwd>
        <kwd>ensemble learning</kwd>
        <kwd>explainable machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Ensemble methods have emerged as a powerful approach to machine learning, offering improved
performance, robustness, and generalizability compared to traditional single-model approaches.
Ensemble methods work by combining the predictions of multiple models, typically through
some form of voting or averaging, with the aim of reducing bias and variance and improving
accuracy. While ensemble methods have been shown to be effective in many applications, they
also come with their own set of challenges, such as increased computational complexity, higher
memory requirements, and potential overfitting [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Ensemble methods are widely used in machine learning to improve the performance of
predictive models by combining the outputs of multiple base models, usually fit with the same
training method. Two main families of ensemble methods are commonly distinguished based
on their underlying principles: averaging and boosting.</p>
      <p>
        Averaging methods aim to reduce the variance of a predictive model by building several
base models independently and then combining their predictions. This approach reduces the
risk of overfitting to the training data and improves the generalization performance of the
model. A well-known example of averaging methods is bagging [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which uses bootstrap
samples to train multiple base models and combines their predictions by taking a simple average.
Variations using other sampling methods exist as well [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Another example is the random
forest algorithm, which builds an ensemble of decision trees by randomly selecting subsets of
features and observations at each node split [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Boosting methods, on the other hand, aim to reduce the bias of a predictive model by building
a sequence of base models that focus on the misclassified instances from the previous models.
Boosting combines several weak models to produce a powerful ensemble, where each model
contributes its expertise to the final prediction. A popular example of boosting is AdaBoost [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
which assigns higher weights to misclassified instances and trains a new base model on the
updated weighted training set. Gradient tree boosting [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is another example of boosting that
uses gradient descent to minimize the residual error between the predicted and true values.
      </p>
      <p>In addition to averaging and boosting, we propose a new ensemble method called the "mapping
method". This method involves fitting a set of independent models with disjoint subsets of
data using the same base learner. The mapping method decomposes complex problems into
simpler sub-problems that can be solved independently, reducing the complexity of the problem
and improving model accuracy. Our proposed method enables the use of different types of
base learners and can be easily parallelized for efficient computation. We demonstrate the
effectiveness of the mapping method on various machine learning applications in this paper.</p>
      <p>The remainder of this paper is organized as follows: Section 2 provides a formal description
of the proposed method. In Section 3, we present the results of a statistical evaluation of the
method using multiple regression tasks for a dataset. Finally, in Section 4, we summarize our
conclusions and discuss possible future directions for research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Formal description</title>
      <p>
        Supervised machine learning algorithms can be described, following [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], as a map $\mathcal{A}$ taking a collection of instances of a set $X$, each with a label in a set $Y$, to a so-called model, which is itself a map from $X$ to $Y$ chosen from a subset $\mathcal{F}$ of candidate models. Formally,
        $$\mathcal{A} : \bigcup_{n \in \mathbb{N}} (X \times Y)^n \to \mathcal{F}. \tag{1}$$
      </p>
      <p>Typically, these algorithms minimize a loss function, such as the quadratic loss $\mathcal{L}(f, (x, y)) = (f(x) - y)^2$, which is commonly used in one-dimensional real regression problems. However, the development of this work does not rely on the notion of loss, and the algorithms can be viewed solely as maps of the form of Equation 1.</p>
      <p>To introduce the mapping ensemble approach, consider any surjective map onto a finite set of integers, $m : X \to \{1, \dots, k\}$. This map decomposes the input space $X$ into the $k$ disjoint subsets defined by the inverse image $m^{-1}$, i.e., $\{X_i = m^{-1}(i) : i \in \{1, \dots, k\}\}$, such that $\bigcup_{i=1}^{k} X_i = X$ and $X_i \cap X_j = \emptyset$ whenever $i \neq j$.</p>
      <p>The mapping ensemble approach is based on the previously described decomposition of the
input space into disjoint subsets. This approach utilizes a set of independent models, all trained
with the same base learner, with each model being trained on a different subset $X_i$. Specifically,
the mapping of an algorithm $\mathcal{A}$ using a map $m$ is the map $\mathcal{M}_{\mathcal{A},m}$ defined as
        $$\mathcal{M}_{\mathcal{A},m}\big((x_i, y_i)_{i=1}^{n}\big)(x) = \mathcal{A}\big((x_i, y_i)_{\,i \,:\, m(x_i) = m(x)}\big)(x). \tag{2}$$
It is important to note that the map defined in Equation 2 has the same form as the supervised
learning algorithms described in Equation 1. Therefore, the mapping ensemble approach can itself be seen as a
supervised learning algorithm.</p>
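      <p>As a concrete illustration of Equation 2, the following minimal Python sketch fits one clone of a scikit-learn base learner per value of the map and routes each prediction through the corresponding sub-model. The class name MappedEstimator and its interface are our own illustration, not an implementation released with this paper.</p>
      <preformat>
# Minimal sketch of the mapping ensemble of Equation 2 (illustrative only).
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone


class MappedEstimator(BaseEstimator, RegressorMixin):
    """Fit one clone of `base_estimator` per value of the map m(x)."""

    def __init__(self, base_estimator, mapping):
        self.base_estimator = base_estimator
        self.mapping = mapping  # callable: sample (1-d array) -> group id

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        groups = np.array([self.mapping(x) for x in X])
        self.models_ = {}
        for g in np.unique(groups):
            mask = groups == g
            # Each sub-model only ever sees its own subset X_g = m^{-1}(g).
            self.models_[g] = clone(self.base_estimator).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Route every sample to the sub-model of its group; a group unseen
        # during fitting would raise a KeyError in this simplified sketch.
        preds = [self.models_[self.mapping(x)].predict(x.reshape(1, -1))[0]
                 for x in X]
        return np.array(preds)
      </preformat>
      <p>For instance, MappedEstimator(SVR(), mapping) plays the role of $\mathcal{M}_{\mathcal{A},m}$ when $\mathcal{A}$ is the algorithm implemented by SVR.</p>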
      <p>
        The choice of the mapping function $m$ is a hyperparameter that can be optimized using standard
procedures [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. Some natural options include using bijections for categorical attributes and
mapping real-valued attributes to the index of subintervals of a partition of their domain after
scaling transformations.
      </p>
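      <p>Because the map enters the MappedEstimator sketch above only as a constructor parameter, it can be tuned like any other hyperparameter. The candidate maps and the grid search below are purely illustrative assumptions.</p>
      <preformat>
# Treating the map m as a hyperparameter of the MappedEstimator sketch above
# (the candidate maps and the k-NN base learner are illustrative assumptions).
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

candidate_maps = [
    lambda x: int(x[0]),        # bijection on a categorical attribute
    lambda x: int(x[1] * 4),    # subinterval index of a scaled real attribute
]
search = GridSearchCV(
    MappedEstimator(KNeighborsRegressor(), candidate_maps[0]),
    param_grid={"mapping": candidate_maps},
    scoring="r2",
    cv=5,
)
# search.fit(X, y) would select the map with the best cross-validated r^2.
      </preformat>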
      <p>In the case of a classification problem, an alternative algorithm $\mathcal{B}$ can be used to construct $m$.
The resulting mapping ensemble can be represented as $\mathcal{M}_{\mathcal{A},\mathcal{B}}$, defined as
        $$\mathcal{M}_{\mathcal{A},\mathcal{B}}\big((x_i, y_i)_{i=1}^{n}\big) = \mathcal{M}_{\mathcal{A},\,\sigma \circ \mathcal{B}\left((x_i, y_i)_{i=1}^{n}\right)}\big((x_i, y_i)_{i=1}^{n}\big), \tag{3}$$
where $\sigma$ is an arbitrary bijection from the set of $k$ classes to the first $k$ integers.</p>
      <p>For regression problems, a similar strategy can be applied by replacing the bijection $\sigma$ with a partition as
described above.</p>
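      <p>The following sketch illustrates this construction for a regression target, reusing the MappedEstimator sketch above. The quantile binning and the decision tree standing in for $\mathcal{B}$ are illustrative assumptions, not choices prescribed by this paper.</p>
      <preformat>
# Sketch of building the map with an auxiliary learner for a regression task
# (Equation 3 combined with the partition mentioned above); the decision tree
# for B and the 3-bin quantile split are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def classifier_mapped(base_estimator, X, y, n_groups=3):
    """Build m with an auxiliary classifier B, then fit the mapping ensemble."""
    # Replace sigma by a partition of the target range into quantile bins.
    edges = np.quantile(y, np.linspace(0, 1, n_groups + 1)[1:-1])
    y_binned = np.digitize(y, edges)
    # B learns to predict the bin from the inputs; its prediction defines m.
    b = DecisionTreeClassifier().fit(X, y_binned)
    mapping = lambda x: int(b.predict(x.reshape(1, -1))[0])
    return MappedEstimator(base_estimator, mapping).fit(X, y)
      </preformat>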
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>
        To show that the mapped supervised learning ensemble is able to improve the results of some learning
algorithms, we will apply it to a simple set of problems built using the UCI Wine Data Set [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
which contains chemical analysis results for wines grown in the same Italian region by three
different cultivators. This dataset includes 13 attributes, one of which is the class label, and
the remaining 12 are quantitative measurements of constituents found in the wines. While the
dataset is commonly used for initial testing of new classification models, we will use it for a
different purpose. Specifically, we will treat the class label as a categorical attribute and create
12 separate regression tasks, each using all of the remaining attributes except one to predict the value of the
one left out. Note that these tasks can also be regarded as imputation-building tasks.
      </p>
      <p>We have employed a set of commonly used supervised learning methods from the scikit-learn
library [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to study these tasks. The methods considered include linear regression, support
vector machines, k-nearest neighbors, decision trees, random forests, extra randomized trees,
gradient boosting, and a baseline regressor. We used the default hyperparameter values available
in v1.2.2 for all the methods.</p>
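      <p>For concreteness, this is roughly how we read that list in scikit-learn terms; the constructors below are our interpretation of the method names, all left at their defaults.</p>
      <preformat>
# The base learners considered, as we read the list above (scikit-learn
# defaults; the constructor choices are our interpretation, not verbatim
# from the paper's code).
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

base_learners = {
    "Linear regression": LinearRegression(),
    "Support vector machine": SVR(),
    "k-nearest neighbors": KNeighborsRegressor(),
    "Decision tree": DecisionTreeRegressor(),
    "Random forest": RandomForestRegressor(),
    "Extra trees": ExtraTreesRegressor(),
    "Gradient boosting": GradientBoostingRegressor(),
    "Baseline regression": DummyRegressor(strategy="mean"),
}
      </preformat>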
      <p>The mapping $m$ is defined as the projection of the first attribute (sometimes denoted $\Pi_1$),
which is here the categorical attribute. We have chosen this simple mapping to study the
scenario where a simple categorical attribute suggests splitting the dataset. Although more
complicated mappings could be used, including the use of other machine learning methods, for
the purpose of this work, we will focus on this simpler scenario.</p>
      <p>[Figure 1: Histograms of the count of experiments by $\Delta r^2$ for each model (linear regression, support vector machine, k-nearest neighbors, decision tree, random forest, extra trees, gradient boosting, baseline regression). (a) All of the results. (b) Statistically significant results ($p$ &lt; 0.05).]</p>
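      <p>The sketch below shows how one of the 12 tasks and the $\Pi_1$ map can be set up. The use of load_wine, the column ordering, and the reuse of the MappedEstimator sketch from Section 2 are our assumptions about the setup.</p>
      <preformat>
# One of the 12 regression tasks on the UCI Wine data, with the map Pi_1
# (projection onto the categorical class attribute); the column handling is
# our assumption about the setup, reusing the MappedEstimator sketch above.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.svm import SVR

wine = load_wine()
# Put the cultivar label first so that it becomes the categorical attribute.
data = np.column_stack([wine.target, wine.data])

target_col = 1                    # leave out the first chemical measurement
y = data[:, target_col]
X = np.delete(data, target_col, axis=1)

mapping = lambda x: int(x[0])     # Pi_1: value of the categorical attribute
plain_model = SVR()
mapped_model = MappedEstimator(SVR(), mapping)
# Both models can now be compared with the 5x2cv test sketched below.
      </preformat>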
      <p>The performance of the models will be evaluated using $r^2$ to enable comparison across tasks.
To determine whether the differences between the original model and the mapped model are
statistically significant, we will use the 5x2cv paired $t$ test, which was proposed by Dietterich to
address the limitations of traditional cross-validation or resampling tests [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This allows us to
obtain measures of $r^2$ for both the original and mapped methods, and to calculate their difference
$\Delta r^2$, along with a $p$-value that can be used to determine whether such a difference is statistically
significant.</p>
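      <p>The following sketch shows how such a 5x2cv paired $t$ test can be computed for the $r^2$ difference between two models. The helper name and the two-sided $p$-value convention are our assumptions rather than details given in the text.</p>
      <preformat>
# Sketch of the 5x2cv paired t test (Dietterich [12]) applied to r^2 scores;
# the function name and two-sided p-value convention are our assumptions.
import numpy as np
from scipy import stats
from sklearn.base import clone
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold


def paired_5x2cv_ttest(model_a, model_b, X, y, seed=0):
    """Return (t, p) for the difference in r^2 between two models."""
    rng = np.random.RandomState(seed)
    first_diff = None
    variances = []
    for _ in range(5):                       # 5 replications of 2-fold CV
        kf = KFold(n_splits=2, shuffle=True, random_state=rng.randint(10**6))
        fold_diffs = []
        for train, test in kf.split(X):
            a = clone(model_a).fit(X[train], y[train])
            b = clone(model_b).fit(X[train], y[train])
            fold_diffs.append(r2_score(y[test], a.predict(X[test]))
                              - r2_score(y[test], b.predict(X[test])))
        if first_diff is None:
            first_diff = fold_diffs[0]       # p_1^(1) in Dietterich's notation
        mean = np.mean(fold_diffs)
        variances.append(sum((d - mean) ** 2 for d in fold_diffs))
    t = first_diff / np.sqrt(np.mean(variances))
    p = 2 * stats.t.sf(abs(t), df=5)         # two-sided p-value, 5 dof
    return t, p
# Delta r^2 can be estimated, e.g., as the mean difference across the ten folds.
      </preformat>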
      <p>Experiments where both the original model and the mapped model have an $r^2$ score of less than
0.1 will be excluded from the analysis. This is because any model whose performance is not
superior to the baseline model should be rejected in favor of the baseline model. Additionally, a
failure to pass the statistical test could also be influenced by a lack of sufficient data. Therefore,
we will report two types of analyses: one that includes all results and another that only includes
statistically significant results.</p>
      <p>Figure 1 displays the evaluation results, with Figure 1a presenting all the results (provided
that at least one of the two $r^2$ scores is greater than 0.1, as explained earlier), and Figure 1b
presenting only the statistically significant results ($p$ &lt; 0.05). As shown, the mapping is effective
in improving the performance of simple models such as support vector machines and k-nearest
neighbors, sometimes by a large margin in terms of $r^2$. These models have the added benefit
of being easy to interpret and providing reasonable extrapolations, particularly in the case of
linear models. Baseline models (which here always predict the mean value) also benefit from
the mapping, demonstrating that when a clear option exists to split a dataset, the concept of a
baseline model can be adapted accordingly.</p>
      <p>However, tree-based models such as decision trees, extra randomized trees, and gradient
boosting do not exhibit any improvement. This is unsurprising, as these algorithms naturally
incorporate dataset splitting into their behavior. Nevertheless, these algorithms could be used
as an alternative method to define a mapping, which we plan to explore in future work.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this paper, we have introduced mapped supervised learning, an ensemble method with
applications to supervised learning. Our proposed method was evaluated on multiple regression
tasks using the UCI Wine Data Set and a simple mapping function defined by the original class
label. Our findings indicate that the method can significantly improve the results of several
regression methods, including support vector machines and k-nearest neighbors, as well as
improving baseline models.</p>
      <p>However, we have also observed that tree-based methods, which naturally incorporate the
discovery of the simple mapping used here in their algorithms, do not significantly benefit from
the application of the mapping ensemble technique. In future work, we plan to propose an
alternative mapping construction technique that could improve the performance of tree-based
models.</p>
      <p>Moreover, we aim to further evaluate the supervised learning application of the mapping
technique and explore its potential application to other paradigms of machine learning.
Additionally, we intend to investigate the interpretability of the mapped models and how the
mappings can be used to gain insights into the relationships between features and the target
variable. We believe that the proposed mapping ensemble technique has significant potential
for enhancing the performance and interpretability of supervised learning models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>A survey on ensemble learning</article-title>
          ,
          <source>Frontiers of Computer Science</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>241</fpage>
          -
          <lpage>258</lpage>
          . doi:10.1007/s11704-019-8208-z.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          , Bagging predictors,
          <source>Machine learning 24</source>
          (
          <year>1996</year>
          )
          <fpage>123</fpage>
          -
          <lpage>140</lpage>
          . doi:10.1007/BF00058655.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          ,
          <article-title>Pasting small votes for classification in large databases and on-line</article-title>
          ,
          <source>Machine learning 36</source>
          (
          <year>1999</year>
          )
          <fpage>85</fpage>
          . doi:10.1023/A:1007563306331.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          , Random forests,
          <source>Machine learning 45</source>
          (
          <year>2001</year>
          )
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          . doi:10.1023/A:1010933404324.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Freund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Schapire</surname>
          </string-name>
          ,
          <article-title>A decision-theoretic generalization of on-line learning and an application to boosting</article-title>
          ,
          <source>Journal of computer and system sciences 55</source>
          (
          <year>1997</year>
          )
          <fpage>119</fpage>
          -
          <lpage>139</lpage>
          . doi:10.1006/jcss.1997.1504.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <article-title>Greedy function approximation: a gradient boosting machine</article-title>
          ,
          <source>Annals of statistics</source>
          (
          <year>2001</year>
          )
          <fpage>1189</fpage>
          -
          <lpage>1232</lpage>
          . doi:10.1214/aos/1013203451.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Grohs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kutyniok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Petersen</surname>
          </string-name>
          ,
          <source>The Modern Mathematics of Deep Learning</source>
          , Cambridge University Press,
          <year>2022</year>
          , p.
          <fpage>1</fpage>
          -
          <lpage>111</lpage>
          . doi:10.1017/9781009025096.002.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bergstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Random search for hyper-parameter optimization</article-title>
          ,
          <source>Journal of machine learning research 13</source>
          (
          <year>2012</year>
          ). doi:10.5555/2188385.2188395.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jamieson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>DeSalvo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rostamizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talwalkar</surname>
          </string-name>
          ,
          <article-title>Hyperband: A novel bandit-based approach to hyperparameter optimization</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>18</volume>
          (
          <year>2017</year>
          )
          <fpage>6765</fpage>
          -
          <lpage>6816</lpage>
          . doi:10.5555/3122009.3242042.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lichman</surname>
          </string-name>
          ,
          <article-title>UCI machine learning repository</article-title>
          ,
          <year>2013</year>
          . URL: https://archive.ics.uci.edu/ml.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          , A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al.,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research 12</source>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          . doi:10.5555/1953048.2078195.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Dietterich</surname>
          </string-name>
          ,
          <article-title>Approximate statistical tests for comparing supervised classification learning algorithms</article-title>
          ,
          <source>Neural Computation 10</source>
          (
          <year>1998</year>
          )
          <fpage>1895</fpage>
          -
          <lpage>1923</lpage>
          . doi:10.1162/089976698300017197.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>