Knowledge Injection via ML-based Initialization of Neural
Networks
Lars Hoffmann1 , Christian Bartelt1 and Heiner Stuckenschmidt2
1 University of Mannheim, Institute for Enterprise Systems, L15, 1-6, 68131 Mannheim, Germany
2 University of Mannheim, Chair of Artificial Intelligence, B6, 26, 68131 Mannheim, Germany


Abstract
Despite the success of artificial neural networks (ANNs) for various complex tasks, their performance and training duration heavily depend on several factors, such as high data volume and quality. In many application domains, these requirements are not satisfied. To tackle this issue, different ways of injecting existing domain knowledge into the ANN generation process have provided promising results. However, the initialization of ANNs is mostly overlooked in this paradigm and remains an important scientific challenge. In this paper, we present a machine learning framework that enables an ANN to perform a semantic mapping from a well-defined, symbolic representation of domain knowledge to the weights and biases of an ANN with a specified architecture.

Keywords
Knowledge Injection, Neural Networks, Initialization, Machine Learning



KINN@CIKM’21: Proceedings of the CIKM Workshop on Knowledge Injection in Neural Networks, November 1, 2021, Online Virtual Event
Email: hoffmann@es.uni-mannheim.de (L. Hoffmann); bartelt@es.uni-mannheim.de (C. Bartelt); heiner@informatik.uni-mannheim.de (H. Stuckenschmidt)
Homepage: https://www.uni-mannheim.de/ines/ueber-uns/wissenschaftliche-mitarbeiter/lars-hoffmann (L. Hoffmann); https://www.uni-mannheim.de/ines/ueber-uns/wissenschaftliche-mitarbeiter/dr-christian-bartelt (C. Bartelt); https://www.uni-mannheim.de/dws/people/professors/prof-dr-heiner-stuckenschmidt/ (H. Stuckenschmidt)
ORCID: 0000-0002-9667-0310 (L. Hoffmann); 0000-0003-0426-6714 (C. Bartelt); 0000-0002-0209-3859 (H. Stuckenschmidt)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1. Introduction

Despite the substantial achievements of artificial neural networks (ANNs), driven by high generalization capabilities, flexibility, and robustness, their training duration and performance still highly depend on several factors, such as the network architecture, the loss function, the initialization method, and, most importantly, the available training data. However, in many real-world applications, e.g., in safety-critical systems, there are various issues regarding data collection and generation. In these scenarios, the capabilities of exclusively data-oriented approaches to train an ANN are limited.

To tackle these domain-specific challenges, the concept of integrating or injecting existing domain knowledge into the generation process of ANNs becomes increasingly attractive in research and practice, as indicated by several survey papers for machine learning (e.g., [1, 2, 3, 4]) as well as deep learning in particular (e.g., [5, 6, 7]). Furthermore, this paradigm bears the potential to mitigate general weaknesses of ANNs, like slow convergence speed, high data demands, and the risk of getting stuck at local minima or saddle points [8].

Consequently, there is a great variety of approaches in this field focusing on different elements within the ANN generation process. One prominent category targets the learning process by adding domain-specific constraints or loss terms to the cost function, such as [9], [10], [11], and [12]. However, there is little to no research on initializing the weights and biases of an ANN based on domain knowledge. Such knowledge can act as a pointer towards a promising starting point in the optimization landscape. The resulting “warm start” of the learning process can reduce the required training time as well as improve the overall performance. This effect shall be exploited efficiently by the framework presented in this paper.

With this goal in mind, the existing collection of network initialization techniques was analyzed. Aguirre and Fuentes [13] define three groups. “Data-independent” methods are based on randomly drawing samples from different distributions, e.g., LeCun [14], Xavier [15], and He [16]. “Data-dependent” approaches, such as WIPE [17], LSUV [18], and MIWI [19], additionally take statistical properties of the available training data into account. Approaches within the third group, like [20, 21, 22, 23, 24], apply the concept of “pre-training”. Their goal is to learn an ANN on a related problem (with sufficient availability of high-quality data) and use it as an initialization for the primary task. Consequently, they are not limited to the actual training data and are thus, to some extent, independent of task-specific data issues. Although none of these approaches explicitly considers domain knowledge, they could be adapted to pre-train an ANN on synthetic data encoding domain knowledge, as also proposed by Karpatne et al. [4]. These data can be generated, for instance, by simulations or by querying a domain model.
However, this comes with several disadvantages. First, the data samples must be efficiently generated to fully represent the domain knowledge, and second, the ANN must be able to learn the contained knowledge. On top of this potential loss of information, the entire process must be repeated every time the domain knowledge or the target network architecture changes.

Replacing this indirect, data-based knowledge transfer from an already existing domain model into an ANN with a direct mapping or transformation can potentially solve these problems. Although they do not relate to knowledge injection, several authors have engineered explicit mapping algorithms for different representations. A prominent example is Decision Trees (DTs), because they also have a graph-based structure. Early work was already performed in the 1990s (e.g., [25, 26, 27, 28, 29]), but this topic has recently been receiving more attention again (e.g., [30, 31, 32, 33]). Nevertheless, there are two major shortcomings if such mappings are applied to knowledge injection via initialization. On the one hand, they are model-specific and hard to engineer, which makes them impractical considering the diversity of knowledge representations. On the other hand, they cannot map to arbitrary ANN architectures, which may restrict an ANN’s ability to discover new characteristics in the subsequent optimization. This becomes increasingly critical as the gap between the expressed knowledge and the entire task complexity widens.

Instead of engineering such mappings by hand, this paper introduces a machine learning (ML) framework capable of training an ANN to become a semantic mapping from a well-defined, algebraic representation of domain knowledge to a network’s weights and biases. We call such a mapping a “Transformation Network” or 𝒯-Net. This data-driven framework can be applied to various model algebras, such as DTs or polynomials, with only slight adaptations. Thereby, it tackles the challenge of variability in domain knowledge representations by transferring the complex mapping generation from humans to machines. Furthermore, an arbitrary network structure can be selected as the 𝒯-Net output, which achieves independence between the complexity of the domain model and that of the target ANN.


2. Framework and Approach

In this section, we give a brief introduction to (1) how the proposed framework trains an ANN to become a semantic mapping (𝒯-Net) from the internals of a given algebraic model to a network’s weights and biases, and (2) how to utilize its capabilities for knowledge injection via ANN initialization.

2.1. 𝒯-Net Generation

Before a given task-specific function approximation, also referred to as the domain model, can be injected, a suitable 𝒯-Net must be generated once with the proposed ML framework. This can be done completely with synthetic data. The overall objective is to maximize the transformation fidelity between the input function and the predicted ANN. A schematic overview of the framework, consisting of three main steps, is shown in Figure 1.

2.1.1. Algebra Selection and 𝜆-Function Generation

At first, a diverse set of functions Λ in the same algebra as the given task-specific domain model is created. If such a domain model is not already defined, a well-suited algebra for representing the existing domain knowledge needs to be selected and the model generated. This can be done implicitly by pre-training or explicitly by an expert. However, each function 𝜆𝑖 ∈ Λ must operate on the same solution space defined by the overall task to be solved, for instance, a binary classification or regression problem. In addition, each function 𝜆𝑖 requires a set of 𝑁 representative examples 𝒟𝜆𝑖, given as

    𝒟𝜆𝑖 = {(𝑥𝑖,𝑗 , 𝑦𝑖,𝑗 ) | 𝑗 = 1, . . . , 𝑁 },   𝑖 = 1, . . . , |Λ|,

where 𝑥𝑖,𝑗 = (𝑥𝑖,𝑗1 , 𝑥𝑖,𝑗2 , . . . , 𝑥𝑖,𝑗𝐷 ) denotes one data point of dimensionality 𝐷 and 𝑦𝑖,𝑗 = 𝜆𝑖 (𝑥𝑖,𝑗 ) is the result of applying the function 𝜆𝑖 to 𝑥𝑖,𝑗 . How to generate Λ and how to set 𝑁 depend on the given context.
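To make this step concrete, the following minimal sketch in Python/NumPy generates a set Λ of random bivariate polynomials and their representative samples 𝒟𝜆𝑖. The polynomial algebra, the sampling ranges, the sample count, and all function names are illustrative assumptions loosely based on the evaluation setup in Section 3, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_random_polynomial(max_order=8, n_vars=2):
    """Draw one random lambda-function: a bivariate polynomial of random order."""
    order = int(rng.integers(0, max_order + 1))
    # one coefficient per monomial x1^p * x2^q with p + q <= order
    exponents = [(p, q) for p in range(order + 1) for q in range(order + 1 - p)]
    coeffs = rng.uniform(-1.0, 1.0, size=len(exponents))

    def lam(x):  # x has shape (N, 2)
        monomials = np.stack([x[:, 0] ** p * x[:, 1] ** q for p, q in exponents], axis=1)
        return monomials @ coeffs

    return exponents, coeffs, lam

def representative_samples(lam, n_samples=256, n_vars=2):
    """D_lambda_i: N input points x_ij with targets y_ij = lambda_i(x_ij)."""
    x = rng.uniform(-1.0, 1.0, size=(n_samples, n_vars))
    return x, lam(x)

# Lambda: a diverse set of functions from the chosen algebra, each with its samples
Lambda = [make_random_polynomial() for _ in range(10_000)]
D = [representative_samples(lam) for _, _, lam in Lambda]
```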
2.1.2. Data Preparation for 𝒯-Net Training

Before the 𝒯-Net training, the data and 𝜆-functions must be put into the correct shape, i.e., numeric vectors for ANNs. For this purpose, an encoding method, denoted as 𝐸𝑛𝑐𝜆, is required. Similarly, a decoding method 𝐷𝑒𝑐𝜇 enables the translation of the returned network weights and biases into an executable ANN 𝜇𝑖. The dataset required for the 𝒯-Net training is defined as

    𝒟𝒯 := {(𝐸𝑛𝑐𝜆 (𝜆𝑖 ), 𝒟𝜆𝑖 ) | 𝑖 = 1, . . . , |Λ|},

where each example is a tuple of the encoded function 𝜆𝑖 and its representative samples 𝒟𝜆𝑖. For clarification, these samples are not the target output of the 𝒯-Net, but are required for the loss calculation. This is described in the next step.
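The paper leaves 𝐸𝑛𝑐𝜆 and 𝐷𝑒𝑐𝜇 unspecified beyond their roles. As one possible instantiation for the polynomial algebra sketched above, the following code encodes a polynomial as a fixed-length coefficient vector and decodes a flat parameter vector into a one-hidden-layer network 𝜇𝑖; the 225-neuron target specification follows Section 3, while everything else (layout, activation, names) is an assumption.

```python
import numpy as np

D_IN, HIDDEN, D_OUT = 2, 225, 1        # fixed target ANN specification (cf. Section 3)
MAX_ORDER = 8
ENC_DIM = (MAX_ORDER + 1) ** 2         # length of the encoded lambda-function
N_PARAMS = D_IN * HIDDEN + HIDDEN + HIDDEN * D_OUT + D_OUT   # size of the T-Net output

def enc_lambda(exponents, coeffs, max_order=MAX_ORDER):
    """Enc_lambda: scatter the coefficients into a fixed-length vector indexed
    by the monomial exponents (p, q); absent monomials remain zero."""
    vec = np.zeros((max_order + 1) ** 2)
    for (p, q), c in zip(exponents, coeffs):
        vec[p * (max_order + 1) + q] = c
    return vec

def dec_mu(theta):
    """Dec_mu: split a flat parameter vector into the weights and biases of a
    one-hidden-layer network and return it as an executable function mu_i."""
    i = 0
    W1 = theta[i:i + D_IN * HIDDEN].reshape(D_IN, HIDDEN); i += D_IN * HIDDEN
    b1 = theta[i:i + HIDDEN];                               i += HIDDEN
    W2 = theta[i:i + HIDDEN * D_OUT].reshape(HIDDEN, D_OUT); i += HIDDEN * D_OUT
    b2 = theta[i:i + D_OUT]

    def mu(x):  # x has shape (N, D_IN)
        return np.tanh(x @ W1 + b1) @ W2 + b2

    return (W1, b1, W2, b2), mu
```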
2.1.3. 𝒯-Net Training

After the preparations, the 𝒯-Net is trained. Its weights and biases are adjusted based on the backpropagated prediction error over the training dataset 𝒟𝒯.
Figure 1: Overview of proposed machine learning framework for training a transformation ANN (𝒯-Net) in three main steps: function generation, data preparation and 𝒯-Net training.
This error measures how different each input function 𝜆𝑖 is compared to the currently predicted ANN counterpart 𝜇𝑖. To quantify this difference, a traditional task-specific measure (𝐸𝑟𝑟𝑜𝑟), e.g., categorical cross-entropy, is applied to the true values 𝑦𝑖 = 𝜆𝑖 (𝑥𝑖 ) and the predictions 𝜇𝑖 (𝑥𝑖 ), given the vector of all input samples 𝑥𝑖. The overall optimization goal can be formally described as

    minimize 𝐸𝑟𝑟𝑜𝑟(𝑦𝑖 , 𝜇𝑖 (𝑥𝑖 ))   for 𝑖 ∈ {1, 2, . . . , |Λ|}.

Thus, the 𝒯-Net training aims to maximize the transformation fidelity. By that, we want to enable the 𝒯-Net to generalize to previously unseen 𝜆-functions, making it a capable mapping for this family of functions.
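A compact sketch of this training loop is given below in PyTorch. The 𝒯-Net architecture, the mean squared error as the task-specific 𝐸𝑟𝑟𝑜𝑟 for a regression setting, the optimizer, and all names are assumptions; the decoding mirrors 𝐷𝑒𝑐𝜇 from the previous sketch so that the gradient can flow through the predicted network 𝜇𝑖 back into the 𝒯-Net.

```python
import torch
import torch.nn as nn

D_IN, HIDDEN, D_OUT = 2, 225, 1
ENC_DIM = 81                          # length of Enc_lambda vectors (order <= 8, two variables)
N_PARAMS = D_IN * HIDDEN + HIDDEN + HIDDEN * D_OUT + D_OUT

# T-Net: maps an encoded lambda-function to one flat weight/bias vector (architecture assumed)
t_net = nn.Sequential(nn.Linear(ENC_DIM, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, N_PARAMS))

def mu_forward(theta, x):
    """Differentiable forward pass of the decoded network mu_i (functional Dec_mu)."""
    i = 0
    W1 = theta[i:i + D_IN * HIDDEN].view(D_IN, HIDDEN); i += D_IN * HIDDEN
    b1 = theta[i:i + HIDDEN];                            i += HIDDEN
    W2 = theta[i:i + HIDDEN * D_OUT].view(HIDDEN, D_OUT); i += HIDDEN * D_OUT
    b2 = theta[i:i + D_OUT]
    return torch.tanh(x @ W1 + b1) @ W2 + b2

optimizer = torch.optim.Adam(t_net.parameters(), lr=1e-4)
error = nn.MSELoss()                  # task-specific Error for the regression setting

def train_step(enc_lam, x_i, y_i):
    """One update on one example (Enc_lambda(lambda_i), D_lambda_i) of the dataset D_T."""
    theta = t_net(enc_lam)            # predicted weights and biases of mu_i
    loss = error(mu_forward(theta, x_i).squeeze(-1), y_i)
    optimizer.zero_grad()
    loss.backward()                   # the error is backpropagated through mu_i into the T-Net
    optimizer.step()
    return loss.item()
```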
2.2. Knowledge Injection via 𝒯-Net Execution

After the one-time effort of generating a suitable 𝒯-Net, it is able to instantly initialize ANNs for all possible domain models within the trained function algebra. To do so, we just need to pass the encoded representation to the 𝒯-Net and let it predict the initial weights and biases. In the current state, one 𝒯-Net maps to ANNs with a pre-defined specification, i.e., architecture and activation functions. To achieve a high fidelity, it must be assumed that ANNs with this specification are capable of accurately approximating the input functions. However, if changes to the network specification are required, only the 𝒯-Net training must be repeated with an adapted output layer and/or 𝜇-decoding.
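A minimal usage sketch of this injection step, reusing the constants and the (trained) t_net from the previous sketch, could look as follows; the target model layout and the parameter copying are assumptions, not the authors' code.

```python
import torch

def initialize_from_domain_model(t_net, enc_vec):
    """Let the trained T-Net predict initial weights and biases for the target ANN
    from the encoded domain model, and load them into a standard network."""
    with torch.no_grad():
        theta = t_net(torch.as_tensor(enc_vec, dtype=torch.float32))
    i = 0
    W1 = theta[i:i + D_IN * HIDDEN].view(D_IN, HIDDEN); i += D_IN * HIDDEN
    b1 = theta[i:i + HIDDEN];                            i += HIDDEN
    W2 = theta[i:i + HIDDEN * D_OUT].view(HIDDEN, D_OUT); i += HIDDEN * D_OUT
    b2 = theta[i:i + D_OUT]

    model = torch.nn.Sequential(torch.nn.Linear(D_IN, HIDDEN), torch.nn.Tanh(),
                                torch.nn.Linear(HIDDEN, D_OUT))
    with torch.no_grad():
        model[0].weight.copy_(W1.T)   # nn.Linear stores weights as (out_features, in_features)
        model[0].bias.copy_(b1)
        model[2].weight.copy_(W2.T)
        model[2].bias.copy_(b2)
    return model                      # warm-started ANN, ready for regular training
```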
3. Evaluation

In this section, we briefly show that our framework yields promising results in practice in terms of the 𝒯-Net mapping fidelity as well as the effects of applying it for ANN initialization. To this end, we conducted experiments on polynomials as symbolic domain models to support solving random regression problems with two variables. The 𝒯-Net was trained on 10,000 polynomials with orders between 0 and 8 to find the closest ANN approximations with one hidden layer of 225 neurons.

Without extensive hyperparameter optimization, the 𝒯-Net achieved an average mapping fidelity, quantified by the coefficient of determination (𝑅²), of 0.77 (±0.26) over representative samples on a set of 2,500 test polynomials. Despite the noticeable distance to a perfect mapping (𝑅² = 1), this clearly demonstrates the learning capability of the proposed ML framework.

To investigate the impact of injecting knowledge by utilizing 𝒯-Nets on a given ANN task, the training duration and prediction performance were analyzed. A total of 2,500 synthetic regression problems were randomly created, and two ANNs were then trained on each problem: one lets the 𝒯-Net predict the initial weights and biases based on a polynomial approximation, and the second one applies the Xavier uniform initializer [15] as a benchmark. Early stopping was used to indicate convergence during the optimization. Besides the different initialization, all other factors and parameters remained the same.
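The following sketch illustrates this comparison protocol for one test problem. It reuses the helpers and constants from the previous sketches and assumes that the tensors x_train, y_train and the encoded polynomial approximation enc_vec have already been prepared; the early-stopping patience, learning rate, and loss choice are likewise assumptions rather than the authors' settings.

```python
import copy
import torch

def train_until_converged(model, x, y, patience=10, max_epochs=500, lr=1e-3):
    """Full-batch training with early stopping; returns the best MAE and epochs used."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()                   # MAE, as reported in the evaluation
    best, best_state, wait = float("inf"), None, 0
    for epoch in range(max_epochs):
        loss = loss_fn(model(x).squeeze(-1), y)
        if loss.item() < best:
            best, best_state, wait = loss.item(), copy.deepcopy(model.state_dict()), 0
        else:
            wait += 1
            if wait >= patience:                  # early stopping criterion hit
                break
        opt.zero_grad(); loss.backward(); opt.step()
    model.load_state_dict(best_state)
    return best, epoch + 1

# T-Net initialization vs. Xavier uniform benchmark on the same problem
model_tnet = initialize_from_domain_model(t_net, enc_vec)
model_base = torch.nn.Sequential(torch.nn.Linear(D_IN, HIDDEN), torch.nn.Tanh(),
                                 torch.nn.Linear(HIDDEN, D_OUT))
for layer in (model_base[0], model_base[2]):
    torch.nn.init.xavier_uniform_(layer.weight)   # benchmark initializer [15]

mae_tnet, epochs_tnet = train_until_converged(model_tnet, x_train, y_train)
mae_base, epochs_base = train_until_converged(model_base, x_train, y_train)
```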
By applying the 𝒯-Net for initialization, the prediction performance increased in 91% of the regression test cases, and 96% required fewer epochs to converge, i.e., to hit the early stopping criterion, compared to the naive benchmark. More specifically, the training duration was reduced on average by 64%. In addition, the resulting ANN performance in terms of the mean absolute error (MAE) improved on average by 2.7%. Figure 2 illustrates these two benefits in more detail.

This condensed evaluation demonstrates the benefits of the proposed framework and emphasizes its potential for knowledge injection into ANNs. A more sophisticated evaluation is currently work in progress.
Figure 2: Comparison between 𝒯-Net injection and Xavier uniform initialization in terms of (a) training duration and (b) performance.


4. Conclusion

In this paper, we have introduced a novel approach to knowledge injection into ANNs by utilizing existing domain models for initialization. To this end, a semantic mapping from the domain model’s internals to weights and biases is applied. Instead of engineering such an explicit mapping by hand, we designed a machine learning framework capable of training an ANN to perform this transformation. We call such a transformation network a 𝒯-Net. Besides the reduction of manual effort, it has the major advantage of decoupling the complexity of the domain model from that of the ANN space.

Based on promising initial experiments, we hypothesize that this framework can generate 𝒯-Nets with sufficient fidelity by appropriately addressing the following aspects: (1) its network specification (e.g., architecture and activation functions), (2) the learning behavior (e.g., loss function and optimizer), (3) the training data generation (e.g., diversity of domain models), and (4) the numeric encoding of the domain model algebra.


References

[1] R. Rai, C. K. Sahu, Driven by Data or Derived Through Physics? A Review of Hybrid Physics Guided Machine Learning Techniques With Cyber-Physical System (CPS) Focus, IEEE Access 8 (2020) 71050–71073. doi:10.1109/ACCESS.2020.2987324.
[2] L. von Rueden, S. Mayer, K. Beckh, B. Georgiev, S. Giesselbach, R. Heese, B. Kirsch, M. Walczak, J. Pfrommer, A. Pick, R. Ramamurthy, J. Garcke, C. Bauckhage, J. Schuecker, Informed Machine Learning - A Taxonomy and Survey of Integrating Prior Knowledge into Learning Systems, IEEE Transactions on Knowledge and Data Engineering (2021). doi:10.1109/TKDE.2021.3079836.
[3] C. Deng, X. Ji, C. Rainey, J. Zhang, W. Lu, Integrating Machine Learning with Human Knowledge, iScience 23 (2020) 101656. URL: https://www.sciencedirect.com/science/article/pii/S2589004220308488. doi:10.1016/j.isci.2020.101656.
[4] A. Karpatne, G. Atluri, J. H. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S. Shekhar, N. Samatova, V. Kumar, Theory-Guided Data Science: A New Paradigm for Scientific Discovery from Data, IEEE Transactions on Knowledge and Data Engineering 29 (2017) 2318–2331. doi:10.1109/TKDE.2017.2720168.
[5] H. D. Gupta, V. S. Sheng, A Roadmap to Domain Knowledge Integration in Machine Learning, in: 2020 IEEE International Conference on Knowledge Graph (ICKG), IEEE, Nanjing, China, 2020, pp. 145–151. doi:10.1109/ICBK50248.2020.00030.
[6] A. Borghesi, F. Baldo, M. Milano, Improving Deep Learning Models via Constraint-Based Domain Knowledge: a Brief Survey, arXiv:2005.10691 [cs, stat] (2020). URL: http://arxiv.org/abs/2005.10691.
[7] T. Dash, S. Chitlangia, A. Ahuja, A. Srinivasan, Incorporating Domain Knowledge into Deep Neural Networks, arXiv:2103.00180 [cs] (2021). URL: http://arxiv.org/abs/2103.00180.
[8] Ç. Gülçehre, Y. Bengio, Knowledge matters: importance of prior information for optimization, The Journal of Machine Learning Research 17 (2016) 226–257.
[9] J. Xu, Z. Zhang, T. Friedman, Y. Liang, G. Broeck, A Semantic Loss Function for Deep Learning with Symbolic Knowledge, in: Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholm, Sweden, 2018, pp. 5502–5511. URL: http://proceedings.mlr.press/v80/xu18h.html.
[10] Z. Hu, Z. Yang, R. Salakhutdinov, X. Liang, L. Qin, H. Dong, E. P. Xing, Deep generative models with learnable knowledge constraints, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Curran Associates Inc., Red Hook, NY, USA, 2018, pp. 10522–10533.
[11] M. Diligenti, S. Roychowdhury, M. Gori, Integrating Prior Knowledge into Deep Learning, in: 16th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, Cancun, Mexico, 2017, pp. 920–923. doi:10.1109/ICMLA.2017.00-37.
[12] N. Muralidhar, M. R. Islam, M. Marwah, A. Karpatne, N. Ramakrishnan, Incorporating Prior Domain Knowledge into Deep Neural Networks, in: 2018 IEEE International Conference on Big Data (Big Data), IEEE, Seattle, WA, USA, 2018, pp. 36–45. doi:10.1109/BigData.2018.8621955.
[13] D. Aguirre, O. Fuentes, Improving Weight Initialization of ReLU and Output Layers, in: I. V. Tetko, V. Kůrková, P. Karpov, F. Theis (Eds.), Artificial Neural Networks and Machine Learning – ICANN 2019: Deep Learning, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2019, pp. 170–184. doi:10.1007/978-3-030-30484-3_15.
[14] Y. LeCun, L. Bottou, G. B. Orr, K. R. Müller, Efficient BackProp, in: G. B. Orr, K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 1998, pp. 9–50. doi:10.1007/3-540-49430-8_2.
[15] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Chia Laguna Resort, Sardinia, Italy, 2010, pp. 249–256. URL: http://proceedings.mlr.press/v9/glorot10a.html.
[16] K. He, X. Zhang, S. Ren, J. Sun, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, in: 2015 IEEE International Conference on Computer Vision (ICCV), IEEE, Santiago, Chile, 2015, pp. 1026–1034. doi:10.1109/ICCV.2015.123.
[17] P. Costa, P. Larzabal, Initialization of Supervised Training for Parametric Estimation, Neural Processing Letters 9 (1999) 53–61. doi:10.1023/A:1018671912219.
[18] D. Mishkin, J. Matas, All you need is a good init, in: Y. Bengio, Y. LeCun (Eds.), 4th International Conference on Learning Representations: Conference Track Proceedings, ICLR, San Juan, Puerto Rico, 2016. URL: http://arxiv.org/abs/1511.06422.
[19] J. Qiao, S. Li, W. Li, Mutual information based weight initialization method for sigmoidal feedforward neural networks, Neurocomputing 207 (2016) 676–683. doi:10.1016/j.neucom.2016.05.054.
[20] G. Li, H. Alnuweiri, Y. Wu, H. Li, Acceleration of back propagation through initial weight pre-training with delta rule, in: IEEE International Conference on Neural Networks, IEEE, San Francisco, CA, USA, 1993, pp. 580–585 vol. 1. doi:10.1109/ICNN.1993.298622.
[21] H. Shimodaira, A weight value initialization method for improving learning performance of the backpropagation algorithm in neural networks, in: Proceedings Sixth International Conference on Tools with Artificial Intelligence (TAI 94), IEEE, New Orleans, LA, USA, 1994, pp. 672–675. doi:10.1109/TAI.1994.346429.
[22] G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (2006) 1527–1554. doi:10.1162/neco.2006.18.7.1527.
[23] H. Larochelle, Y. Bengio, J. Louradour, P. Lamblin, Exploring Strategies for Training Deep Neural Networks, The Journal of Machine Learning Research 10 (2009) 1–40.
[24] S. Z. Seyyedsalehi, S. A. Seyyedsalehi, A fast and efficient pre-training method based on layer-by-layer maximum discrimination for deep neural networks, Neurocomputing 168 (2015) 669–680. doi:10.1016/j.neucom.2015.05.057.
[25] G. G. Towell, J. W. Shavlik, Knowledge-based artificial neural networks, Artificial Intelligence 70 (1994) 119–165. doi:10.1016/0004-3702(94)90105-8.
[26] I. Ivanova, M. Kubat, Initialization of neural networks by means of decision trees, Knowledge-Based Systems 8 (1995) 333–344. doi:10.1016/0950-7051(96)81917-4.
[27] G. Thimm, E. Fiesler, Neural network initialization, in: J. Mira, F. Sandoval (Eds.), From Natural to Artificial Neural Computation, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 1995, pp. 535–542. doi:10.1007/3-540-59497-3_220.
[28] A. Banerjee, Initializing Neural Networks Using Decision Trees, in: Computational Learning Theory and Natural Learning Systems, Volume IV: Making Learning Systems Practical, MIT Press, Cambridge, MA, USA, 1997, pp. 3–15.
[29] R. Setiono, W. K. Leow, On mapping decision trees and neural networks, Knowledge-Based Systems 12 (1999) 95–99. doi:10.1016/S0950-7051(99)00009-X.
[30] R. Balestriero, Neural Decision Trees, arXiv:1702.07360 [cs, stat] (2017). URL: http://arxiv.org/abs/1702.07360.
[31] S. Wang, C. Aggarwal, H. Liu, Using a Random Forest to Inspire a Neural Network and Improving on It, in: Proceedings of the 2017 SIAM International Conference on Data Mining (SDM), SIAM, Houston, Texas, USA, 2017, pp. 1–9. doi:10.1137/1.9781611974973.1.
[32] G. Biau, E. Scornet, J. Welbl, Neural Random Forests, Sankhya A 81 (2019) 347–386. doi:10.1007/s13171-018-0133-y.
[33] K. D. Humbird, J. L. Peterson, R. G. Mcclarren, Deep Neural Network Initialization With Decision Trees, IEEE Transactions on Neural Networks and Learning Systems 30 (2019) 1286–1295. doi:10.1109/TNNLS.2018.2869694.