=Paper=
{{Paper
|id=Vol-3052/short20
|storemode=property
|title=Knowledge Injection via ML-based Initialization of Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-3052/short20.pdf
|volume=Vol-3052
|authors=Lars Hoffmann,,Christian Bartelt,,Heiner Stuckenschmidt
|dblpUrl=https://dblp.org/rec/conf/cikm/HoffmannBS21
}}
==Knowledge Injection via ML-based Initialization of Neural Networks==
Lars Hoffmann¹, Christian Bartelt¹ and Heiner Stuckenschmidt²

¹ University of Mannheim, Institute for Enterprise Systems, L15, 1-6, 68131 Mannheim, Germany
² University of Mannheim, Chair of Artificial Intelligence, B6, 26, 68131 Mannheim, Germany

KINN@CIKM'21: Proceedings of CIKM Workshop on Knowledge Injection in Neural Networks, November 1, 2021, Online Virtual Event
hoffmann@es.uni-mannheim.de (L. Hoffmann); bartelt@es.uni-mannheim.de (C. Bartelt); heiner@informatik.uni-mannheim.de (H. Stuckenschmidt)

Abstract

Despite the success of artificial neural networks (ANNs) for various complex tasks, their performance and training duration heavily rely on several factors. In many application domains these requirements, such as high data volume and quality, are not satisfied. To tackle this issue, different ways of injecting existing domain knowledge into the ANN generation have provided promising results. However, the initialization of ANNs is mostly overlooked in this paradigm and remains an important scientific challenge. In this paper, we present a machine learning framework that enables an ANN to perform a semantic mapping from a well-defined, symbolic representation of domain knowledge to the weights and biases of an ANN with a specified architecture.

Keywords

Knowledge Injection, Neural Networks, Initialization, Machine Learning

1. Introduction

Despite the substantial achievements of artificial neural networks (ANNs), driven by high generalization capability, flexibility, and robustness, their training duration and performance still depend heavily on several factors, such as the network architecture, the loss function, the initialization method, and, most importantly, the available training data. However, in many real-world applications, e.g., in safety-critical systems, there are various issues regarding data collection and generation. In these scenarios, the capabilities of exclusively data-oriented approaches to train an ANN are limited.

To tackle these domain-specific challenges, the concept of integrating or injecting existing domain knowledge into the generation process of ANNs has become increasingly attractive in research and practice, as indicated by several survey papers on machine learning (e.g., [1, 2, 3, 4]) and on deep learning in particular (e.g., [5, 6, 7]). Furthermore, this paradigm bears the potential to mitigate general weaknesses of ANNs, such as slow convergence speed, high data demands, and the risk of getting stuck at local minima or saddle points [8].

Consequently, there is a great variety of approaches in this field focusing on different elements within the ANN generation process. One prominent category targets the learning process by adding domain-specific constraints or loss terms to the cost function, such as [9], [10], [11] and [12]. However, there is little to no research on initializing the weights and biases of an ANN based on domain knowledge. Such knowledge can act as a pointer towards a promising starting point in the optimization landscape. The resulting "warm start" of the learning process can reduce the required training time as well as improve the overall performance. This effect shall be exploited efficiently by the framework presented in this paper.

With this goal in mind, we analyzed the existing collection of network initialization techniques. Aguirre and Fuentes [13] define three groups. "Data-independent" methods are based on randomly drawing samples from different distributions, e.g., LeCun [14], Xavier [15] and He [16]. "Data-dependent" approaches, such as WIPE [17], LSUV [18] and MIWI [19], additionally take statistical properties of the available training data into account. Approaches within the third group, like [20, 21, 22, 23, 24], apply the concept of "pre-training". Their goal is to learn an ANN on a related problem (with sufficient availability of high-quality data) and use it as an initialization for the primary task. Consequently, they are not limited to the actual training data and are thus, to some extent, independent of task-specific data issues. Although none of these approaches explicitly considers domain knowledge, they could be adapted to pre-train an ANN on synthetic data encoding domain knowledge, as also proposed by Karpatne et al. [4]. Such data can be generated, for instance, by simulations or by querying a domain model. But this comes with several disadvantages. First, the data samples must be generated efficiently and must fully represent the domain knowledge, and second, the ANN must be able to learn the contained knowledge. On top of this potential loss of information, the entire process must be repeated every time the domain knowledge or the target network architecture changes.

Replacing this indirect, data-based knowledge transfer from an already existing domain model into an ANN with a direct mapping or transformation can potentially solve these problems. Although they do not relate to knowledge injection, several authors have engineered explicit mapping algorithms for different representations. A prominent example are Decision Trees (DTs), because they also have a graph-based structure. Early work was already performed in the 1990s (e.g., [25, 26, 27, 28, 29]), and the topic has recently been receiving attention again (e.g., [30, 31, 32, 33]). Nevertheless, there are two major shortcomings when such mappings are applied to knowledge injection via initialization. On the one hand, they are model-specific and hard to engineer, which makes them impractical considering the diversity of knowledge representations. On the other hand, they cannot map to arbitrary ANN architectures, which may restrict an ANN's ability to discover new characteristics in the subsequent optimization. This becomes increasingly critical as the gap between the expressed knowledge and the entire task complexity widens.

Instead of engineering such mappings by hand, this paper introduces a machine learning (ML) framework capable of training an ANN to become a semantic mapping from a well-defined, algebraic representation of domain knowledge to a network's weights and biases. We call such a mapping a "Transformation Network" or 𝒯-Net. This data-driven framework can be applied to various model algebras, such as DTs or polynomials, with only slight adaptions. Thereby, it tackles the challenge of variability in domain knowledge representations by transferring the complex mapping generation from humans to machines. Furthermore, an arbitrary network structure can be selected as the 𝒯-Net output, which decouples the complexity of the domain model from that of the target ANN.
2. Framework and Approach

In this section, we give a brief introduction to (1) how the proposed framework trains an ANN to become a semantic mapping (𝒯-Net) from the internals of a given algebraic model to a network's weights and biases, and (2) how to utilize its capabilities for knowledge injection via ANN initialization.

2.1. 𝒯-Net Generation

Before a given task-specific function approximation, also referred to as the domain model, can be injected, a suitable 𝒯-Net must be generated once with the proposed ML framework. This can be done entirely with synthetic data. The overall objective is to maximize the transformation fidelity between the input function and the predicted ANN. A schematic overview of the framework, consisting of three main steps, is shown in Figure 1.

Figure 1: Overview of the proposed machine learning framework for training a transformation ANN (𝒯-Net) in three main steps: function generation, data preparation and 𝒯-Net training.

2.1.1. Algebra Selection and 𝜆-Function Generation

At first, a diverse set of functions Λ in the same algebra as the given task-specific domain model is created. If such a domain model is not already defined, a well-suited algebra for representing the existing domain knowledge needs to be selected and the model generated. This can be done implicitly by pre-training or explicitly by an expert. However, each function 𝜆𝑖 ∈ Λ must operate on the same solution space defined by the overall task to be solved, for instance, a binary classification or regression problem. In addition, each function 𝜆𝑖 requires a set of 𝑁 representative examples 𝒟𝜆𝑖, defined as

\[
\mathcal{D}_{\lambda_i} = \{(x_{i,j},\, y_{i,j})\}_{j=1}^{N}, \qquad i = 1, \dots, |\Lambda|,
\]

where 𝑥𝑖,𝑗 = (𝑥𝑖,𝑗1, 𝑥𝑖,𝑗2, …, 𝑥𝑖,𝑗𝐷) denotes one data point of dimensionality 𝐷 and 𝑦𝑖,𝑗 = 𝜆𝑖(𝑥𝑖,𝑗) is the result of applying the function 𝜆𝑖 to 𝑥𝑖,𝑗. How to generate Λ and how to choose 𝑁 depends on the given context.
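As an illustration of this step, the following Python sketch generates a set Λ of random bivariate polynomials and draws 𝑁 representative samples 𝒟𝜆𝑖 for each of them. The dictionary-based polynomial representation, the coefficient range, and the input domain are assumptions made for this example; the paper does not prescribe a concrete generation procedure.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(42)

def random_polynomial(max_order=8, coeff_range=1.0):
    """Draw one random bivariate polynomial lambda_i, represented as a dict
    that maps exponent pairs (e1, e2) to coefficients."""
    order = int(rng.integers(0, max_order + 1))
    return {(e1, e2): float(rng.uniform(-coeff_range, coeff_range))
            for e1 in range(order + 1)
            for e2 in range(order + 1 - e1)}

def apply_lambda(poly, x):
    """Evaluate lambda_i on a batch of inputs x with shape (N, 2)."""
    y = np.zeros(len(x))
    for (e1, e2), c in poly.items():
        y += c * x[:, 0] ** e1 * x[:, 1] ** e2
    return y

def make_lambda_dataset(num_functions=10_000, n_samples=100):
    """Create the function set Lambda and the representative samples D_lambda_i."""
    functions, samples = [], []
    for _ in range(num_functions):
        poly = random_polynomial()
        x = rng.uniform(-1.0, 1.0, size=(n_samples, 2))  # assumed input domain
        functions.append(poly)
        samples.append((x, apply_lambda(poly, x)))
    return functions, samples
</syntaxhighlight>

Here each polynomial plays the role of one 𝜆-function; for a different algebra (e.g., decision trees), only random_polynomial and apply_lambda would need to be exchanged.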
2.1.2. Data Preparation for 𝒯-Net Training

Before the 𝒯-Net training, the data and 𝜆-functions must be brought into the correct shape, i.e., numeric vectors that an ANN can process. Therefore, an encoding method, denoted as 𝐸𝑛𝑐𝜆, is required. Similarly, a decoding method 𝐷𝑒𝑐𝜇 enables the translation of the returned network weights and biases into an executable ANN 𝜇𝑖. The dataset required for the 𝒯-Net training is defined as

\[
\mathcal{D}_{\mathcal{T}} := \{(\mathit{Enc}_\lambda(\lambda_i),\, \mathcal{D}_{\lambda_i})\}_{i=1}^{|\Lambda|},
\]

where each example is a tuple of the encoded function 𝜆𝑖 and its representative samples 𝒟𝜆𝑖. For clarification, these samples are not the target output of the 𝒯-Net, but they are required for the loss calculation, as described in the next step.
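To make 𝐸𝑛𝑐𝜆 and 𝐷𝑒𝑐𝜇 tangible, the following sketch shows one possible choice for the polynomial case: the encoder flattens the coefficients into a fixed-length vector using a fixed exponent ordering, and the decoder splits a flat parameter vector into weight matrices and bias vectors of a fully connected target network. The exponent ordering, the 2–225–1 architecture, and the tanh hidden activation are illustrative assumptions, not the authors' concrete implementation.

<syntaxhighlight lang="python">
import numpy as np

MAX_ORDER = 8  # assumed upper bound on the polynomial order

# Fixed ordering of all exponent pairs (e1, e2) with e1 + e2 <= MAX_ORDER.
EXPONENTS = [(e1, e2)
             for e1 in range(MAX_ORDER + 1)
             for e2 in range(MAX_ORDER + 1 - e1)]

def enc_lambda(poly):
    """Enc_lambda: map a polynomial (dict of coefficients) to a numeric vector."""
    return np.array([poly.get(exp, 0.0) for exp in EXPONENTS], dtype=np.float32)

def dec_mu(flat_params, layer_sizes=(2, 225, 1)):
    """Dec_mu: split a flat parameter vector into (weights, biases) per layer."""
    params, offset = [], 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        w = flat_params[offset:offset + n_in * n_out].reshape(n_in, n_out)
        offset += n_in * n_out
        b = flat_params[offset:offset + n_out]
        offset += n_out
        params.append((w, b))
    return params

def execute_mu(params, x):
    """Run the decoded ANN mu_i (tanh hidden layers, linear output)."""
    h = x
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)
    w, b = params[-1]
    return h @ w + b
</syntaxhighlight>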
2.1.3. 𝒯-Net Training

After these preparations, the 𝒯-Net is trained. Its weights and biases are adjusted based on the backpropagated prediction error over the training dataset 𝒟𝒯. This error measures how different each input function 𝜆𝑖 is from its currently predicted ANN counterpart 𝜇𝑖. To quantify this difference, a traditional task-specific measure (𝐸𝑟𝑟𝑜𝑟), e.g., categorical cross-entropy, is applied to the true values 𝑦𝑖 = 𝜆𝑖(𝑥𝑖) and the predictions 𝜇𝑖(𝑥𝑖), given the vector of all input samples 𝑥𝑖. The overall optimization goal can be formally described as

\[
\operatorname*{minimize}_{i \in \{1, 2, \dots, |\Lambda|\}} \ \mathit{Error}\big(\boldsymbol{y}_i,\, \mu_i(\boldsymbol{x}_i)\big).
\]

Thus, the 𝒯-Net training aims to maximize the transformation fidelity. By that, we want to enable the 𝒯-Net to generalize to previously unseen 𝜆-functions, making it a capable mapping for this family of functions.
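The key point of this step is that the loss is computed on the outputs of the decoded network 𝜇𝑖 rather than on the predicted parameters themselves, so gradients flow back through 𝐷𝑒𝑐𝜇 into the 𝒯-Net. The following PyTorch sketch illustrates this under the assumptions of the previous examples (regression task, mean squared error as 𝐸𝑟𝑟𝑜𝑟, a 2–225–1 target architecture); the 𝒯-Net architecture, optimizer, and learning rate are arbitrary placeholder choices.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

ENC_DIM = 45                               # length of Enc_lambda for MAX_ORDER = 8
PARAM_DIM = 2 * 225 + 225 + 225 * 1 + 1    # weights and biases of the target ANN

# The T-Net itself: an ANN mapping an encoded domain model to a flat parameter vector.
tnet = nn.Sequential(
    nn.Linear(ENC_DIM, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, PARAM_DIM),
)
optimizer = torch.optim.Adam(tnet.parameters(), lr=1e-3)

def run_decoded_ann(flat_params, x, layer_sizes=(2, 225, 1)):
    """Decode flat_params into mu_i and evaluate it on x (differentiable)."""
    h, offset = x, 0
    for k, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        w = flat_params[offset:offset + n_in * n_out].view(n_in, n_out)
        offset += n_in * n_out
        b = flat_params[offset:offset + n_out]
        offset += n_out
        h = h @ w + b
        if k < len(layer_sizes) - 2:
            h = torch.tanh(h)              # assumed hidden activation
    return h

def training_step(enc_batch, x_batch, y_batch):
    """One T-Net update: Error compares mu_i(x_i) against y_i = lambda_i(x_i)."""
    optimizer.zero_grad()
    loss = 0.0
    for enc, x, y in zip(enc_batch, x_batch, y_batch):
        flat_params = tnet(enc)                      # predicted weights and biases
        y_pred = run_decoded_ann(flat_params, x)     # execute the decoded ANN
        loss = loss + nn.functional.mse_loss(y_pred.squeeze(-1), y)
    loss = loss / len(enc_batch)
    loss.backward()                                  # gradients flow through Dec_mu
    optimizer.step()
    return float(loss)
</syntaxhighlight>

Here enc_batch, x_batch and y_batch are expected to be lists of tensors derived from 𝒟𝒯, i.e., the encoded functions and their representative samples.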
2.2. Knowledge Injection via 𝒯-Net Execution

After the one-time effort of generating a suitable 𝒯-Net, it can instantly initialize ANNs for all possible domain models within the trained function algebra. We only need to pass the encoded representation to the 𝒯-Net and let it predict the initial weights and biases. In its current state, one 𝒯-Net maps to ANNs with a pre-defined specification, i.e., architecture and activation functions. To achieve a high fidelity, it must be assumed that ANNs with this specification are capable of accurately approximating the input functions. However, if changes to the network specification are required, only the 𝒯-Net training must be repeated with an adapted output layer and/or 𝜇-decoding.
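As a usage sketch, and still under the assumptions of the previous examples, injecting a concrete domain model then amounts to a single forward pass through the 𝒯-Net, followed by copying the predicted parameters into a regular task network that is subsequently trained on the task data. The helper below is hypothetical; enc_domain_model stands for 𝐸𝑛𝑐𝜆 applied to the given domain model (e.g., an encoded polynomial approximation).

<syntaxhighlight lang="python">
import torch

def initialize_from_domain_model(tnet, enc_domain_model, layer_sizes=(2, 225, 1)):
    """Warm-start a task ANN from an encoded domain model via the T-Net."""
    with torch.no_grad():
        flat_params = tnet(torch.as_tensor(enc_domain_model))  # one T-Net execution
    # Build the task network and copy the predicted parameters into it (Dec_mu).
    model = torch.nn.Sequential(
        torch.nn.Linear(layer_sizes[0], layer_sizes[1]), torch.nn.Tanh(),
        torch.nn.Linear(layer_sizes[1], layer_sizes[2]),
    )
    offset = 0
    with torch.no_grad():
        for layer in (m for m in model if isinstance(m, torch.nn.Linear)):
            n_in, n_out = layer.in_features, layer.out_features
            w = flat_params[offset:offset + n_in * n_out].view(n_in, n_out)
            offset += n_in * n_out
            b = flat_params[offset:offset + n_out]
            offset += n_out
            layer.weight.copy_(w.T)   # nn.Linear stores its weight as (out, in)
            layer.bias.copy_(b)
    return model                      # ready for subsequent task-specific training
</syntaxhighlight>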
3. Evaluation

In this section, we briefly show that our framework yields promising results in practice, both in terms of the 𝒯-Net mapping fidelity and in terms of the effects of applying it for ANN initialization. To this end, we conducted experiments on polynomials as symbolic domain models to support solving random regression problems with two variables. The 𝒯-Net was trained on 10,000 polynomials with orders between 0 and 8 to find the closest ANN approximations with one hidden layer of 225 neurons.

Without extensive hyperparameter optimization, the 𝒯-Net achieved an average mapping fidelity, quantified by the coefficient of determination (𝑅²), of 0.77 (±0.26) over representative samples on a set of 2,500 test polynomials. Despite the noticeable distance to a perfect mapping (𝑅² = 1), this clearly demonstrates the learning capability of the proposed ML framework.

To investigate the impact of injecting knowledge via 𝒯-Nets on a given ANN task, the training duration and prediction performance were analyzed. A total of 2,500 synthetic regression problems were randomly created, and two ANNs were trained on each problem: for one, the 𝒯-Net predicted the initial weights and biases based on a polynomial approximation of the problem; the other applied the Xavier uniform initializer [15] as a benchmark. Early stopping was used to indicate convergence during the optimization. Besides the different initialization, all other factors and parameters remained the same.
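A comparison of this kind could be set up as in the following sketch, which trains a warm-started and a conventionally initialized copy of the same architecture with a simple early-stopping loop. The patience, learning rate, and epoch budget are arbitrary placeholder values and not the settings used in the reported experiments.

<syntaxhighlight lang="python">
import torch

def train_with_early_stopping(model, x_tr, y_tr, x_val, y_val,
                              patience=10, max_epochs=500, lr=1e-3):
    """Train until the validation MAE stops improving; return (epochs, best MAE)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_mae, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x_tr).squeeze(-1), y_tr)
        loss.backward()
        opt.step()
        with torch.no_grad():
            mae = (model(x_val).squeeze(-1) - y_val).abs().mean().item()
        if mae < best_mae:
            best_mae, best_epoch = mae, epoch
        elif epoch - best_epoch >= patience:
            break                      # early stopping criterion hit
    return epoch, best_mae

# warm_model = initialize_from_domain_model(tnet, encoded_polynomial_approximation)
# cold_model = a copy of the same architecture, re-initialized with
#              torch.nn.init.xavier_uniform_ on each Linear layer's weight
</syntaxhighlight>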
By applying the 𝒯-Net for initialization, the prediction performance increased in 91% of the regression test cases, and 96% of the cases required fewer epochs to converge, i.e., to hit the early stopping criterion, compared to the naive benchmark. More specifically, the training duration was reduced on average by 64%. In addition, the resulting ANN performance in terms of the mean absolute error (MAE) improved on average by 2.7%. Figure 2 illustrates these two benefits in more detail.

Figure 2: Comparison between 𝒯-Net injection and Xavier uniform initialization in terms of (a) training duration and (b) performance.

This condensed evaluation demonstrates the benefits of the proposed framework and emphasizes the potential for knowledge injection into ANNs. A more sophisticated evaluation is currently work in progress.

4. Conclusion

In this paper, we have introduced a novel approach to knowledge injection into ANNs that utilizes existing domain models for initialization. To this end, a semantic mapping from the domain model's internals to weights and biases is applied. Instead of engineering such an explicit mapping by hand, we designed a machine learning framework capable of training an ANN to perform this transformation. We call such a transformation network a 𝒯-Net. Besides the reduction of manual effort, it has the big advantage of decoupling the complexity of the domain model from that of the ANN space.

Based on promising initial experiments, we hypothesize that this framework can generate 𝒯-Nets with sufficient fidelity by appropriately addressing the following aspects: (1) the network specification (e.g., architecture and activation functions), (2) the learning behavior (e.g., loss function and optimizer), (3) the training data generation (e.g., diversity of domain models), and (4) the numeric encoding of the domain model algebra.

References

[1] R. Rai, C. K. Sahu, Driven by Data or Derived Through Physics? A Review of Hybrid Physics-Guided Machine Learning Techniques With Cyber-Physical System (CPS) Focus, IEEE Access 8 (2020) 71050–71073. doi:10.1109/ACCESS.2020.2987324.
[2] L. von Rueden, S. Mayer, K. Beckh, B. Georgiev, S. Giesselbach, R. Heese, B. Kirsch, M. Walczak, J. Pfrommer, A. Pick, R. Ramamurthy, J. Garcke, C. Bauckhage, J. Schuecker, Informed Machine Learning - A Taxonomy and Survey of Integrating Prior Knowledge into Learning Systems, IEEE Transactions on Knowledge and Data Engineering (2021). doi:10.1109/TKDE.2021.3079836.
[3] C. Deng, X. Ji, C. Rainey, J. Zhang, W. Lu, Integrating Machine Learning with Human Knowledge, iScience 23 (2020) 101656. doi:10.1016/j.isci.2020.101656.
[4] A. Karpatne, G. Atluri, J. H. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S. Shekhar, N. Samatova, V. Kumar, Theory-Guided Data Science: A New Paradigm for Scientific Discovery from Data, IEEE Transactions on Knowledge and Data Engineering 29 (2017) 2318–2331. doi:10.1109/TKDE.2017.2720168.
[5] H. D. Gupta, V. S. Sheng, A Roadmap to Domain Knowledge Integration in Machine Learning, in: 2020 IEEE International Conference on Knowledge Graph (ICKG), IEEE, Nanjing, China, 2020, pp. 145–151. doi:10.1109/ICBK50248.2020.00030.
[6] A. Borghesi, F. Baldo, M. Milano, Improving Deep Learning Models via Constraint-Based Domain Knowledge: a Brief Survey, arXiv:2005.10691 [cs, stat] (2020). URL: http://arxiv.org/abs/2005.10691.
[7] T. Dash, S. Chitlangia, A. Ahuja, A. Srinivasan, Incorporating Domain Knowledge into Deep Neural Networks, arXiv:2103.00180 [cs] (2021). URL: http://arxiv.org/abs/2103.00180.
[8] Ç. Gülçehre, Y. Bengio, Knowledge matters: importance of prior information for optimization, The Journal of Machine Learning Research 17 (2016) 226–257.
[9] J. Xu, Z. Zhang, T. Friedman, Y. Liang, G. Broeck, A Semantic Loss Function for Deep Learning with Symbolic Knowledge, in: Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholm, Sweden, 2018, pp. 5502–5511. URL: http://proceedings.mlr.press/v80/xu18h.html.
[10] Z. Hu, Z. Yang, R. Salakhutdinov, X. Liang, L. Qin, H. Dong, E. P. Xing, Deep generative models with learnable knowledge constraints, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, Curran Associates Inc., Red Hook, NY, USA, 2018, pp. 10522–10533.
[11] M. Diligenti, S. Roychowdhury, M. Gori, Integrating Prior Knowledge into Deep Learning, in: 16th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, Cancun, Mexico, 2017, pp. 920–923. doi:10.1109/ICMLA.2017.00-37.
[12] N. Muralidhar, M. R. Islam, M. Marwah, A. Karpatne, N. Ramakrishnan, Incorporating Prior Domain Knowledge into Deep Neural Networks, in: 2018 IEEE International Conference on Big Data (Big Data), IEEE, Seattle, WA, USA, 2018, pp. 36–45. doi:10.1109/BigData.2018.8621955.
[13] D. Aguirre, O. Fuentes, Improving Weight Initialization of ReLU and Output Layers, in: I. V. Tetko, V. Kůrková, P. Karpov, F. Theis (Eds.), Artificial Neural Networks and Machine Learning – ICANN 2019: Deep Learning, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2019, pp. 170–184. doi:10.1007/978-3-030-30484-3_15.
[14] Y. LeCun, L. Bottou, G. B. Orr, K.-R. Müller, Efficient BackProp, in: G. B. Orr, K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 1998, pp. 9–50. doi:10.1007/3-540-49430-8_2.
[15] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Chia Laguna Resort, Sardinia, Italy, 2010, pp. 249–256. URL: http://proceedings.mlr.press/v9/glorot10a.html.
[16] K. He, X. Zhang, S. Ren, J. Sun, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, in: 2015 IEEE International Conference on Computer Vision (ICCV), IEEE, Santiago, Chile, 2015, pp. 1026–1034. doi:10.1109/ICCV.2015.123.
[17] P. Costa, P. Larzabal, Initialization of Supervised Training for Parametric Estimation, Neural Processing Letters 9 (1999) 53–61. doi:10.1023/A:1018671912219.
[18] D. Mishkin, J. Matas, All you need is a good init, in: Y. Bengio, Y. LeCun (Eds.), 4th International Conference on Learning Representations: Conference Track Proceedings, ICLR, San Juan, Puerto Rico, 2016. URL: http://arxiv.org/abs/1511.06422.
[19] J. Qiao, S. Li, W. Li, Mutual information based weight initialization method for sigmoidal feedforward neural networks, Neurocomputing 207 (2016) 676–683. doi:10.1016/j.neucom.2016.05.054.
[20] G. Li, H. Alnuweiri, Y. Wu, H. Li, Acceleration of back propagation through initial weight pre-training with delta rule, in: IEEE International Conference on Neural Networks, IEEE, San Francisco, CA, USA, 1993, pp. 580–585 vol. 1. doi:10.1109/ICNN.1993.298622.
[21] H. Shimodaira, A weight value initialization method for improving learning performance of the backpropagation algorithm in neural networks, in: Proceedings Sixth International Conference on Tools with Artificial Intelligence. TAI 94, IEEE, New Orleans, LA, USA, 1994, pp. 672–675. doi:10.1109/TAI.1994.346429.
[22] G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (2006) 1527–1554. doi:10.1162/neco.2006.18.7.1527.
[23] H. Larochelle, Y. Bengio, J. Louradour, P. Lamblin, Exploring Strategies for Training Deep Neural Networks, The Journal of Machine Learning Research 10 (2009) 1–40.
[24] S. Z. Seyyedsalehi, S. A. Seyyedsalehi, A fast and efficient pre-training method based on layer-by-layer maximum discrimination for deep neural networks, Neurocomputing 168 (2015) 669–680. doi:10.1016/j.neucom.2015.05.057.
[25] G. G. Towell, J. W. Shavlik, Knowledge-based artificial neural networks, Artificial Intelligence 70 (1994) 119–165. doi:10.1016/0004-3702(94)90105-8.
[26] I. Ivanova, M. Kubat, Initialization of neural networks by means of decision trees, Knowledge-Based Systems 8 (1995) 333–344. doi:10.1016/0950-7051(96)81917-4.
[27] G. Thimm, E. Fiesler, Neural network initialization, in: J. Mira, F. Sandoval (Eds.), From Natural to Artificial Neural Computation, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 1995, pp. 535–542. doi:10.1007/3-540-59497-3_220.
[28] A. Banerjee, Initializing Neural Networks Using Decision Trees, in: Computational Learning Theory and Natural Learning Systems: Volume IV: Making Learning Systems Practical, MIT Press, Cambridge, MA, USA, 1997, pp. 3–15.
[29] R. Setiono, W. K. Leow, On mapping decision trees and neural networks, Knowledge-Based Systems 12 (1999) 95–99. doi:10.1016/S0950-7051(99)00009-X.
[30] R. Balestriero, Neural Decision Trees, arXiv:1702.07360 [cs, stat] (2017). URL: http://arxiv.org/abs/1702.07360.
[31] S. Wang, C. Aggarwal, H. Liu, Using a Random Forest to Inspire a Neural Network and Improving on It, in: Proceedings of the 2017 SIAM International Conference on Data Mining (SDM), SIAM, Houston, Texas, USA, 2017, pp. 1–9. doi:10.1137/1.9781611974973.1.
[32] G. Biau, E. Scornet, J. Welbl, Neural Random Forests, Sankhya A 81 (2019) 347–386. doi:10.1007/s13171-018-0133-y.
[33] K. D. Humbird, J. L. Peterson, R. G. Mcclarren, Deep Neural Network Initialization With Decision Trees, IEEE Transactions on Neural Networks and Learning Systems 30 (2019) 1286–1295. doi:10.1109/TNNLS.2018.2869694.