<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neurosymbolic Learning With Random Forest</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michal Barnišin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lubomír Popelínský</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Informatics, Masaryk University</institution>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ITAT'25: Information Technologies - Applications and Theory</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Neurosymbolic learning combines the representational power of neural networks with the data efficiency and interpretability of symbolic methods. In this thesis, we investigate a hybrid approach using GoogLeNet as a feature extractor and a random forest classifier trained on intermediate neuron activations. Focusing on moderately small image datasets, we show that this combination can improve classification accuracy compared to the neural network alone. Furthermore, we analyze the training time and find that the hybrid model can reach the neural network's peak accuracy in less time, depending on the layer used for feature extraction.</p>
      </abstract>
      <kwd-group>
        <kwd>neurosymbolic learning</kwd>
        <kwd>random forest</kwd>
        <kwd>GoogLeNet</kwd>
        <kwd>feature extraction</kwd>
        <kwd>image classification</kwd>
        <kwd>training efficiency</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The relation between neural networks and decision trees (and random forests) has been studied multiple
times [
        <xref ref-type="bibr" rid="ref5">5, 6</xref>
        ]. The main motivation to combine these methods is to increase the interpretability of the
resulting model, increase accuracy, or adapt NNs for problems with small datasets. [7] presents a well-organised
survey of neural trees, i.e., the different combinations of NNs and DTs.
      </p>
      <p>One common direction uses NNs as feature extractors, feeding intermediate activations into
tree-based models. This allows the NN to learn high-level representations, while the tree model performs
the final classification. For example, [8, 9] trained RFs on the final layer of convolutional neural networks
(CNNs) for medical and aerial image classification, reporting higher accuracy and better robustness,
especially on small datasets.</p>
      <p>Neural-Backed Decision Trees [10] integrate soft decision trees into the architecture by reshaping
the NN’s final layer into a WordNet-aligned hierarchy, yielding interpretable misclassification paths
without sacrificing performance.</p>
      <p>A setup closer to ours is presented in [11], where RFs are trained on full-layer activations from
multiple levels of a CNN, and their outputs are aggregated via voting. This leverages both low-level
and high-level features.</p>
      <p>In contrast to prior work, we systematically evaluate which layers are most effective for hybrid
NN–RF classifiers, aiming for faster training and high accuracy with limited data. We also explore the
possibility of combining different neuron activations from various layers and epochs.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Hybrid Classifier</title>
      <p>We study a hybrid classification model that combines deep neural networks with random forests. The
NN serves as a feature extractor, while the RF acts as the final classifier.</p>
      <p>To train the model (see Algorithm 1), we fine-tune a pretrained GoogLeNet on an image classification
task. During training, we periodically extract activations from five selected layers. These span from
early convolutional outputs to the final logits. The collected activations serve as input features for a
random forest classifier. The training alternates between updating the NN and training the RF on the
neural network’s activations.</p>
      <p>During inference, samples are passed through the NN, and the RF uses the selected activations to
predict the final class label.</p>
      <p>Algorithm 1 Training Hybrid NN–RF Classifier
Require: Training dataset D_train, set of NN layers L, stopping criterion stopping_criteria
1: Initialize neural network N(0)
2: for epoch e = 1 to ∞ do
3:   Train N(e) on D_train, starting from N(e−1), for one epoch
4:   Extract activations A(e) from the layers in L on D_train
5:   Train an RF ℱ on some subsets of ⋃ i∈0…e A(i)
6:   if stopping_criteria(N(e), ℱ) then
7:     return N(e), ℱ
8:   end if
9: end for</p>
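      <p>For concreteness, the following is a minimal Python sketch of this alternating loop, combining PyTorch and scikit-learn. The helper train_one_epoch, the layer handles, and the stopping test are placeholders introduced here for illustration and are not part of the original algorithm; the RF hyperparameters follow Section 3.2.</p>
      <preformat>
import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier


def extract_activations(model, layer, loader, device="cpu"):
    """Run the dataset through the NN and collect the flattened activations of one layer."""
    feats, cache = [], {}
    handle = layer.register_forward_hook(lambda module, inputs, output: cache.update(out=output))
    model.eval()
    with torch.no_grad():
        for images, _ in loader:  # the loader must iterate in a fixed order matching the labels
            model(images.to(device))
            feats.append(cache["out"].flatten(start_dim=1).cpu().numpy())
    handle.remove()
    return np.concatenate(feats)


def train_hybrid(model, layers, loader, labels, stopping_criteria, device="cpu"):
    """Alternate between one NN training epoch and fitting an RF on the current activations."""
    epoch = 0
    while True:
        epoch += 1
        train_one_epoch(model, loader, device)  # placeholder: one Adam / cross-entropy epoch
        feats = np.hstack([extract_activations(model, l, loader, device) for l in layers])
        rf = RandomForestClassifier(n_estimators=100, max_depth=20).fit(feats, labels)
        if stopping_criteria(model, rf, epoch):
            return model, rf
      </preformat>
      <p>At inference time, the same hooks provide the activations of a test sample, and the fitted RF’s predict method returns the class label.</p>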
      <sec id="sec-3-1">
        <title>3.1. Neural Network Setup</title>
        <p>We use the PyTorch implementation of GoogLeNet, selected for its moderate size and compatibility
with freely available GPU environments. The original final layer is replaced with a new fully connected
layer, initialized with Kaiming Uniform. All layers are fine-tuned using the Adam optimizer (learning
rate fixed at 0.001), batch size 64, and cross-entropy loss. Auxiliary classifier heads are removed.</p>
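        <p>A sketch of this setup with torchvision is shown below; the number of classes is a dataset-dependent placeholder, and the weights argument uses the current torchvision enum.</p>
        <preformat>
import torch
import torch.nn as nn
from torchvision import models

# Pretrained GoogLeNet; torchvision strips the auxiliary heads when they are not requested.
model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
model.aux_logits = False
model.aux1 = None  # kept explicit: the auxiliary classifier heads are removed
model.aux2 = None

# Replace the final fully connected layer and re-initialize it with Kaiming Uniform.
num_classes = 10  # e.g. Fashion-MNIST; 26 for EMNIST letters, 100 for CIFAR-100
model.fc = nn.Linear(model.fc.in_features, num_classes)
nn.init.kaiming_uniform_(model.fc.weight)
nn.init.zeros_(model.fc.bias)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
        </preformat>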
        <p>We extract activations from five layers (Table 1) distributed throughout the network to study the
effect of feature depth.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Random Forest Settings</title>
        <p>The RF classifier is implemented using scikit-learn. We fix the number of trees to 100 and the maximum
tree depth to 20, based on preliminary tuning on Fashion-MNIST. These hyperparameters remain
constant across all experiments to ensure consistency. RFs are trained on flattened layer activations
stored during NN training.</p>
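        <p>The corresponding scikit-learn configuration is a single call; X_act and y below are placeholders for the stored, flattened activations of one layer and the training labels.</p>
        <preformat>
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters fixed by the preliminary tuning on Fashion-MNIST.
rf = RandomForestClassifier(n_estimators=100, max_depth=20)
rf.fit(X_act, y)  # X_act: (n_samples, n_neurons) flattened activations stored during NN training
        </preformat>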
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Datasets</title>
        <p>
          We evaluate the hybrid classifier on three small-to-moderate image classification datasets:
• Fashion-MNIST [12]: 10 grayscale clothing classes.
• EMNIST (letters) [13]: 26-class handwritten character grayscale dataset.
        </p>
        <p>• CIFAR-100 [14]: 100-class colored object dataset.</p>
        <p>All datasets are resized to 224×224 pixels. Grayscale images are converted to 3-channel RGB format.
Inputs are normalized using ImageNet statistics [15]. Since we focus on moderately small datasets
only, the accuracy scores and training times are reported for random 1,000- and 10,000-sample subsets,
averaged across multiple random selections.</p>
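        <p>A torchvision preprocessing pipeline matching this description could look as follows; it is shown for Fashion-MNIST, the Grayscale step (which replicates single-channel images to three channels) is omitted for CIFAR-100, and the subsetting via Subset is one possible implementation of the random 1,000-sample selection.</p>
        <preformat>
import torch
from torch.utils.data import Subset
from torchvision import datasets, transforms

# Resize to 224x224, replicate grayscale to 3 channels, normalize with ImageNet statistics [15].
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

full_train = datasets.FashionMNIST("data", train=True, download=True, transform=transform)
indices = torch.randperm(len(full_train))[:1000]  # one random 1,000-sample subset
train_subset = Subset(full_train, indices.tolist())
        </preformat>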
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section presents the empirical findings from our experiments comparing the performance and
training efficiency of hybrid models to those of a fully trained GoogLeNet and a conventional fully
trained Random Forest.</p>
      <p>We evaluate classification accuracy, training time, and performance stability using different neural
network layers and their combinations. The experiments include two specializations of Algorithm 1,
where "some subsets" refers to:
• Single-layer evaluation: A separate RF is trained on activations from a single layer of
GoogLeNet from the most recent epoch to identify the most informative layers (see the sketch after this list).
• Multi-layer and cross-epoch evaluation: Activations from multiple layers and epochs are
aggregated to find minimal configurations achieving strong performance with minimal training
time.</p>
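      <p>The single-layer specialization can be sketched as follows, reusing the extract_activations helper from the sketch in Section 3; selected_layers (a mapping from layer names to modules), train_loader, test_loader, and the label arrays are placeholders.</p>
      <preformat>
from sklearn.ensemble import RandomForestClassifier

# One RF per selected layer, trained on that layer's activations from the most recent epoch.
per_layer_rf = {}
for name, layer in selected_layers.items():
    feats = extract_activations(model, layer, train_loader)
    per_layer_rf[name] = RandomForestClassifier(n_estimators=100, max_depth=20).fit(feats, y_train)

# Compare layers by the held-out accuracy of their RFs.
for name, rf in per_layer_rf.items():
    test_feats = extract_activations(model, selected_layers[name], test_loader)
    print(name, rf.score(test_feats, y_test))
      </preformat>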
      <sec id="sec-4-1">
        <title>4.1. Main Findings</title>
        <p>Our results show that:
1. A random forest trained on activations from a single layer of a NN can exceed the NN’s
classification accuracy, with the effect being more pronounced for deeper layers, at the cost of longer
training time.
2. This hybrid architecture can match the NN’s performance in less time, showing that the approach
is viable in dynamic environments or for prototyping.
3. We also found that full-layer activations are often unnecessary; subsets of a layer’s neurons are equally
informative and allow training-time savings without loss of performance.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Baseline</title>
        <p>To establish a baseline, we evaluated GoogLeNet trained end-to-end using cross-entropy loss and an
RF trained directly on flattened input images. The results are in Table 2.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Single Layer Evaluation</title>
        <p>We evaluated RFs trained on the activations from individual layers of a pre-trained GoogLeNet.
Specifically, for each pre-selected layer, we trained a separate RF to assess how informative different
stages of neural network processing are for downstream tasks. This setup allows us to probe the
representational quality of features extracted at various depths of the network. The performance of
these models is summarized in Table 3.</p>
        <p>Across all datasets, the hybrid NN+RF models consistently achieved higher accuracy, with
improvements ranging from 2% to 6%. These results validate our core hypothesis and align with prior
findings [8, 9, 11], where forests were shown to benefit from intermediate neural features.</p>
        <p>Interestingly, the observed gains in accuracy and training efficiency were strongly dependent on
the selected layer. Shallower layers tended to converge more quickly, but deeper, and often sparser,
layers such as dropout or maxpool4 ultimately delivered better predictive performance. This suggests
that mid-to-deep layers strike a favorable balance between feature richness, dimensionality, and training
time, especially when training a single-layer RF.</p>
        <sec id="sec-4-3-1">
          <title>Fashion-MNIST</title>
        </sec>
        <sec id="sec-4-3-2">
          <title>EMNIST</title>
        </sec>
        <sec id="sec-4-3-3">
          <title>CIFAR-100 NN</title>
        <p>The hybrid approach also exhibited notably lower variance across runs, indicating more stable
training dynamics. As illustrated in Figure 1, the NN+RF method consistently outperformed the NN
baseline at nearly every epoch, maintaining a lead until convergence. This increased stability can
be attributed to the RF’s ability to mitigate the effects of noisy or suboptimal mini-batch selections
during NN training, an issue especially pronounced in smaller datasets. By decoupling the prediction
mechanism from stochastic gradient updates, the RF introduces an ensemble-based smoothing effect
that dampens variance across training seeds.</p>
          <p>Furthermore, the combined NN+RF approach maintains a consistent accuracy advantage across
epochs. Although both the NN and hybrid methods tend to converge toward similar final performance,
the RF-enhanced models often sustain a small but measurable lead. This suggests that RFs are particularly
effective at leveraging the internal representations learned by NNs—sometimes even more so than linear
classifiers typically used at the output layer.</p>
          <p>Taken together, these results demonstrate that using RFs on top of neural features can provide both
accuracy and stability gains, especially in scenarios with limited data or noisy optimization dynamics.</p>
          <p>The approach requires no retraining of the neural network and can be flexibly applied to various
intermediate layers, making it a lightweight and robust enhancement to existing NN pipelines.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Multi-Layer and Cross-Epoch Evaluation</title>
        <p>We further examined whether the random forest could perform well when trained on only a subset of
neuron activations from a pre-trained neural network. All experiments in this section were conducted
on the Fashion-MNIST dataset, using a 1,000-sample subset due to memory constraints. We also
adjusted the RF configuration by setting min_samples_leaf = 5, which encourages generalization by
preventing leaves that fit only a few samples.</p>
        <p>Surprisingly, training on as few as 1–10% of a layer’s total neurons was often sufficient to match
the accuracy obtained using the whole layer (Figure 2). This suggests a high degree of redundancy
in the learned representation, and supports the idea that compact activation subsets can generalize
effectively, particularly in low-data regimes.</p>
        <p>We then explored randomly sampled configurations that draw neurons from multiple layers and
epochs. As shown in Figure 3, most of these configurations outperformed the NN baseline in terms of
accuracy, with training time primarily influenced by the number of epochs and the size of the feature
set. Notably, high-performing configurations often included neurons from at least one later epoch,
reinforcing the benefit of deeper or better-trained features. Poorer configurations, by contrast, tended
to involve only very deep (but not fully trained) layers or very small feature sets (fewer than 100 neurons).</p>
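        <p>A sketch of how one such configuration can be sampled and evaluated is given below; the dict acts, keyed by (layer, epoch) pairs and holding the stored activation matrices, as well as y_train, are placeholder names introduced here.</p>
        <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def sample_configuration(acts, fraction=0.05):
    """Draw a random subset of neurons (columns) from every stored (layer, epoch) activation matrix."""
    columns = []
    for (layer, epoch), A in acts.items():
        k = max(1, int(fraction * A.shape[1]))  # roughly 1-10% of the layer's neurons
        chosen = rng.choice(A.shape[1], size=k, replace=False)
        columns.append(A[:, chosen])
    return np.hstack(columns)

X_subset = sample_configuration(acts)
rf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)
rf.fit(X_subset, y_train)
        </preformat>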
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>We observed that deeper layers yielded better RF performance, in line with how NNs operate: early
layers capture low-level features, while deeper layers encode task-relevant abstractions. Surprisingly,
even randomly chosen subsets of activations provided sufficient signal, suggesting redundancy in
representations and hinting at opportunities for feature selection and dimensionality reduction.</p>
      <p>Several limitations remain. The RFs are trained offline, requiring all activations to be stored, thus
limiting scalability. We also used random feature sampling rather than principled selection methods.
Furthermore, we evaluated only one architecture (GoogLeNet), and did not explore how transferable
the findings are to others. Finally, although RFs offer improved interpretability over NNs, we did not
attempt to extract symbolic rules or explanations from them.</p>
      <p>Importantly, our findings suggest that full convergence of a network may not be necessary if
intermediate representations are already informative—opening the door to faster, hybrid learning pipelines
in constrained environments.</p>
      <p>It is worth mentioning that our approach parallels concept probing from XAI, where intermediate
activations are used to detect specific concepts via simple classifiers [16]. Like concept probes, training
a random forest on a layer’s outputs can reveal which features or abstractions the network has learned,
highlighting the informational content of each layer.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We presented a neurosymbolic approach that combines convolutional neural networks with random
forests by training the latter on intermediate network activations. This hybrid method leverages the
feature extraction capabilities of neural networks and the data efficiency of classical models.</p>
      <p>Experiments across several image classification tasks demonstrated that random forests trained on
deeper neural activations consistently outperformed those trained on raw inputs and, in low-data
regimes, even rivaled the neural networks themselves. Moreover, using only a subset of activations was
often sufficient, reducing training time without sacrificing accuracy.</p>
      <p>These findings highlight that intermediate neural representations carry a significant signal and that
symbolic models like random forests can effectively exploit them. Our results support the broader vision
of neurosymbolic learning: combining neural and symbolic methods can offer favorable trade-offs
between accuracy, interpretability, and computational efficiency.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT in order to: improve writing style.
After using this tool/service, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fernández-Delgado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cernadas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amorim</surname>
          </string-name>
          ,
          <article-title>Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <year>2014</year>
          )
          <fpage>3133</fpage>
          -
          <lpage>3181</lpage>
          . URL: https://jmlr.org/papers/v15/delgado14a.html
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Grinsztajn</surname>
          </string-name>
          , E. Oyallon, G. Varoquaux,
          <article-title>Why do tree-based models still outperform deep learning on typical tabular data?</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Neural Information Processing Systems</source>
          , NIPS '22, Curran Associates Inc.,
          Red Hook, NY, USA,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          ,
          <article-title>Going deeper with convolutions</article-title>
          ,
          <source>in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . doi:10.1109/CVPR.2015.7298594.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Barnišin</surname>
          </string-name>
          ,
          <article-title>Neurosymbolic Learning with Random Forest, Master's thesis</article-title>
          , Masaryk University, Faculty of Informatics, Brno,
          <year>2025</year>
          . URL: https://is.muni.cz/th/ja28l/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Boz</surname>
          </string-name>
          ,
          <article-title>Extracting decision trees from trained neural networks</article-title>
          ,
          <source>in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , KDD '02,
          Association for Computing Machinery, New York, NY, USA,
          <year>2002</year>
          , p.
          <fpage>456</fpage>
          -
          <lpage>461</lpage>
          . URL: https://doi.org/10.1145/775047.775113. doi:10.1145/775047.775113.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] N. Vasilev, Z. Mincheva, V. Nikolov, Decision Tree Extraction Using Trained Neural Network, in: Proceedings of the 9th International Conference on Smart Cities and Green ICT Systems (SMARTGREENS), SCITEPRESS, 2020, pp. 194–200. doi:10.5220/0009351801940200.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] H. Li, J. Song, M. Xue, H. Zhang, J. Ye, L. Cheng, M. Song, A Survey of Neural Trees, arXiv e-prints (2022) arXiv:2209.03415. doi:10.48550/arXiv.2209.03415.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] G.-H. Kwak, C.-w. Park, K.-d. Lee, S.-i. Na, H.-y. Ahn, N.-W. Park, Potential of Hybrid CNN-RF Model for Early Crop Mapping with Limited Input Data, Remote Sensing 13 (2021). URL: https://www.mdpi.com/2072-4292/13/9/1629. doi:10.3390/rs13091629.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] F. Khozeimeh, D. Sharifrazi, N. H. Izadi, et al., RF-CNN-F: Random Forest with Convolutional Neural Network Features for Coronary Artery Disease Diagnosis Based on Cardiac Magnetic Resonance, Scientific Reports 12 (2022) 1–12. doi:10.1038/s41598-022-15374-5.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Wan, L. Dunlap, D. Ho, J. Yin, S. Lee, S. Petryk, S. A. Bargal, J. E. Gonzalez, NBDT: Neural-Backed Decision Tree, in: International Conference on Learning Representations (ICLR), 2021. URL: https://openreview.net/forum?id=mCLVeEpplNE.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] G. Xu, M. Liu, Z. Jiang, D. Söffker, W. Shen, Bearing Fault Diagnosis Method Based on Deep Convolutional Neural Network and Random Forest Ensemble Learning, Sensors 19 (2019) 1088. doi:10.3390/s19051088.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, 2017. URL: https://arxiv.org/abs/1708.07747. arXiv:1708.07747.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] G. Cohen, S. Afshar, J. Tapson, A. van Schaik, EMNIST: an extension of MNIST to handwritten letters, 2017. arXiv:1702.05373.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Technical Report TR-2009, University of Toronto, Toronto, Canada, 2009.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] G. Alain, Y. Bengio, Understanding Intermediate Layers Using Linear Classifier Probes, in: Proceedings of the International Conference on Learning Representations (ICLR) Workshop Track, 2017. URL: https://arxiv.org/abs/1610.01644, originally published as arXiv:1610.01644 (2016).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>