<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neurosymbolic Learning With Random Forest</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michal Barnišin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lubomír Popelínský</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Informatics, Masaryk University</institution>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ITAT'25: Information Technologies - Applications and Theory</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Neurosymbolic learning combines the representational power of neural networks with the data efficiency and interpretability of symbolic methods. In this thesis, we investigate a hybrid approach using GoogLeNet as a feature extractor and a random forest classifier trained on intermediate neuron activations. Focusing on moderately small image datasets, we show that this combination can improve classification accuracy compared to the neural network alone. Furthermore, we analyze the training time and find that the hybrid model can reach the neural network's peak accuracy in less time, depending on the layer used for feature extraction.</p>
      </abstract>
      <kwd-group>
        <kwd>neurosymbolic learning</kwd>
        <kwd>random forest</kwd>
        <kwd>GoogLeNet</kwd>
        <kwd>feature extraction</kwd>
        <kwd>image classification</kwd>
        <kwd>training efficiency</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The relation between neural networks and decision trees (and random forests) has been studied multiple
times [
        <xref ref-type="bibr" rid="ref5">5, 6</xref>
        ]. The main motivation to combine these methods is to increase the interpretability of the
resulting model, increase accuracy, or adapt NNs for problems with small datasets. [7] presents a well-organised
survey of neural trees, i.e., the different combinations of NNs and DTs.
      </p>
      <p>One common direction uses NNs as feature extractors, feeding intermediate activations into
tree-based models. This allows the NN to learn high-level representations, while the tree model performs
the final classification. For example, [8, 9] trained RFs on the final layer of convolutional neural networks
(CNNs) for medical and aerial image classification, reporting higher accuracy and better robustness,
especially on small datasets.</p>
      <p>Neural-Backed Decision Trees [10] integrate soft decision trees into the architecture by reshaping
the NN’s final layer into a WordNet-aligned hierarchy, yielding interpretable misclassification paths
without sacrificing performance.</p>
      <p>A setup closer to ours is presented in [11], where RFs are trained on full-layer activations from
multiple levels of a CNN, and their outputs are aggregated via voting. This leverages both low-level
and high-level features.</p>
      <p>In contrast to prior work, we systematically evaluate which layers are most effective for hybrid
NN–RF classifiers, aiming for faster training and high accuracy with limited data. We also explore the
possibility of combining different neuron activations from various layers and epochs.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Hybrid Classifier</title>
      <p>We study a hybrid classification model that combines deep neural networks with random forests. The
NN serves as a feature extractor, while the RF acts as the final classifier.</p>
      <p>To train the model (see Algorithm 1), we fine-tune a pretrained GoogLeNet on an image classification
task. During training, we periodically extract activations from five selected layers. These span from
early convolutional outputs to the final logits. The collected activations serve as input features for a
random forest classifier. The training alternates between updating the NN and training the RF on the
neural network’s activations.</p>
      <p>During inference, samples are passed through the NN, and the RF uses the selected activations to
predict the final class label.</p>
      <p>Algorithm 1 Training Hybrid NN–RF Classifier
Require: Training dataset D_train, set of NN layers L, stopping criterion stopping_criteria
1: Initialize neural network N(0)
2: for epoch e = 1 to ∞ do
3:   Train N(e) on D_train, starting from N(e−1), for one epoch
4:   Extract activations A(e) from the layers in L on D_train
5:   Train an RF ℱ on some subsets of ⋃ i∈0…e A(i)
6:   if stopping_criteria(N(e), ℱ) then
7:     return N(e), ℱ
8:   end if
9: end for</p>
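      <p>For concreteness, the following is a minimal Python sketch of this alternating loop, combining PyTorch and scikit-learn. The helper train_one_epoch, the layer handles, and the stopping test are placeholders introduced here for illustration and are not part of the original algorithm; the RF hyperparameters follow Section 3.2.</p>
      <preformat>
import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier


def extract_activations(model, layer, loader, device="cpu"):
    """Run the dataset through the NN and collect the flattened activations of one layer."""
    feats, cache = [], {}
    handle = layer.register_forward_hook(lambda module, inputs, output: cache.update(out=output))
    model.eval()
    with torch.no_grad():
        for images, _ in loader:  # the loader must iterate in a fixed order matching the labels
            model(images.to(device))
            feats.append(cache["out"].flatten(start_dim=1).cpu().numpy())
    handle.remove()
    return np.concatenate(feats)


def train_hybrid(model, layers, loader, labels, stopping_criteria, device="cpu"):
    """Alternate between one NN training epoch and fitting an RF on the current activations."""
    epoch = 0
    while True:
        epoch += 1
        train_one_epoch(model, loader, device)  # placeholder: one Adam / cross-entropy epoch
        feats = np.hstack([extract_activations(model, l, loader, device) for l in layers])
        rf = RandomForestClassifier(n_estimators=100, max_depth=20).fit(feats, labels)
        if stopping_criteria(model, rf, epoch):
            return model, rf
      </preformat>
      <p>At inference time, the same hooks provide the activations of a test sample, and the fitted RF’s predict method returns the class label.</p>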
      <sec id="sec-3-1">
        <title>3.1. Neural Network Setup</title>
        <p>We use the PyTorch implementation of GoogLeNet, selected for its moderate size and compatibility
with freely available GPU environments. The original final layer is replaced with a new fully connected
layer, initialized with Kaiming Uniform. All layers are fine-tuned using the Adam optimizer (learning
rate fixed at 0.001), batch size 64, and cross-entropy loss. Auxiliary classifier heads are removed.</p>
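        <p>A sketch of this setup with torchvision is shown below; the number of classes is a dataset-dependent placeholder, and the weights argument uses the current torchvision enum.</p>
        <preformat>
import torch
import torch.nn as nn
from torchvision import models

# Pretrained GoogLeNet; torchvision strips the auxiliary heads when they are not requested.
model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
model.aux_logits = False
model.aux1 = None  # kept explicit: the auxiliary classifier heads are removed
model.aux2 = None

# Replace the final fully connected layer and re-initialize it with Kaiming Uniform.
num_classes = 10  # e.g. Fashion-MNIST; 26 for EMNIST letters, 100 for CIFAR-100
model.fc = nn.Linear(model.fc.in_features, num_classes)
nn.init.kaiming_uniform_(model.fc.weight)
nn.init.zeros_(model.fc.bias)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
        </preformat>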
        <p>We extract activations from five layers (Table 1) distributed throughout the network to study the
effect of feature depth.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Random Forest Settings</title>
        <p>The RF classifier is implemented using scikit-learn. We fix the number of trees to 100 and the maximum
tree depth to 20, based on preliminary tuning on Fashion-MNIST. These hyperparameters remain
constant across all experiments to ensure consistency. RFs are trained on flattened layer activations
stored during NN training.</p>
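        <p>The corresponding scikit-learn configuration is a single call; X_act and y below are placeholders for the stored, flattened activations of one layer and the training labels.</p>
        <preformat>
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters fixed by the preliminary tuning on Fashion-MNIST.
rf = RandomForestClassifier(n_estimators=100, max_depth=20)
rf.fit(X_act, y)  # X_act: (n_samples, n_neurons) flattened activations stored during NN training
        </preformat>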
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Datasets</title>
        <p>
          We evaluate the hybrid classifier on three small-to-moderate image classification datasets:
• Fashion-MNIST [12]: 10 grayscale clothing classes.
• EMNIST (letters) [13]: 26-class handwritten character grayscale dataset.
        </p>
        <p>• CIFAR-100 [14]: 100-class colored object dataset.</p>
        <p>All datasets are resized to 224×224 pixels. Grayscale images are converted to 3-channel RGB format.
Inputs are normalized using ImageNet statistics [15]. Since we focus on moderately small datasets
only, the accuracy scores and training times are reported for random 1,000- and 10,000-sample subsets,
averaged across multiple random selections.</p>
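        <p>A torchvision preprocessing pipeline matching this description could look as follows; it is shown for Fashion-MNIST, the Grayscale step (which replicates single-channel images to three channels) is omitted for CIFAR-100, and the subsetting via Subset is one possible implementation of the random 1,000-sample selection.</p>
        <preformat>
import torch
from torch.utils.data import Subset
from torchvision import datasets, transforms

# Resize to 224x224, replicate grayscale to 3 channels, normalize with ImageNet statistics [15].
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

full_train = datasets.FashionMNIST("data", train=True, download=True, transform=transform)
indices = torch.randperm(len(full_train))[:1000]  # one random 1,000-sample subset
train_subset = Subset(full_train, indices.tolist())
        </preformat>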
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section presents the empirical findings from our experiments comparing the performance and
training efficiency of hybrid models to those of a fully trained GoogLeNet and a conventional fully
trained Random Forest.</p>
      <p>We evaluate classification accuracy, training time, and performance stability using different neural
network layers and their combinations. The experiments include two specializations of Algorithm 1,
where "some subsets" refers to:
• Single-layer evaluation: A separate RF is trained on activations from a single layer of
GoogLeNet from the most recent epoch to identify the most informative layers (see the sketch after this list).
• Multi-layer and cross-epoch evaluation: Activations from multiple layers and epochs are
aggregated to find minimal configurations achieving strong performance with minimal training
time.</p>
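      <p>The single-layer specialization can be sketched as follows, reusing the extract_activations helper from the sketch in Section 3; selected_layers (a mapping from layer names to modules), train_loader, test_loader, and the label arrays are placeholders.</p>
      <preformat>
from sklearn.ensemble import RandomForestClassifier

# One RF per selected layer, trained on that layer's activations from the most recent epoch.
per_layer_rf = {}
for name, layer in selected_layers.items():
    feats = extract_activations(model, layer, train_loader)
    per_layer_rf[name] = RandomForestClassifier(n_estimators=100, max_depth=20).fit(feats, y_train)

# Compare layers by the held-out accuracy of their RFs.
for name, rf in per_layer_rf.items():
    test_feats = extract_activations(model, selected_layers[name], test_loader)
    print(name, rf.score(test_feats, y_test))
      </preformat>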
      <sec id="sec-4-1">
        <title>4.1. Main Findings</title>
        <p>Our results show that:
1. A random forest trained on activations from a single layer of a NN can exceed the NN’s
classification accuracy, with the effect being more pronounced for deeper layers, at the cost of longer
training time.
2. This hybrid architecture can match the NN’s performance in less time, showing that the approach
is viable in dynamic environments or for prototyping.
3. We also found that full-layer activations are often unnecessary; subsets of a layer’s neurons are equally
informative and allow training-time savings without loss of performance.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Baseline</title>
        <p>To establish a baseline, we evaluated GoogLeNet trained end-to-end using cross-entropy loss and an
RF trained directly on flattened input images. The results are in Table 2.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Single Layer Evaluation</title>
        <p>We evaluated RFs trained on the activations from individual layers of a pre-trained GoogLeNet.
Specifically, for each pre-selected layer, we trained a separate RF to assess how informative different
stages of neural network processing are for downstream tasks. This setup allows us to probe the
representational quality of features extracted at various depths of the network. The performance of
these models is summarized in Table 3.</p>
        <p>Across all datasets, the hybrid NN+RF models consistently achieved higher accuracy, with
improvements ranging from 2% to 6%. These results validate our core hypothesis and align with prior
findings [8, 9, 11], where forests were shown to benefit from intermediate neural features.</p>
        <p>Interestingly, the observed gains in accuracy and training efficiency were strongly dependent on
the selected layer. Shallower layers tended to converge more quickly, but deeper, and often sparser,
layers such as dropout or maxpool4 ultimately delivered better predictive performance. This suggests
that mid-to-deep layers strike a favorable balance between feature richness, dimensionality, and training
time, especially when training a single-layer RF.</p>
        <sec id="sec-4-3-1">
          <title>Fashion-MNIST</title>
        </sec>
        <sec id="sec-4-3-2">
          <title>EMNIST</title>
        </sec>
        <sec id="sec-4-3-3">
          <title>CIFAR-100 NN</title>
        <p>The hybrid approach also exhibited notably lower variance across runs, indicating more stable
training dynamics. As illustrated in Figure 1, the NN+RF method consistently outperformed the NN
baseline at nearly every epoch, maintaining a lead until convergence. This increased stability can
be attributed to the RF’s ability to mitigate the effects of noisy or suboptimal mini-batch selections
during NN training, an issue especially pronounced in smaller datasets. By decoupling the prediction
mechanism from stochastic gradient updates, the RF introduces an ensemble-based smoothing effect
that dampens variance across training seeds.</p>
          <p>Furthermore, the combined NN+RF approach maintains a consistent accuracy advantage across
epochs. Although both the NN and hybrid methods tend to converge toward similar final performance,
the RF-enhanced models often sustain a small but measurable lead. This suggests that RFs are particularly
effective at leveraging the internal representations learned by NNs—sometimes even more so than linear
classifiers typically used at the output layer.</p>
          <p>Taken together, these results demonstrate that using RFs on top of neural features can provide both
accuracy and stability gains, especially in scenarios with limited data or noisy optimization dynamics.</p>
          <p>The approach requires no retraining of the neural network and can be flexibly applied to various
intermediate layers, making it a lightweight and robust enhancement to existing NN pipelines.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Multi-Layer and Cross-Epoch Evaluation</title>
        <p>We further examined whether the random forest could perform well when trained on only a subset of
neuron activations from a pre-trained neural network. All experiments in this section were conducted
on the Fashion-MNIST dataset, using a 1,000-sample subset due to memory constraints. We also
adjusted the RF configuration by setting min_samples_leaf = 5, which encourages generalization by
preventing leaves that fit only a few samples.</p>
        <p>Surprisingly, training on as few as 1–10% of a layer’s total neurons was often sufficient to match
the accuracy obtained using the whole layer (Figure 2). This suggests a high degree of redundancy
in the learned representation, and supports the idea that compact activation subsets can generalize
effectively, particularly in low-data regimes.</p>
        <p>We then explored randomly sampled configurations that draw neurons from multiple layers and
epochs. As shown in Figure 3, most of these configurations outperformed the NN baseline in terms of
accuracy, with training time primarily influenced by the number of epochs and the size of the feature
set. Notably, high-performing configurations often included neurons from at least one later epoch,
reinforcing the benefit of deeper or better-trained features. Poorer configurations, by contrast, tended
to involve only very deep (but not fully trained) layers or very small feature sets (fewer than 100 neurons).</p>
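        <p>A sketch of how one such configuration can be sampled and evaluated is given below; the dict acts, keyed by (layer, epoch) pairs and holding the stored activation matrices, as well as y_train, are placeholder names introduced here.</p>
        <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def sample_configuration(acts, fraction=0.05):
    """Draw a random subset of neurons (columns) from every stored (layer, epoch) activation matrix."""
    columns = []
    for (layer, epoch), A in acts.items():
        k = max(1, int(fraction * A.shape[1]))  # roughly 1-10% of the layer's neurons
        chosen = rng.choice(A.shape[1], size=k, replace=False)
        columns.append(A[:, chosen])
    return np.hstack(columns)

X_subset = sample_configuration(acts)
rf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)
rf.fit(X_subset, y_train)
        </preformat>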
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>We observed that deeper layers yielded better RF performance, in line with how NNs operate: early
layers capture low-level features, while deeper layers encode task-relevant abstractions. Surprisingly,
even randomly chosen subsets of activations provided sufficient signal, suggesting redundancy in
representations and hinting at opportunities for feature selection and dimensionality reduction.</p>
      <p>Several limitations remain. The RFs are trained offline, requiring all activations to be stored, thus
limiting scalability. We also used random feature sampling rather than principled selection methods.
Furthermore, we evaluated only one architecture (GoogLeNet), and did not explore how transferable
the findings are to others. Finally, although RFs offer improved interpretability over NNs, we did not
attempt to extract symbolic rules or explanations from them.</p>
      <p>Importantly, our findings suggest that full convergence of a network may not be necessary if
intermediate representations are already informative—opening the door to faster, hybrid learning pipelines
in constrained environments.</p>
      <p>It is worth mentioning that our approach parallels concept probing from XAI, where intermediate
activations are used to detect specific concepts via simple classifiers [16]. Like concept probes, training
a random forest on a layer’s outputs can reveal which features or abstractions the network has learned,
highlighting the informational content of each layer.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We presented a neurosymbolic approach that combines convolutional neural networks with random
forests by training the latter on intermediate network activations. This hybrid method leverages the
feature extraction capabilities of neural networks and the data efficiency of classical models.</p>
      <p>Experiments across several image classification tasks demonstrated that random forests trained on
deeper neural activations consistently outperformed those trained on raw inputs and, in low-data
regimes, even rivaled the neural networks themselves. Moreover, using only a subset of activations was
often sufficient, reducing training time without sacrificing accuracy.</p>
      <p>These findings highlight that intermediate neural representations carry a significant signal and that
symbolic models like random forests can effectively exploit them. Our results support the broader vision
of neurosymbolic learning: combining neural and symbolic methods can offer favorable trade-offs
between accuracy, interpretability, and computational efficiency.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT in order to: improve writing style.
After using this tool/service, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fernández-Delgado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cernadas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amorim</surname>
          </string-name>
          ,
          <article-title>Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <year>2014</year>
          )
          <fpage>3133</fpage>
          -
          <lpage>3181</lpage>
          . URL: https://jmlr.org/papers/v15/delgado14a.html
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Grinsztajn</surname>
          </string-name>
          , E. Oyallon, G. Varoquaux,
          <article-title>Why do tree-based models still outperform deep learning on typical tabular data?</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Neural Information Processing Systems</source>
          , NIPS '22, Curran Associates Inc.,
          Red Hook, NY, USA,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          ,
          <article-title>Going deeper with convolutions</article-title>
          ,
          <source>in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . doi:10.1109/CVPR.2015.7298594.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Barnišin</surname>
          </string-name>
          ,
          <article-title>Neurosymbolic Learning with Random Forest, Master's thesis</article-title>
          , Masaryk University, Faculty of Informatics, Brno,
          <year>2025</year>
          . URL: https://is.muni.cz/th/ja28l/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Boz</surname>
          </string-name>
          ,
          <article-title>Extracting decision trees from trained neural networks</article-title>
          ,
          <source>in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , KDD '02,
          Association for Computing Machinery, New York, NY, USA,
          <year>2002</year>
          , p.
          <fpage>456</fpage>
          -
          <lpage>461</lpage>
          . URL: https://doi.org/10.1145/775047.775113. doi:10.1145/775047.775113.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] N. Vasilev, Z. Mincheva, V. Nikolov, Decision Tree Extraction Using Trained Neural Network, in: Proceedings of the 9th International Conference on Smart Cities and Green ICT Systems (SMARTGREENS), SCITEPRESS, 2020, pp. 194–200. doi:10.5220/0009351801940200.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] H. Li, J. Song, M. Xue, H. Zhang, J. Ye, L. Cheng, M. Song, A Survey of Neural Trees, arXiv e-prints (2022) arXiv:2209.03415. doi:10.48550/arXiv.2209.03415.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] G.-H. Kwak, C.-w. Park, K.-d. Lee, S.-i. Na, H.-y. Ahn, N.-W. Park, Potential of Hybrid CNN-RF Model for Early Crop Mapping with Limited Input Data, Remote Sensing 13 (2021). URL: https://www.mdpi.com/2072-4292/13/9/1629. doi:10.3390/rs13091629.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] F. Khozeimeh, D. Sharifrazi, N. H. Izadi, et al., RF-CNN-F: Random Forest with Convolutional Neural Network Features for Coronary Artery Disease Diagnosis Based on Cardiac Magnetic Resonance, Scientific Reports 12 (2022) 1–12. doi:10.1038/s41598-022-15374-5.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Wan, L. Dunlap, D. Ho, J. Yin, S. Lee, S. Petryk, S. A. Bargal, J. E. Gonzalez, NBDT: Neural-Backed Decision Tree, in: International Conference on Learning Representations (ICLR), 2021. URL: https://openreview.net/forum?id=mCLVeEpplNE.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] G. Xu, M. Liu, Z. Jiang, D. Söffker, W. Shen, Bearing Fault Diagnosis Method Based on Deep Convolutional Neural Network and Random Forest Ensemble Learning, Sensors 19 (2019) 1088. doi:10.3390/s19051088.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, 2017. URL: https://arxiv.org/abs/1708.07747. arXiv:1708.07747.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] G. Cohen, S. Afshar, J. Tapson, A. van Schaik, EMNIST: an extension of MNIST to handwritten letters, 2017. arXiv:1702.05373.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Technical Report TR-2009, University of Toronto, Toronto, Canada, 2009.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] G. Alain, Y. Bengio, Understanding Intermediate Layers Using Linear Classifier Probes, in: Proceedings of the International Conference on Learning Representations (ICLR) Workshop Track, 2017. URL: https://arxiv.org/abs/1610.01644, originally published as arXiv:1610.01644 (2016).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>