How to Think About Benchmarking Neurosymbolic AI?

Johanna Ott³,*, Arthur Ledaguenel¹,²,*, Céline Hudelot¹ and Mattis Hartwig³,⁴,*

¹ MICS, CentraleSupélec, Université Paris-Saclay, Paris, France
² IRT SystemX, Paris-Saclay, France
³ German Research Centre for Artificial Intelligence (DFKI), Lübeck, Germany
⁴ singularIT GmbH, Leipzig, Germany

NeSy2023: 17th International Workshop on Neural-Symbolic Learning and Reasoning, Siena, Italy
* Corresponding author. These authors contributed equally.
b00782280@essec.edu (J. Ott); arthur.ledaguenel@irt-systemx.fr (A. Ledaguenel); mattis.hartwig@dfki.de (M. Hartwig)

Abstract

Neurosymbolic artificial intelligence is a growing field of research that aims to combine neural networks with symbolic systems, including their respective learning and reasoning capabilities. This hybridization can take many shapes, which adds to the fragmentation of the field and makes it difficult to compare existing approaches. While some efforts have been made in the community to define archetypical means of hybridization, many elements are still missing to establish principled comparisons. Amongst those missing elements are formal and broadly accepted definitions of neurosymbolic tasks and their corresponding benchmarks. In this paper, we start from the specific task of multi-label classification with the integration of propositional background knowledge to illustrate what such a benchmarking framework could look like. Based on the benchmarking of this one granular task, we zoom out and discuss important elements and characteristics of building a full benchmarking suite for more than just one task.

1. Introduction

Neurosymbolic artificial intelligence (AI) is a trending research topic [1]. In general, neurosymbolic AI focuses on bringing together concepts from the logic-focused symbolic world and the neural or connectionist world [2, 3, 4, 5]. The potential of the field is based on the “best of both worlds” perspective, i.e., that by combining neural and symbolic components, the respective strengths are maintained while the weaknesses are minimized. Thus, the objectives are extensive and include, amongst others, improved performance [6, 7, 8, 9], explainability [10, 6, 11, 12, 13, 14] and generalization [15, 10, 11, 12, 13, 9, 16].

Contrasting its promise of generalization, the field of neurosymbolic AI exhibits a progress-hampering level of fragmentation, e.g., in its evaluation and architectural landscapes. There have been several attempts to structure the architectural approaches in the neurosymbolic AI field [17, 2, 18, 19, 20, 21]. In this paper, we focus on the fragmented evaluation landscape, i.e., the tasks, datasets and metrics used to evaluate neurosymbolic systems. Although not the focus of this paper, we believe that further work on a clear, unified architectural taxonomy is needed and that the current ambiguity about the separation of the neural, symbolic, and neurosymbolic worlds adds to the fragmented evaluation landscape.

Previous researchers have highlighted the fragmentation problem and emphasized the need for a more systematic approach to evaluating neurosymbolic AI [22, 19, 1]. Although efforts have been made to tackle this issue [23, 24, 25, 26, 27, 28], they have primarily remained at a narrow and specific level, i.e.,
they propose specific tasks and benchmarks, including datasets and evaluation metrics. Only a few exceptions, such as the panel discussion “The future of (neuro-symbolic) AI” at the IBM Neuro-Symbolic AI Workshop 2022 [29] and the presentation by Madhyastha and the subsequent open discussion at the NeSy2022 conference [22], have addressed the neurosymbolic benchmark fragmentation issue on a level beyond a specific benchmark.

In this position paper, we seek to complement the prior work tackling the fragmented neurosymbolic benchmark landscape by facing the challenge at a higher level, focusing on the question of how to think about benchmarking neurosymbolic AI. We give an example of setting up a specific benchmark for the task of multi-label classification with symbolic background knowledge. We include the thought process of coming up with a formal definition of the task, a suitable dataset, and a selection of metrics. Additionally, we discuss the implications of adding further benchmarks using our proposed thought process and thus contribute to a more principled benchmarking landscape for neurosymbolic AI.

2. Benchmarking neurosymbolic systems on a specific task

A neurosymbolic benchmark can be designed to answer two main questions: “What performance level can neurosymbolic systems reach on a given task?” and “How does the hybridization of neural and symbolic components help on a given task?”. The first question takes an outside view, focusing on observable behavior, while the second question takes an inside view, focusing on the design of agents. The inside and the outside view are two well-known and deeply grounded perspectives in AI research [30]. We agree with Russell that, in general, artificial intelligence should be measured from an outside view. However, answering the second question with the inside view can give further insights into how to design AI agents by understanding how and when to use neurosymbolic architectures. Additionally, it might help direct the research efforts of the neurosymbolic community, because advancements in task performance can be better linked to the architectural setup of the agent.

Hence, in this section, we describe the challenges of benchmarking the task of multi-label classification with symbolic background knowledge so that both questions (inside and outside) can be answered. We cover the formal definition of the task, the underlying dataset, and the metrics. Although a task is not neurosymbolic per se, our chosen task covers elements linked to a neural domain (image classification) and a symbolic domain (background knowledge). This setup makes it relatively straightforward to use agents with a neurosymbolic architecture and is suitable for a neurosymbolic benchmark that answers both questions.

2.1. Task formalism

Setting a formal definition of the task is a necessary preliminary step to compare neurosymbolic systems in a principled way. To be practical, the formalism also has to be comprehensive enough to incorporate diverse datasets (in terms of modality and background knowledge structure) and to avoid a fragmentation of the field into multiple narrower task definitions.

Multi-label classification with background knowledge consists in mapping inputs $x \in \mathbb{R}^d$ to binary labels $\mathbf{y} \in \{0, 1\}^k$ such that these labels satisfy some background knowledge. This background knowledge is expressed as a propositional formula $\alpha$ using symbols from the signature $\mathcal{S} := \{Y_j\}_{1 \le j \le k}$ and logical connectives $\{\neg, \wedge, \vee\}$ with their standard semantics. For lighter notation, we identify a label $\mathbf{y} \in \{0, 1\}^k$ with the propositional valuation mapping each $Y_j$ to $y_j$. Therefore, we write $\mathbf{y} \models \alpha$ if the corresponding valuation models $\alpha$. A dataset for this task is $\mathcal{D} := (x_i, \mathbf{y}_i)_{1 \le i \le n}$ with $x_i \in \mathbb{R}^d$ and $\mathbf{y}_i \in \{0, 1\}^k$, such that all labels in the dataset satisfy the background knowledge, i.e. $\forall 1 \le i \le n, \ \mathbf{y}_i \models \alpha$.

This formalism encompasses standard classification tasks like independent binary classification (where $\alpha = \top$, since every label combination is valid) and multi-category classification (where $\alpha = (\bigvee_{1 \le j \le k} Y_j) \wedge (\bigwedge_{1 \le j < l \le k} (\neg Y_j \vee \neg Y_l))$ enforces that one and only one atom is true at a time).
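To make this formalism concrete, here is a minimal Python sketch, where the nested-tuple encoding of formulas and all function names are our own illustrative choices rather than any established library. It evaluates a propositional formula $\alpha$ against a label vector and enumerates the models of the multi-category constraint above:

```python
from itertools import product

# A formula is a nested tuple: ("var", j), ("not", f), ("and", f1, ..., fn), ("or", f1, ..., fn).
def models(y, alpha):
    """Return True iff the label vector y (a tuple of 0s and 1s) models alpha."""
    op = alpha[0]
    if op == "var":
        return bool(y[alpha[1]])
    if op == "not":
        return not models(y, alpha[1])
    if op == "and":
        return all(models(y, f) for f in alpha[1:])
    if op == "or":
        return any(models(y, f) for f in alpha[1:])
    raise ValueError(f"unknown connective: {op}")

def exactly_one(k):
    """Multi-category constraint: at least one atom is true, no two atoms are true together."""
    at_least_one = ("or", *[("var", j) for j in range(k)])
    at_most_one = ("and", *[("or", ("not", ("var", j)), ("not", ("var", l)))
                            for j in range(k) for l in range(j + 1, k)])
    return ("and", at_least_one, at_most_one)

# All valuations over k = 3 labels that model the multi-category constraint:
alpha = exactly_one(3)
print([y for y in product((0, 1), repeat=3) if models(y, alpha)])
# -> [(0, 0, 1), (0, 1, 0), (1, 0, 0)]
```

In this encoding, the independent binary classification case $\alpha = \top$ can be represented as the empty conjunction `("and",)`, which every label vector models.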
Having formally introduced our task, we now need to discuss the dataset and the metrics to complete our benchmark.

2.2. Datasets

Building an appropriate dataset for multi-label classification with background knowledge poses a substantial challenge. It must contain large amounts of data that are amenable to neural processing and whose labels present significant structure expressible in the language of propositional logic. Efforts to build such datasets were often led by researchers trying to measure the performance of their own neurosymbolic system, meaning that different systems are rarely evaluated on the same datasets and that datasets are often custom-built to fit the capacity of a given system.

We observed three patterns in how datasets were created: symbolic datasets, where a symbolic reasoning task is turned into a learning task (e.g. finding the shortest path in a weighted graph [31]); compositional datasets, where instances are tuples of a base sub-symbolic classification dataset constrained to respect a given structure (e.g. the MNIST SUDOKU dataset [26]); and hierarchical datasets, where classes of a sub-symbolic classification dataset are chosen from a hierarchy of concepts (e.g. classes in ImageNet [32] are chosen amongst synsets of the WordNet hierarchy [33]).

To turn this collection of datasets into an efficient benchmark for multi-label classification with background knowledge, further aspects need to be considered. On a fundamental level, we observe an inverse relation between the complexity of the sub-symbolic features and the complexity of the symbolic structure of a dataset, which means that the zone of complex sub-symbolic features combined with complex symbolic structure is not well covered by existing datasets. ROAD-R [25], a dataset of traffic videos (complex sub-symbolic features) whose labels satisfy a rich set of constraints (complex symbolic structure), constitutes a first step toward covering that void, and more efforts should be invested in that direction. On the practical side, we need to set up a standard on how to represent, store and operate neurosymbolic datasets and their corresponding background knowledge, to allow rapid testing of any system on any dataset; a sketch of what such a standard could look like is given below.
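As a purely illustrative sketch, a dataset could be shipped as a container that bundles the instances with their background knowledge and validates the labels at load time. The class and its fields are our own assumptions, not an existing format, and it reuses the `models` and `exactly_one` functions from the sketch in Section 2.1:

```python
import numpy as np

class NeSyDataset:
    """Hypothetical container bundling inputs, labels and background knowledge.

    `alpha` uses the nested-tuple formula encoding from the sketch in Section 2.1.
    """

    def __init__(self, inputs, labels, alpha):
        self.inputs = np.asarray(inputs, dtype=np.float32)  # shape (n, d)
        self.labels = np.asarray(labels, dtype=np.int8)     # shape (n, k)
        self.alpha = alpha
        # Enforce the defining property of the task: every label models alpha.
        for i, y in enumerate(self.labels):
            if not models(tuple(y), alpha):
                raise ValueError(f"label {i} violates the background knowledge")

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, i):
        return self.inputs[i], self.labels[i]

# Example: a tiny 3-class multi-category dataset with 2-dimensional inputs.
data = NeSyDataset(inputs=[[0.1, 0.2], [0.3, 0.4]],
                   labels=[[0, 0, 1], [1, 0, 0]],
                   alpha=exactly_one(3))
```

Validating labels against $\alpha$ at construction time makes the defining property of the task part of the data format itself, rather than a convention each system has to re-check.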
2.3. Metrics

To evaluate neurosymbolic systems inside our benchmark, we use a combination of performance metrics (the outside view) and control metrics (the inside view). Examples of standard performance metrics are cross-entropy loss, individual accuracy, F1-score, collective accuracy, and top-k accuracy. Standard control metrics include the number of trainable parameters, the number of hyper-parameters, and the number of FLOPs. In addition, new control or performance metrics specific to neurosymbolic tasks might be beneficial. One example of such a performance metric is semantic consistency, which tracks how many predictions of a given system match the constraints expressed by the background knowledge (see [34] or [25] for instance).

To settle on a limited set of metrics for the multi-label classification with background knowledge task (which can also be used for other tasks), we suggest using collective accuracy and semantic consistency as performance metrics and network size (the number of trainable parameters) as a control metric. The semantic consistency metric helps us understand how well a system integrates the background knowledge. Collective accuracy is a very demanding metric that is robust to imbalanced datasets: we generally observe a strong correlation between collective accuracy and F1-score, for instance. Finally, network size is a good first-order approximation of model capacity.
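A minimal sketch of the two suggested performance metrics follows; the function names are ours, and `models` and `exactly_one` come from the sketch in Section 2.1:

```python
import numpy as np

def collective_accuracy(y_pred, y_true):
    """Fraction of instances whose entire label vector is predicted correctly."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean(np.all(y_pred == y_true, axis=1)))

def semantic_consistency(y_pred, alpha):
    """Fraction of predictions that model the background knowledge alpha."""
    return float(np.mean([models(tuple(y), alpha) for y in np.asarray(y_pred)]))

# Example on a 3-class multi-category task:
y_true = [(0, 0, 1), (0, 1, 0), (1, 0, 0)]
y_pred = [(0, 0, 1), (1, 1, 0), (1, 0, 0)]
print(collective_accuracy(y_pred, y_true))           # 2/3: one vector is wrong
print(semantic_consistency(y_pred, exactly_one(3)))  # 2/3: (1, 1, 0) breaks the constraint
```

Note that the two metrics are independent: a prediction can be semantically consistent yet entirely wrong, and an inconsistent prediction can still match most individual labels.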
3. Broadening the focus to a collection of tasks

To extend the thoughts on the specific task from the previous section to cover more of the neurosymbolic AI field, a natural next step is to transfer the approach to more tasks. We draw confidence in the transferability of the proposed thought process from the observation that benchmarks cited in the preceding sections have already incorporated some of our suggestions (e.g. implementing control metrics). Furthermore, existing benchmarks may benefit from our thought process to improve their comparability. For instance, in visual reasoning, the CLEVR [35] and CLEVRER [36] benchmarks do not provide a formal definition of the task, which makes comparisons between systems and with other datasets hard to establish. Moreover, both underlying datasets lack sub-symbolic complexity compared to classic computer vision datasets: the community could greatly benefit from filling that void.

Expanding the focus from a specific task to a collection of tasks, i.e., creating a benchmarking suite, raises another critical question: Which tasks should be included? The diversity of tasks has been identified as a key consideration for a benchmarking suite by the discussion panel in [29]. Potential tasks should cover a range of difficulties for both the neural and the symbolic architecture. Also, similar to the GLUE [37] or GLUECons [38] benchmarking suites, several different capabilities and skills should be needed to solve the tasks.

4. Conclusion

This position paper contributes an example thought process for designing neurosymbolic AI benchmarks. Of course, a single position paper cannot fully solve all questions around building a unified benchmarking system, but, in contrast to other papers in the field so far, we refrained from marketing an individual dataset and focused on the questions around the design phase of a benchmark. We also discussed the implications of broadening the approach to multiple tasks, which will be a valuable starting point for future benchmarking discussions and designs. Next steps could include validating our approach on more tasks and adding further thoughts to the discussion around important characteristics of a more holistic benchmarking suite.

References

[1] K. Hamilton, A. Nayak, B. Bozic, L. Longo, Is neuro-symbolic AI meeting its promise in natural language processing? A structured review, arXiv preprint abs/2202.12205 (2022).
[2] M. K. Sarker, L. Zhou, A. Eberhart, P. Hitzler, Neuro-symbolic artificial intelligence: Current trends, 2021. URL: https://arxiv.org/abs/2105.05330. doi:10.48550/ARXIV.2105.05330.
[3] P. Hitzler, A. Eberhart, M. Ebrahimi, M. K. Sarker, L. Zhou, Neuro-symbolic approaches in artificial intelligence, National Science Review 9 (2022). URL: https://doi.org/10.1093/nsr/nwac035. doi:10.1093/nsr/nwac035.
[4] Z. Susskind, B. Arden, L. K. John, P. Stockton, E. B. John, Neuro-symbolic AI: An emerging class of AI workloads and their characterization, 2021. URL: https://arxiv.org/abs/2109.06133. doi:10.48550/ARXIV.2109.06133.
[5] P. Hitzler, M. K. Sarker, T. R. Besold, A. D. Garcez, S. Bader, H. Bowman, P. Domingos, P. Hitzler, K. U. Kühnberger, L. C. Lamb, P. M. H. V. Lima, L. D. Penning, G. Pinkas, H. Poon, G. Zaverucha, Neural-Symbolic Learning and Reasoning: A Survey and Interpretation, volume 342, 2022. doi:10.3233/FAIA210348.
[6] D. Lyu, F. Yang, B. Liu, S. Gustafson, SDRL: Interpretable and data-efficient deep reinforcement learning leveraging symbolic planning, Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019) 2970–2977. URL: https://ojs.aaai.org/index.php/AAAI/article/view/4153. doi:10.1609/aaai.v33i01.33012970.
[7] D. Demeter, D. Downey, Just add functions: A neural-symbolic language model, in: AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, 2020. doi:10.1609/aaai.v34i05.6264.
[8] F. Yang, D. Lyu, B. Liu, S. Gustafson, PEORL: Integrating symbolic planning and hierarchical reinforcement learning for robust decision-making, in: IJCAI International Joint Conference on Artificial Intelligence, volume 2018-July, 2018. doi:10.24963/ijcai.2018/675.
[9] H. Jiang, S. Gurajada, Q. Lu, S. Neelam, L. Popa, P. Sen, Y. Li, A. G. Gray, LNN-EL: A neuro-symbolic approach to short-text entity linking, in: Annual Meeting of the Association for Computational Linguistics, 2021.
[10] J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, J. Wu, The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision, in: International Conference on Learning Representations, 2019. URL: https://openreview.net/forum?id=rJgMlhRctm.
[11] Y. Feng, X. Yang, X. Zhu, M. A. Greenspan, Neuro-symbolic natural logic with introspective revision for natural language inference, Transactions of the Association for Computational Linguistics 10 (2022) 240–256.
[12] K. Zheng, K.-Q. Zhou, J. Gu, Y. Fan, J. Wang, Z. Li, X. He, X. E. Wang, JARVIS: A neuro-symbolic commonsense reasoning framework for conversational embodied agents, arXiv preprint abs/2208.13266 (2022).
[13] Y. Liang, J. Tenenbaum, T. A. Le, S. N, Drawing out of distribution with neuro-symbolic generative models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 15244–15254. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/6248a3b8279a39b3668a8a7c0e29164d-Paper-Conference.pdf.
[14] B. Finzel, A. Saranti, A. Angerschmid, D. Tafler, B. Pfeifer, A. Holzinger, Generating explanations for conceptual validation of graph neural networks: An investigation of symbolic predicates learned on relevance-ranked sub-graphs, KI - Künstliche Intelligenz 36 (2022) 271–285. doi:10.1007/s13218-022-00781-7.
[15] M. B. Ganapini, M. Campbell, F. Fabiano, L. Horesh, J. Lenchner, A. Loreggia, N. Mattei, F. Rossi, B. Srivastava, K. B. Venable, Combining fast and slow thinking for human-like and efficient decisions in constrained environments, in: International Workshop on Neural-Symbolic Learning and Reasoning, 2022.
[16] X. Chen, C. Liang, A. W. Yu, D. Song, D. Zhou, Compositional generalization via neural-symbolic stack machines, in: Advances in Neural Information Processing Systems, volume 2020-December, 2020.
[17] S. Bader, P. Hitzler, Dimensions of neural-symbolic integration - a structured survey, 2005. URL: https://arxiv.org/abs/cs/0511042. doi:10.48550/ARXIV.CS/0511042.
[18] H. A. Kautz, The third AI summer: AAAI Robert S. Engelmore memorial lecture, AI Magazine 43 (2022) 93–104.
[19] A. d'Avila Garcez, L. C. Lamb, Neurosymbolic AI: the 3rd wave, Artificial Intelligence Review (2023). doi:10.1007/s10462-023-10448-w.
[20] F. van Harmelen, A. ten Teije, A boxology of design patterns for hybrid learning and reasoning systems, Journal of Web Engineering 18 (2019) 97–124.
[21] L. de Raedt, S. Dumančić, R. Manhaeve, G. Marra, From statistical relational to neuro-symbolic artificial intelligence, in: C. Bessiere (Ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, International Joint Conferences on Artificial Intelligence Organization, 2020, pp. 4943–4950. URL: https://doi.org/10.24963/ijcai.2020/688. doi:10.24963/ijcai.2020/688. Survey track.
[22] P. Madhyastha, Towards a benchmark suite for neural-symbolic approaches for learning and reasoning, 2022. URL: https://ijclr22.doc.ic.ac.uk/program/index.html. 16th International Workshop on Neural-Symbolic Learning and Reasoning.
[23] Ö. Yılmaz, A. S. d'Avila Garcez, D. L. Silver, A proposal for common dataset in neural-symbolic reasoning studies, in: NeSy@HLAI, 2016.
[24] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, R. B. Girshick, CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 1988–1997.
[25] E. Giunchiglia, M. C. Stoian, S. Khan, F. Cuzzolin, T. Lukasiewicz, ROAD-R: the autonomous driving dataset with logical requirements, Machine Learning (2023). doi:10.1007/s10994-023-06322-z.
[26] E. Augustine, C. Pryor, C. Dickens, J. Pujara, W. Y. Wang, L. Getoor, Visual Sudoku puzzle classification: A suite of collective neuro-symbolic tasks, in: International Workshop on Neural-Symbolic Learning and Reasoning, 2022.
[27] A. D. Lindström, S. S. Abraham, CLEVR-Math: A dataset for compositional language, visual and mathematical reasoning, volume 3212, 2022.
[28] C. Cornelio, V. Thost, Synthetic datasets and evaluation tools for inductive neural reasoning, in: N. Katzouris, A. Artikis (Eds.), Inductive Logic Programming, Springer International Publishing, Cham, 2022, pp. 57–77.
[29] F. Rossi, H. Kautz, G. Marcus, L. Lamb, L. Kaelbling, Closing, 2022. URL: https://video.ibm.com/recorded/131288165. IBM Neuro-Symbolic AI Workshop.
[30] S. J. Russell, Artificial Intelligence: A Modern Approach, Pearson Education, Inc., 2010.
[31] J. Xu, Z. Zhang, T. Friedman, Y. Liang, G. Van den Broeck, A semantic loss function for deep learning with symbolic knowledge, volume 12, International Machine Learning Society (IMLS), 2018, pp. 8752–8760.
[32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge, International Journal of Computer Vision 115 (2015) 211–252. doi:10.1007/s11263-015-0816-y.
[33] G. A. Miller, WordNet, Communications of the ACM 38 (1995) 39–41. URL: https://dl.acm.org/doi/10.1145/219717.219748. doi:10.1145/219717.219748.
[34] K. Ahmed, S. Teso, K.-W. Chang, G. Van den Broeck, A. Vergari, Semantic probabilistic layers for neuro-symbolic learning, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 29944–29959. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/c182ec594f38926b7fcb827635b9a8f4-Paper-Conference.pdf.
[35] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, R. Girshick, CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2901–2910.
[36] K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, J. B. Tenenbaum, CLEVRER: Collision events for video representation and reasoning, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=HkxYzANYDB.
[37] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 353–355. URL: https://aclanthology.org/W18-5446. doi:10.18653/v1/W18-5446.
[38] H. R. Faghihi, A. Nafar, C. Zheng, R. Mirzaee, Y. Zhang, A. Uszok, A. Wan, T. Premsri, D. Roth, P. Kordjamshidi, GLUECons: A generic benchmark for learning under constraints, arXiv preprint abs/2302.10914 (2023).