<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Modular Neurosymbolic Approach for Visual Graph Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Eiter</string-name>
          <email>thomas.eiter@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nelson Higuera Ruiz</string-name>
          <email>nelson.ruiz@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Oetsch</string-name>
          <email>johannes.oetsch@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vienna University of Technology (TU Wien)</institution>
          ,
          <addr-line>Favoritenstrasse 9-11, Vienna, 1040</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Images containing graph-based structures are a ubiquitous and popular form of data representation that, to the best of our knowledge, have not yet been considered in the domain of Visual Question Answering (VQA). We use CLEGR, a graph question answering dataset with a generator that synthetically produces vertex-labelled graphs that are inspired by metro networks. Structured information about stations and lines is provided, and the task is to answer natural language questions concerning such graphs. While symbolic methods suffice to solve this dataset, we consider the more challenging problem of taking images of the graphs instead of their symbolic representations as input. Our solution takes the form of a modular neurosymbolic model that combines the use of optical graph recognition for graph parsing, a pretrained optical character recognition neural network for parsing node labels, and answer-set programming, a popular logic-based approach to declarative problem solving, for reasoning. The implementation of the model achieves an overall average accuracy of 73% on the dataset, providing further evidence of the potential of modular neurosymbolic systems in solving complex VQA tasks, in particular, the use and control of pretrained models in this architecture.</p>
      </abstract>
      <kwd-group>
        <kwd>neurosymbolic computation</kwd>
        <kwd>answer-set programming</kwd>
        <kwd>visual question answering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
Visual representations of graph-based structures are a popular way of presenting
information and are ubiquitous in real life and on the internet. Examples include depictions of
transit networks such as metro or train networks, but graphs are truly everywhere. Visual
Question Answering (VQA) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is concerned with inferring the correct answer to a natural
language question in the presence of some visual input such as an image or video. VQA enables
applications in medicine, assistance for blind people, surveillance, and education [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The
questions that arise in the presence of graphs have been of interest to computer scientists since
the early days and are the basis of many complex systems. It is almost surprising that VQA tasks
where the visual input contains a graph have, to the best of our knowledge, not been considered
so far. The contribution of this paper is threefold: (i) we introduce a novel VQA task
concerned with images of graphs, which we call VGQA, (ii) we provide a VGQA dataset, and
(iii) we present a neurosymbolic VGQA approach and thereby create a first baseline.
The implementation and data to reproduce the experiments are available at https://github.com/pudumagico/NSGRAPH.
      </p>
      <p>[Figure 1: (a) image of a metro-style graph; (b) the question “How many stations are between Leauts and Nily?” and its functional program, built from the operations Substract(), Count(), Int(2), ShortestPath(), Station(Leauts), and Station(Nily); (c) excerpt of the symbolic graph description listing nodes (e.g., name, size), edges (station1, station2), and lines (e.g., built).]</p>
      <p>
        VQA is driven by suitable datasets, where the images and questions are either synthetically
generated or handcrafted. A well-known example is the CLEVR [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] dataset involving simple
scenes. The dataset we are using for VQA on graphs is based on CLEGR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a CLEVR-inspired
dataset, with a generator that synthetically produces vertex-labelled graphs that are inspired
by metro networks. Additional structured information about stations and lines, e.g., how big
a station is, whether it is accessible for the disabled, when the line was constructed, etc., is
provided. The task is to answer natural language questions concerning such graphs. For
example, a question may ask for the shortest path between two stations while avoiding those
that have a particular property. An illustration of a graph and a corresponding question is
shown in Fig. 1.
      </p>
      <p>Notably, instances of the CLEGR dataset are provided in symbolic form. While purely
symbolic methods suffice to solve this dataset with ease (we present a particular approach in
this paper), we consider the more challenging problem of taking images of the graphs instead
of their symbolic representations as input. Therefore, we develop a solution for this dataset
in the context of what we call visual graph question answering (VGQA). More specifically, we
consider as input only an image of a graph as given in Fig. 1a with coloured nodes for stations
and edges for connections between them. The colouration represents the metro line to which
stations belong. Each node in the image appears next to a label that represents the name of
the station. For the questions, we only consider those that can be answered with information
that can be found in the image, and we disregard all other symbolic information. The challenges
in solving this VGQA dataset, we call it CLEGR , are threefold: (1) we have to parse the graph
to identify nodes and edges, (2) we have to read and understand the labels and associate them
with nodes of the graph, and (3) we have to understand the question and reason over the
information extracted from the image to answer it accordingly.</p>
      <p>
        Our solution follows a neurosymbolic methodology. Neurosymbolic computation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] seeks to
unify the two main branches in artificial intelligence: statistical machine learning and automated
logic-based reasoning. Modular neurosymbolic systems are those where the aforementioned
parts are connected through some interface to combine their individual strengths.
      </p>
      <p>
        In particular, our solution takes the form of a modular neurosymbolic model that combines
the use of optical graph recognition (OGR) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for graph parsing, a pretrained optical character
recognition (OCR) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] neural network for parsing node labels, and answer-set programming
(ASP) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a popular logic-based approach to declarative problem solving, for reasoning. It
operates in the following manner:
1. First, we use the OGR tool to parse the graph image into an abstract representation,
structuring the information as sets of nodes and edges;
2. We use the OCR algorithm to obtain the text labels and associate them with the closest node;
3. Then, we parse the natural language question using regular expressions, which suffices
for the considered dataset as questions are structured in a rather simple way;
4. Finally, we use an encoding of the semantics of the question as a logic program which,
combined with the parsed graph and the question in symbolic form, is given as input to
an ASP solver, and from its output we obtain the answer to the question.
      </p>
      <p>The implementation of the model achieves an overall average accuracy of 73.03% on the
dataset, providing further evidence of the potential of modular neurosymbolic systems in solving
complex VQA tasks, in particular, the use and control of pretrained models in this architecture.
That our system does not require any training related to a particular set of examples, and hence
solves the dataset in a zero-shot manner, is a practical feature that hints at what may become
the norm as large pretrained models are more than ever available for public use.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Visual Question Answering on Graphs</title>
      <p>
        Graph Question Answering (GQA) is the task of answering a natural language question for a
given graph in symbolic form. The graph consists of nodes and edges, but further attributes of
nodes and edges may be additionally specified. A specific GQA dataset is CLEGR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which
is concerned with graph structures that resemble transit networks like metro lines. Hence,
the nodes correspond to stations and the edges represent lines going between stations. The
questions are ones that are typically asked around mass transit like “How many stops are
between X and Y?”. The dataset is synthetic and comes with a generator that can be used
to produce instances of varying complexity.
      </p>
      <p>Graphs come in the form of a YAML file containing records about attributes of the stations
and lines. Each station has a name, a size, a type of architecture, a level of cleanliness, potentially
disabled access, potentially rail access, and a type of music played. Stations can be described as
relations over the aforementioned attributes. Edges connect stations but additionally have a
colour, a line ID, and a line name. For lines, we have, besides name and ID, a construction year,
a colour, and optional presence of air conditioning.</p>
      <p>Examples of questions from the dataset are:
• Describe {Station} station’s architectural style.
• How many stations are between {Station} and {Station}?
• Which {Architecture} station is adjacent to {Station}?
• How many stations playing {Music} does {Line} pass through?
• Which line has the most {Architecture} stations?</p>
      <p>[Figure 2: Inference process in NSGRAPH for the example question “How many stations are two steps away from Mccloack?”: the vision module performs graph parsing (OGR+OCR), the language module performs question parsing (RegEx), and the reasoning module applies the theory (ASP).]</p>
      <p>
        For a full list of the questions, we refer the reader to the online repository of the dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
The answer to each question is of Boolean type, a number, a list, or a categorical answer.
The questions in the dataset can be represented by functional programs, which allows us to
decompose them into smaller and semantically less complex components. Figure 1 illustrates
an instance of the CLEGR dataset.
      </p>
      <p>Solving instances of the CLEGR dataset is not much of a challenge since all information is
given in symbolic form, and we present a respective method later. But what if the graph is not
given in symbolic form but only as an image, as is commonly the case? We define
Visual Graph Question Answering (VGQA) as a GQA task where the input is a natural language
question on a graph depicted in an image. We can in fact derive a challenging VGQA dataset,
we call it CLEGR , from CLEGR by generating images of the transit graphs. To this end, we
used the generator of the CLEGR dataset that has an option to produce images of the symbolic
graphs.</p>
      <p>Each image depicts stations, their names as labels in their proximity, and lines in different
colours that connect them; an example is given in Fig. 1a. For the VGQA task, we drop all
further symbolic information and consider only the subset of questions that can be answered
with information from the graph image.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Our Neurosymbolic Framework for VQA on Graphs</title>
      <p>We present our solution to VGQA tasks, which we call NSGRAPH. It is a modular neurosymbolic
system whose modules are the typical ones of a VQA system, adapted to our VGQA setting: a
visual module, a language module, and a reasoning module. Figure 2 shows a summary of the
inference process in NSGRAPH.</p>
      <sec id="sec-3-1">
        <title>3.1. Visual Module</title>
        <p>The visual module is used for graph parsing, which consists of two sub-tasks: (i) detection of
nodes and edges, and (ii) detection of labels, i.e., station names.</p>
        <p>
          We employ an optical graph recognition (OGR) system for the first sub-task. In particular,
we use a publicly available OGR script [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] that implements the approach due to Auer et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
The script takes an image as input and outputs the pixel coordinates of each node detected plus
an adjacency matrix that contains the detected edges.
        </p>
        <p>
          For the second sub-task of detecting labels, we use an optical character recognition (OCR)
system; in particular, we use a pretrained neural network called EasyOCR [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to obtain and
structure the information contained in the graph image. The algorithm takes an image as input
and produces the labels as strings together with their coordinates in pixels. We then connect
the detected labels to the closest node found by the OGR system. Thereby, we obtain an abstract
representation of the graph image as relations.
        </p>
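        <p>The association of labels with their closest nodes can be illustrated by the following minimal Python sketch. It is not the code of NSGRAPH: the function name attach_labels, the shapes of the two inputs, the example coordinates, and the choice of Euclidean distance as the notion of closeness are assumptions made purely for illustration.</p>
        <preformat>
```python
import math


def attach_labels(nodes, labels):
    """Map each OCR label to the index of its nearest OGR node.

    nodes  -- list of (x, y) pixel coordinates, one per detected node
    labels -- list of (text, (x, y)) pairs, one per detected label
    Both input shapes are assumptions made for this sketch.
    """
    assignment = {}
    for text, position in labels:
        # math.dist computes the Euclidean distance between two points
        nearest = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], position))
        assignment[text] = nearest
    return assignment


# Made-up coordinates, for illustration only
nodes = [(40, 40), (200, 60), (120, 180)]
labels = [("Leauts", (48, 28)), ("Nily", (210, 75)), ("Raows", (110, 195))]
print(attach_labels(nodes, labels))
```
        </preformat>
        <p>Note that such a greedy assignment may map two labels to the same node when the detected positions are noisy; the sketch does not attempt to resolve such conflicts.</p>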
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Language Module</title>
        <p>The purpose of the language module is to parse the natural language question. It is written in
plain Python and uses regular expressions to capture the variables in each question type. There
are overall 35 different question templates that CLEGR may use to produce a question instance
by replacing variables with names or attributes of stations, lines, or connections. Examples of
question templates were already given in the previous section.</p>
        <p>For illustration, the question template “How many stations are on the shortest path between
S1 and S2?” may be instantiated by replacing S1 and S2 with station names that appear in
the graph. We use regular expressions to capture those variables and furthermore translate
the natural language question into a functional program that represents the semantics of the
question by a tree of operations to answer the question. Continuing our example, we translate
the template described above into the program
end(3). countNodesBetween(2). shortestPath(1). station(0,S1). station(0,S2).
where the first numerical argument of each predicate imposes the order of execution of the
associated operation and links the input of one operation to the output of the previous one. We
can interpret this functional program as follows: the inputs to the shortest-path operation are
the two station names S1 and S2. Its output is the set of stations on the shortest path between S1 and
S2, which are counted in the next step. The predicate end represents the end of computation to
yield this number as the answer to the question.</p>
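        <p>The translation of this one template can be sketched in Python as follows. This is an illustrative sketch rather than the actual language module: the pattern SHORTEST_PATH_COUNT and the function to_functional_program are hypothetical names, and we assume single-word station names that are lower-cased so that they form valid ASP constants.</p>
        <preformat>
```python
import re

# Hypothetical pattern for a single template; the dataset has 35 templates.
SHORTEST_PATH_COUNT = re.compile(
    r"How many stations are on the shortest path between (.+) and (.+)\?"
)


def to_functional_program(question):
    """Return the functional program as a list of facts, or None if no match."""
    match = SHORTEST_PATH_COUNT.fullmatch(question.strip())
    if match is None:
        return None
    # Assumption: lower-case the names so they are constants, not ASP variables
    s1, s2 = (name.strip().lower() for name in match.groups())
    return [
        "end(3).",
        "countNodesBetween(2).",
        "shortestPath(1).",
        f"station(0,{s1}).",
        f"station(0,{s2}).",
    ]


question = "How many stations are on the shortest path between Leauts and Nily?"
print(" ".join(to_functional_program(question)))
```
        </preformat>
        <p>A question that does not match the pattern yields None, so that further templates could be tried in turn.</p>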
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Reasoning Module</title>
        <p>The third module consists of an ASP program that implements the semantics of the operations
from the functional program of the question. Before we explain this reasoning component, we
briefly review the basics of ASP.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Answer-Set Programming</title>
          <p>
            Answer-Set Programming (ASP) [
            <xref ref-type="bibr" rid="ref11 ref12 ref8">8, 11, 12</xref>
            ] is a declarative logic-based approach for combinatorial
search and optimisation problems with roots in knowledge representation and reasoning. It
offers a simple modelling language and efficient solvers<sup>1</sup>. To solve a problem with ASP, the
search space and properties of problem solutions are described by means of a logic program
such that its models, called answer sets, encode the problem’s solutions.
          </p>
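          <p>An ASP program is a set of rules of the form
            <disp-formula><tex-math>a_1 \mid \cdots \mid a_m \;\texttt{:-}\; b_1, \ldots, b_n,\ \mathit{not}\ c_1, \ldots, \mathit{not}\ c_k</tex-math></disp-formula>
where all <inline-formula><tex-math>a_i</tex-math></inline-formula>, <inline-formula><tex-math>b_j</tex-math></inline-formula>, <inline-formula><tex-math>c_l</tex-math></inline-formula> are first-order literals and <italic>not</italic> is default negation. The set of atoms left of :-
is the head of the rule, while the atoms to the right form the body. Intuitively, whenever all <inline-formula><tex-math>b_j</tex-math></inline-formula>
are true, and there is no evidence for any <inline-formula><tex-math>c_l</tex-math></inline-formula>, then at least some <inline-formula><tex-math>a_i</tex-math></inline-formula> has to be true.</p>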
          <p>A rule with an empty body and a single head atom without variables is a fact and is always
true. A rule with an empty head is a constraint and is used to exclude models that would satisfy
the body.</p>
          <p>ASP provides further language constructs like aggregates and weak (also called soft)
constraints, whose violation should only be avoided. For a comprehensive coverage of the ASP
language and its semantics, we refer to the language standard [13].</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Question Encoding</title>
          <p>The symbolic representations obtained from the language and visual modules are first translated
into ASP facts. The functional program from a question is already in a fact format. The graph
is likewise translated into binary atoms edge/2 and unary atoms station/1. These facts, combined
with an ASP program that encodes the semantics of all CLEGR question templates, can then be
used to compute the answer with an ASP solver.</p>
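          <p>As a rough illustration of this translation step, the following Python sketch emits station/1 and edge/2 facts from a parsed graph. The function graph_to_facts and its input format (a list of station names plus a 0/1 adjacency matrix) are assumptions for this sketch; in particular, whether each undirected edge is written once or in both directions in the actual encoding is not specified here.</p>
          <preformat>
```python
def graph_to_facts(names, adjacency):
    """Emit station/1 and edge/2 facts for a parsed graph.

    names     -- station name per node index (assumed lower-case constants)
    adjacency -- square 0/1 matrix of detected edges, assumed symmetric
    """
    facts = [f"station({name})." for name in names]
    for i, row in enumerate(adjacency):
        for j, connected in enumerate(row):
            # Emit each undirected edge once, using the upper triangle only
            if connected and j > i:
                facts.append(f"edge({names[i]},{names[j]}).")
    return facts


# Made-up three-station line: leauts - raows - nily
names = ["leauts", "raows", "nily"]
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
print("\n".join(graph_to_facts(names, adjacency)))
```
          </preformat>
          <p>The resulting facts could then be concatenated with the facts of the functional program and passed to an ASP solver together with the encoding.</p>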
          <p>We present an excerpt of the ASP program that implements the semantics of the functional
program end(3), countNodesBetween(2), shortestPath(1), station(0,s), station(0,t) from
above. These atoms, together with facts for edges and nodes, serve as input to the ASP encoding
for computing the answer as they only appear in rule bodies:</p>
          <p>The first rule expresses that if we see shortestPath(T) in the input, then we want to compute
the shortest path between stations S1 and S2. The actual shortest path is produced by rule 2,
which non-deterministically decides for every edge if this edge is part of the shortest path.
Rules 3 and 4 jointly define the transitive closure of this path relation, and the constraint in
line 5 enforces that station S1 is reachable from S2. Line 7 is a weak constraint that minimises
the number of edges that are selected for the shortest path (without it, we would not get a shortest
path at all). This number of edges is calculated by rule 6 using an aggregate expression for
counting. Finally, rule 8 calculates the number of stations on the shortest path, and rule 9 defines
the answer to the question as that number. We omit the full encoding for space reasons, but it
is part of the online repository of this project (https://github.com/pudumagico/NSGRAPH).</p>
          <p><sup>1</sup> See, for example, www.potassco.org or www.dlvsystem.com.</p>
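          <p>To make the intended result of these rules concrete, the following Python sketch computes a shortest path procedurally with breadth-first search. It deliberately swaps the declarative guess-and-minimise ASP encoding for a different technique and is therefore only a reference for the expected output, not a rendering of the encoding. The function shortest_path and the example edges are hypothetical, and whether the final count includes the two end stations is left open here, so both variants are printed.</p>
          <preformat>
```python
from collections import deque


def shortest_path(edges, source, target):
    """Return one shortest path as a list of stations, or None if unreachable."""
    neighbours = {}
    for a, b in edges:  # edges are treated as undirected
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    parent = {source: None}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            # Walk back through the parent links to rebuild the path
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for nxt in neighbours.get(current, []):
            if nxt not in parent:
                parent[nxt] = current
                queue.append(nxt)
    return None


# Made-up graph with two routes from s to t; the route via c is shorter
edges = [("s", "a"), ("a", "b"), ("b", "t"), ("s", "c"), ("c", "t")]
path = shortest_path(edges, "s", "t")
print(path)
print(len(path))      # stations on the path, end stations included
print(len(path) - 2)  # stations strictly between the end stations
```
          </preformat>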
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We experimentally evaluated NSGRAPH on GQA and VGQA tasks as described in Section 2.
In particular, we generated graphs that fall into three categories: tiny (3 lines and at most 4
stations per line), small (4 lines and at most 6 stations per line), and medium (5 lines and at most 8
stations per line). We generated 100 graphs of each size accompanied by 10 questions per graph,
with a median of 10 nodes and 8 edges for tiny graphs, 15 nodes and 15 edges for small graphs,
and 24 nodes and 26 edges for medium ones. Figure 3 shows three graphs, one of each size.</p>
      <p>NSGRAPH achieves 100% accuracy on the original GQA task, i.e., with graphs in symbolic form as
input and with the unrestricted set of questions. Here, the symbolic input is translated directly
into ASP facts without the need to parse an image.</p>
      <p>For the more challenging VGQA task, we summarise the results in Table 1<sup>2</sup>. The task gets
more difficult with increasing size of the graphs, but we still achieve an overall accuracy of 73%.
As we also consider settings where we replace the OCR or the OGR module with the ground
truth as input, we are able to pinpoint the OGR as the main reason for wrong answers. The
average run time to answer a question was 5.86s for tiny graphs, 14s for small graphs, and
35.33s for medium graphs. NSGRAPH is the first baseline for this VGQA dataset, and further
improvements are certainly possible; e.g., thanks to the modularity of NSGRAPH, stronger
OGR systems could be used.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>
        Our approach follows ideas from previous work [15], where we introduced a similar
neurosymbolic method for VQA in the context of the CLEVR dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As here, the reasoning component
there is logic-based and implemented in ASP. This work was in turn inspired by NSVQA [16]
that uses a combination of RCNN [17] for object detection, an LSTM [18] for natural language
parsing, and Python as a symbolic executor to infer the answer. In this related work, the visual
modules are trained for the dataset, while here we use merely a pretrained network. We also
mention in this context the neural and end-to-end trainable system MAC [19] that achieves very
promising results on VQA datasets. A very recent approach that combines large pre-trained
models for images and text with symbolic execution in Python is ViperGPT [20];
complicated images of graphs would presumably require some fine-tuning for that approach.
      </p>
      <p>A characteristic feature of NSGRAPH is that we use ASP for reasoning. Outside the context of
VQA, ASP has been applied to various neurosymbolic tasks like validation of electric panels [21],
segmentation of laryngeal images [22], discovering rules that explain sequences of sensory
input [23], or query answering with large language models [24]. There are also systems that
can be used for neurosymbolic learning, e.g., by employing the semantic loss [25], which means
they use the information produced by the reasoning module to enhance the learning tasks of
the neural networks involved [26, 27].</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Future Work</title>
      <p>We introduced a novel VQA problem that we call visual graph question answering (VGQA),
where the input is an image of a graph along with a natural language question. Also, we
introduced a respective dataset for this task that is based on an existing one for graph question
answering on transit networks, and we presented NSGRAPH, a modular neurosymbolic model
for VGQA that combines neural components for graph parsing and symbolic reasoning with
ASP for question answering. We evaluated NSGRAPH on the VGQA dataset and thus created a
first baseline. The advantages of a modular architecture are that the solution is transparent,
interpretable, and easier to debug, and that components can be replaced with better ones over time,
in contrast to more monolithic end-to-end trained models. Our system notably relies on
pretrained components and thus requires no additional training for the VGQA task. With the
advent of large pretrained models for language and images like GPT-4 [28] or CLIP [29], such
architectures, where symbolic systems are used to control and connect neural ones, may be
seen more frequently.</p>
      <p><sup>2</sup> We ran the experiments on a computer with 32GB RAM, a 12th Gen Intel Core i7-12700K, and an NVIDIA GeForce
RTX 3080 Ti, and we used clingo (v. 5.6.2) [14] as ASP solver.</p>
      <p>As this is work in progress, quite a number of topics remain for future work. While we
advocate neurosymbolic methods, we would also like to see a comparison with purely neural
end-to-end trained methods for VGQA. The performance of NSGRAPH is promising but also
leaves room for improvement: We want to look into better alternatives for the visual module
which is currently the limiting factor. Also, our approach to parse questions with regular
expressions will not generalise well, and a large language model could be adopted for this
purpose instead. The VGQA dataset is based on random graphs which do not always resemble
real transit networks. We plan to improve on this, e.g., by using images of real metro maps.
      </p>
      <p>Lectures on Artificial Intelligence and Machine Learning, Morgan &amp; Claypool Publishers, 2012. doi:10.2200/S00457ED1V01Y201211AIM019.</p>
      <p>[13] F. Calimeri, W. Faber, M. Gebser, G. Ianni, R. Kaminski, T. Krennwallner, N. Leone, M. Maratea, F. Ricca, T. Schaub, ASP-Core-2 input language format, Theory Pract. Log. Program. 20 (2020) 294–309. doi:10.1017/S1471068419000450.</p>
      <p>[14] M. Gebser, R. Kaminski, B. Kaufmann, M. Ostrowski, T. Schaub, P. Wanko, Theory Solving Made Easy with Clingo 5, in: Technical Communications of the 32nd International Conference on Logic Programming (ICLP 2016), volume 52 of OASIcs, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016, pp. 2:1–2:15. doi:10.4230/OASIcs.ICLP.2016.2.</p>
      <p>[15] T. Eiter, N. Higuera, J. Oetsch, M. Pritz, A neuro-symbolic ASP pipeline for visual question answering, Theory Pract. Log. Program. 22 (2022) 739–754. doi:10.1017/S1471068422000229.</p>
      <p>[16] K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, J. Tenenbaum, Neural-symbolic VQA: disentangling reasoning from vision and language understanding, in: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 2018, pp. 1039–1050.</p>
      <p>[17] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, 2015, pp. 91–99.</p>
      <p>[18] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780. doi:10.1162/neco.1997.9.8.1735.</p>
      <p>[19] D. A. Hudson, C. D. Manning, Compositional attention networks for machine reasoning, in: 6th International Conference on Learning Representations (ICLR 2018), OpenReview.net, 2018.</p>
      <p>[20] D. Surís, S. Menon, C. Vondrick, ViperGPT: Visual Inference via Python Execution for Reasoning, CoRR abs/2303.08128 (2023). doi:10.48550/arXiv.2303.08128. arXiv:2303.08128.</p>
      <p>[21] V. Barbara, D. Buelli, M. Guarascio, S. Ierace, S. Iiritano, G. Laboccetta, N. Leone, G. Manco, V. Pesenti, A. Quarta, F. Ricca, E. Ritacco, A loosely-coupled neural-symbolic approach to compliance of electric panels, in: Proceedings of the 37th Italian Conference on Computational Logic, volume 3204 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 247–253.</p>
      <p>[22] P. Bruno, F. Calimeri, C. Marte, M. Manna, Combining deep learning and ASP-based models for the semantic segmentation of medical images, in: Proceedings of the 5th International Joint Conference on Rules and Reasoning, RuleML+RR 2021, volume 12851 of Lecture Notes in Computer Science, Springer, 2021, pp. 95–110. doi:10.1007/978-3-030-91167-6_7.</p>
      <p>[23] R. Evans, J. Hernández-Orallo, J. Welbl, P. Kohli, M. J. Sergot, Making sense of sensory input, Artif. Intell. 293 (2021) 103438. doi:10.1016/j.artint.2020.103438.</p>
      <p>[24] A. Rajasekharan, Y. Zeng, P. Padalkar, G. Gupta, Reliable natural language understanding with large language models and answer set programming, CoRR abs/2302.03780 (2023). doi:10.48550/arXiv.2302.03780. arXiv:2302.03780.</p>
      <p>[25] J. Xu, Z. Zhang, T. Friedman, Y. Liang, G. V. den Broeck, A semantic loss function for deep learning with symbolic knowledge, in: Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 5498–5507.</p>
      <p>[26] Z. Yang, A. Ishay, J. Lee, NeurASP: Embracing neural networks into answer set programming, in: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, ijcai.org, 2020, pp. 1755–1762. doi:10.24963/ijcai.2020/243.</p>
      <p>[27] R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, L. D. Raedt, DeepProbLog: Neural probabilistic logic programming, in: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 2018, pp. 3753–3763.</p>
      <p>[28] OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.</p>
      <p>[29] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: Proceedings of the 38th International Conference on Machine Learning, ICML 2021, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 8748–8763.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Antol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <article-title>VQA: visual question answering</article-title>
          ,
          <source>in: 2015 IEEE International Conference on Computer Vision</source>
          , ICCV
          <year>2015</year>
          , IEEE Computer Society,
          <year>2015</year>
          , pp.
          <fpage>2425</fpage>
          -
          <lpage>2433</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ICCV.2015.279</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Barra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bisogni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Marsico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ricciardi</surname>
          </string-name>
          ,
          <article-title>Visual question answering: Which investigated applications?</article-title>
          ,
          <source>Pattern Recognit. Lett.</source>
          <volume>151</volume>
          (
          <year>2021</year>
          )
          <fpage>325</fpage>
          -
          <lpage>331</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.patrec.2021.09.008</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>van der Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning</article-title>
          ,
          <source>in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017</source>
          , IEEE Computer Society,
          <year>2017</year>
          , pp.
          <fpage>1988</fpage>
          -
          <lpage>1997</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/CVPR.2017.215</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jefferson</surname>
          </string-name>
          ,
          <article-title>CLEVR graph: A dataset for graph question answering</article-title>
          ,
          <year>2018</year>
          . URL: https://github.com/Octavian-ai/clevr-graph.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Raedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumancic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Manhaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marra</surname>
          </string-name>
          ,
          <article-title>From statistical relational to neurosymbolic artificial intelligence</article-title>
          ,
          <source>in: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020</source>
          , ijcai.org,
          <year>2020</year>
          , pp.
          <fpage>4943</fpage>
          -
          <lpage>4950</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bachmaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Brandenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gleißner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Reislhuber</surname>
          </string-name>
          ,
          <article-title>Optical graph recognition</article-title>
          ,
          <source>in: Graph Drawing</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2013</year>
          , pp.
          <fpage>529</fpage>
          -
          <lpage>540</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Sabu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>A survey on various optical character recognition techniques</article-title>
          ,
          <source>in: 2018 Conference on Emerging Devices and Smart Systems (ICEDSS)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>152</fpage>
          -
          <lpage>155</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ICEDSS.2018.8544323</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brewka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Eiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Truszczynski</surname>
          </string-name>
          ,
          <article-title>Answer set programming at a glance</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>54</volume>
          (
          <year>2011</year>
          )
          <fpage>92</fpage>
          -
          <lpage>103</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/2043174.2043195</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chodziutko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nowakowski</surname>
          </string-name>
          ,
          <article-title>Optical Graph Recognition (OGR) - script</article-title>
          ,
          <year>2020</year>
          . URL: https://github.com/praktyka-zawodowa-2020/optical_graph_recognition.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <collab>Jaided AI</collab>
          , EasyOCR,
          <year>2022</year>
          . URL: https://github.com/JaidedAI/EasyOCR.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lifschitz</surname>
          </string-name>
          ,
          <source>Answer Set Programming</source>
          , Springer,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gebser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaminski</surname>
          </string-name>
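          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kaufmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schaub</surname>
          </string-name>
          , Answer Set Solving in Practice, Synthesis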
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>