Combining Knowledge Graphs and Language Models to Answer Questions over Tables

Judith Knoblach1,†, Nikhil Acharya1,†, Bhavya Koranemkattil1,†, Andreas Both2,3 and Diego Collarana1,4,∗

1 Fraunhofer Institute for Intelligent Analysis and Information Systems, Dresden, Germany
2 Leipzig University of Applied Sciences, Leipzig, Germany
3 DATEV eG, Nuremberg, Germany
4 Universidad Privada Boliviana, Cochabamba, Bolivia

Abstract
Tables remain a primary modality for organizing and presenting information. We interact daily with Excel sheets, CSV files, tables in PDF documents, and web tables. Providing a natural-language interface to query tabular information is therefore valuable for many use cases. This demo presents a solution for querying semantically described tables with natural-language questions. Our solution employs knowledge graphs as a medium to integrate tables from heterogeneous sources; a transformer-based language model then analyzes a user's question and finds the answer in the semantically represented tables. During the demo session, we will show a use case developed in collaboration with DATEV eG, in which tax consultants efficiently query information from financial tables. Attendees will experience how a natural-language interface speeds up information retrieval from tables, and they will be able to ask their own questions against a prepared dataset, demonstrating the scalability of our solution. The video demo is available at https://owncloud.fraunhofer.de/index.php/s/uXFmUfzCta70rqN.

Keywords
Knowledge Graphs, Transformers, Language Models

1. Introduction

Companies still use tables as the main modality to present information to employees. For example, DATEV eG is a German company mainly providing large-scale business software (e.g., accounting). Its services are widely used by tax consultants, lawyers, auditors, small and medium-sized enterprises, municipalities, start-ups, and many more.
More than two million German companies use financial accounting programs from DATEV, interacting with hundreds of tables daily. To continue this success story and remain competitive, DATEV relies on employees who are experts in their field and on state-of-the-art software solutions that accelerate internal processes, even in environments that are becoming increasingly data-dependent.

SEMANTICS 2022 EU: 18th International Conference on Semantic Systems, September 13-15, 2022, Vienna, Austria
∗ Corresponding author. † These authors contributed equally.
judith.knoblach@iais.fraunhofer.de (J. Knoblach); nikhil.acharya@iais.fraunhofer.de (N. Acharya); bhavya.koranemkattil@iais.fraunhofer.de (B. Koranemkattil); andreas.both@datev.de (A. Both); diegocollarana@upb.edu (D. Collarana)
ORCID: 0000-0002-9177-5463 (A. Both); 0000-0002-2583-0778 (D. Collarana)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Figure 1: Question Answering over Tables – A Knowledge Graph and Transformer based approach

DATEV uses AI solutions that facilitate development and reduce the employees' workload. One of the most relevant aspects is data management and information retrieval. DATEV employees handle a wide range of information across different domains, e.g., taxes in European countries, specific software versions for wire transfers, or (strict) deadlines for storing personal, sensitive data. The data is often presented as tables, either as web tables or CSV files from internal, external, or official data providers. This data-intensive environment makes it challenging for users to find information efficiently.
Within the SPEAKER project (cf. https://www.speaker.fraunhofer.de/en), Fraunhofer IAIS and DATEV eG have teamed up to provide an application for querying tables with natural language, i.e., Question Answering (QA) over tables. Our approach allows non-technical users to express what they are looking for in a table naturally, in text form. Moreover, our solution integrates tables represented in various formats, e.g., CSV, web tables, or even tables encoded in XML, as in DATEV's use case. Figure 1 depicts the structure of the overall application. It consists of two proprietary components: the Fraunhofer Smart Data Connector (SDC) and the Fraunhofer QA Component. The SDC creates knowledge graphs by transforming heterogeneous enterprise data into actionable knowledge. The QA component answers questions expressed in natural language over the knowledge graph. The following section presents an overview of our approach, which combines the SDC and QA components into a solution for answering questions over tables. The last section describes the demonstration of the use case in detail.

2. Architecture

2.1. Smart Data Connector

The SDC is a generic component that transforms enterprise data sources into a knowledge graph and stores them. Figure 2a depicts a general overview of the SDC components and their interactions. Following a mapping approach [1], the SDC uses mapping rules to transform tables into RDF triples following the CSVW vocabulary. In DATEV's use case, we define a mapping rule to transform tables encoded in XML files to RDF. However, our component is generic enough to handle other scenarios, e.g., transforming CSV tables into RDF. At runtime, the SDC Engine provides a service to upload XML files.

Figure 2: Main elements of our solution. (a) Components of the Smart Data Connector; (b) the CSVW vocabulary.
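To make the mapping step concrete, the following is a minimal sketch (not the SDC implementation) of how a small in-memory table could be turned into CSVW-style RDF triples. The base URI http://example.org/, the table identifier, and the value/inColumn properties are hypothetical illustrations; the SDC's actual mapping rules are not shown in the paper.

```python
CSVW = "http://www.w3.org/ns/csvw#"
BASE = "http://example.org/"  # hypothetical base URI for illustration

def table_to_csvw_triples(table_id, headers, rows):
    """Return (subject, predicate, object) triples describing the table
    with csvw:Table, csvw:Column, and csvw:Row structure (simplified
    relative to the full CSVW vocabulary)."""
    t = f"{BASE}table/{table_id}"
    triples = [(t, "rdf:type", f"{CSVW}Table")]
    # One csvw:Column node per header, linked from the table URI.
    for c, name in enumerate(headers):
        col = f"{t}/column/{c}"
        triples.append((t, f"{CSVW}column", col))
        triples.append((col, f"{CSVW}name", name))
    # One csvw:Row node per row; each cell links back to its column.
    for r, row in enumerate(rows):
        row_uri = f"{t}/row/{r}"
        triples.append((t, f"{CSVW}row", row_uri))
        for c, value in enumerate(row):
            cell = f"{row_uri}/cell/{c}"
            triples.append((row_uri, f"{CSVW}describes", cell))
            triples.append((cell, f"{BASE}value", value))
            triples.append((cell, f"{BASE}inColumn", f"{t}/column/{c}"))
    return triples

triples = table_to_csvw_triples(
    "vat", ["Country", "Standard VAT rate"],
    [["Belgium", "21"], ["Germany", "19"]],
)
```

The resulting graph mirrors the structure explored later in the demo: triples connecting the table URI to its columns and rows, and triples linking cell contents to their row.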
Then, the Mapping Engine takes the file and the mapping rules and creates the semantic representation of the table using the CSVW vocabulary. Figure 2b shows the main CSVW concepts used in this application. Once the tables are transformed into RDF, the SDC offers several services to the QA component. The SDC Entity-Relation Index provides a search service for entities and relations (including their labels, types, and synonyms) and can be used by the QA component for Entity Linking tasks. The SDC Engine provides a SPARQL endpoint used by the QA component to query the tables. Finally, the SDC Embeddings Generator offers advanced services, e.g., entity similarity based on embeddings; it uses PyKEEN [2] to generate embeddings of the entities and relations of the graph with different models.

2.2. QA component

Our QA component is inspired by the TaPas model [3]. In our approach, the model is extended to answer questions over tables semantically described in knowledge graphs. TaPas is a deep learning model based on BERT's encoder architecture [4] and is specifically designed for question answering over tabular data. The TaPas model is built in two stages: pre-training and fine-tuning. The pre-training was done over millions of tables and related text segments crawled from Wikipedia, which is a crucial reason behind the model's performance [3]. The fine-tuning was done in a supervised fashion on several public datasets, such as WIKISQL, WIKITQ, and SQA. Figure 3 depicts the architecture of the TaPas model. In addition to BERT's encoder embeddings [4], the table data and structure are encoded as inputs for the TaPas model. A table is flattened into a string format in which the column headers and cells are concatenated as string tokens, and the question tokens are concatenated with this sequence to form the model input.
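The flattening step above can be sketched in simplified form. The real TaPas tokenizer also adds [CLS]/[SEP] markers and wordpiece tokenization; here we only show the idea that each table token carries a column and row index, which TaPas turns into learned column and row embeddings (question tokens get index 0). The function name and token scheme are illustrative, not the library's API.

```python
def flatten_for_tapas(question, headers, rows):
    """Return parallel lists: tokens, column ids, row ids.
    Question tokens use column/row id 0; header tokens use row id 0."""
    tokens, col_ids, row_ids = [], [], []
    # Question tokens first, with neutral (0) table coordinates.
    for tok in question.split():
        tokens.append(tok); col_ids.append(0); row_ids.append(0)
    # Column headers: row 0, columns numbered from 1.
    for c, header in enumerate(headers, start=1):
        tokens.append(header); col_ids.append(c); row_ids.append(0)
    # Body cells: rows and columns numbered from 1.
    for r, row in enumerate(rows, start=1):
        for c, cell in enumerate(row, start=1):
            tokens.append(cell); col_ids.append(c); row_ids.append(r)
    return tokens, col_ids, row_ids

tokens, col_ids, row_ids = flatten_for_tapas(
    "VAT rate in Belgium ?",
    ["Country", "Rate"],
    [["Belgium", "21"], ["Germany", "19"]],
)
```

Each position in the flattened sequence thus knows which cell it came from, which is what lets the model reason over table structure rather than plain text.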
Trained in the weakly supervised setting, TaPas achieves close to state-of-the-art performance on WIKISQL (accuracy: 83.6). On SQA, TaPas improves all metrics by at least 11 points. On WIKITQ [5], the model trained only on the original training data reaches an accuracy of 42.6, surpassing similar approaches [6].

Figure 3: Overview of the TaPas Architecture [3]

3. Demonstration

This demo shows the two-fold approach of our solution.

Building a knowledge graph from domain- and format-independent tables. The biggest challenge in DATEV's use case is mapping tables from different domains and formats without the additional effort of creating a complex ontology, keeping the mapping rules to a minimum. We chose to reuse the CSV on the Web (CSVW, https://www.w3.org/ns/csvw) vocabulary to obtain a generic semantic representation of the tables. Figure 2b shows the main concepts we use, i.e., "Table" and its connections to the concepts "Row", "Cell", and "Column". In the demo video, XML files from DATEV (document storage periods, VAT rates in European countries, and payment initiation versions for the SEPA formats) are mapped to CSVW with the rule "transform-tables". Using the query service in the SDC dashboard, the structure of the populated knowledge graph can be explored in more detail. The knowledge graph contains triples that connect the table URI with the corresponding columns and rows, as well as triples that link the content of a cell to its row via the row URI.

Answering questions over the knowledge graph. The table URI is the key element for connecting with the QA component. The user sends both the URI of the table and a question in natural language, e.g., "What is the intermediate VAT rate in Belgium?". Thanks to the shared vocabulary for all tables, i.e., CSVW, the QA component can send a standard SPARQL query to retrieve the table's columns and rows based on the URI.
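The exact query the QA component issues is not shown in the paper; the following is an illustrative sketch of a generic SPARQL query over the CSVW vocabulary that retrieves the rows, columns, and cells of a table given its URI (the property paths are simplified relative to the full CSVW model, and the example table URI is hypothetical).

```python
CSVW = "http://www.w3.org/ns/csvw#"

def build_table_query(table_uri):
    """Return a SPARQL SELECT query fetching (row, column, cell)
    bindings for the given csvw:Table URI."""
    return f"""
    PREFIX csvw: <{CSVW}>
    SELECT ?row ?column ?cell WHERE {{
        <{table_uri}> csvw:row ?row ;
                      csvw:column ?column .
        ?row csvw:describes ?cell .
    }}
    """

query = build_table_query("http://example.org/table/vat")
```

Because every table in the graph shares this vocabulary, the same query template works for any table URI, which is what makes the QA component table-agnostic.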
Then, the Cell and Column instances are pre-processed into the embeddings used to represent the table in TaPas: column embeddings, which indicate the column a token belongs to; row embeddings, which indicate the row a token belongs to; and rank embeddings, which indicate the rank of a particular cell's value within the column it belongs to. With this information, our QA component can combine tables stored in knowledge graphs with the TaPas model. It is also possible to answer questions requiring cell aggregation, e.g., "How many European countries have a standard VAT rate of 20 percent?" (COUNT), "What is the average standard VAT rate?" (AVERAGE), or "Can you tell me the total years I need to keep all invoices?" (SUM). As a weakly supervised deep learning model, TaPas can represent the relationships between columns and values in tables and has an excellent semantic understanding of natural-language queries. Since the model generalizes across domains [3], we can deploy TaPas for multiple use cases in multiple domains.

4. Conclusions

The application to the DATEV use case is just one of many possible applications of our question-answering-over-tables solution. This demo emphasizes the combination of RDF knowledge graphs with language models to solve the problem of answering questions over heterogeneous tables. In future work, we will explore improving the TaPas model by taking advantage of the semantically described tables. Moreover, we want to extend the TaPas model to the German language. We also plan to add more aggregation operators, like MINIMUM and MAXIMUM, to the TaPas architecture. Finally, the QA component can be improved to automatically identify the right table and answer without explicitly specifying the table URI; this way, it would be possible to answer questions from a pool of tables.
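The aggregation step can be illustrated with a small sketch (assumed post-processing, not taken from the paper): in TaPas, the model selects a set of cells and predicts an aggregation operator, and the final answer applies that operator to the selected cell values. Operators like MINIMUM and MAXIMUM, mentioned as future work, would fit the same shape.

```python
def aggregate(operator, cells):
    """Apply a TaPas-style aggregation operator to selected cell values
    (cells are strings, as they come out of the table)."""
    if operator == "NONE":  # plain cell-selection answer, no aggregation
        return cells
    values = [float(c) for c in cells]
    if operator == "COUNT":
        return len(values)
    if operator == "SUM":
        return sum(values)
    if operator == "AVERAGE":
        return sum(values) / len(values)
    raise ValueError(f"unknown operator: {operator}")

# "How many European countries have a standard VAT rate of 20 percent?"
selected = ["20", "20", "20"]       # cells the model selected
print(aggregate("COUNT", selected))  # 3
print(aggregate("AVERAGE", ["21", "19", "20"]))  # 20.0
```

Keeping the operator separate from cell selection is what allows the same selection mechanism to serve lookup questions and numeric questions alike.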
Acknowledgements

We acknowledge the support of the EU H2020 project Opertus Mundi (GA 870228) and the Federal Ministry for Economic Affairs and Energy (BMWi) project SPEAKER (FKZ 01MK20011A).

References

[1] A. Dimou, M. V. Sande, P. Colpaert, R. Verborgh, E. Mannens, R. V. de Walle, RML: A generic language for integrated RDF mappings of heterogeneous data, in: WWW, Seoul, Korea, volume 1184 of CEUR Workshop Proceedings, 2014.
[2] M. Ali, M. Berrendorf, C. T. Hoyt, L. Vermue, S. Sharifzadeh, V. Tresp, J. Lehmann, PyKEEN 1.0: A Python library for training and evaluating knowledge graph embeddings, J. Mach. Learn. Res. 22 (2021) 82:1–82:6.
[3] J. Herzig, P. K. Nowak, T. Müller, F. Piccinno, J. M. Eisenschlos, TaPas: Weakly supervised table parsing via pre-training, in: ACL, Online, 2020, pp. 4320–4333.
[4] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, 2019, pp. 4171–4186.
[5] Q. Liu, B. Chen, J. Guo, Z. Lin, J.-g. Lou, TaPEx: Table pre-training via learning a neural SQL executor, arXiv preprint arXiv:2107.07653 (2021).
[6] A. Neelakantan, Q. V. Le, M. Abadi, A. McCallum, D. Amodei, Learning a natural language interface with neural programmer, in: ICLR, Toulon, France, 2017.