            Entity Linking and Filling for
     Question Answering over Knowledge Graphs

                          Daniel Diomedi and Aidan Hogan

                  DCC, Universidad de Chile; IMFD; Santiago, Chile
                 daniel.diomedi@ug.uchile.cl, ahogan@dcc.uchile.cl



        Abstract. Question Answering over Knowledge Graphs (KGQA) aims
        to compute answers for natural language questions over a knowledge
        graph. Recent KGQA approaches adopt a neural machine translation
        (NMT) approach, where the natural language question is translated into
        a structured query language. However, NMT suffers from the out-of-
        vocabulary problem, where terms in a question may not have been seen
        during training, impeding their translation. This issue is particularly
        problematic for the millions of entities that large knowledge graphs de-
        scribe. We rather propose a KGQA approach that delegates the process-
        ing of entities to entity linking (EL) systems. NMT is then used to create
        a query template with placeholders that are filled by entities identified
        from the text in an EL phase. This approach gives rise to what we call
        the “entity filling” problem, where we must decide which placeholders to
        replace with which entities. To address this problem, we propose a so-
        lution based on sequence labelling and constraints. Experiments for QA
        with complex questions over Wikidata show that our approach outper-
        forms pure NMT approaches: while the task remains challenging, errors
        relating to entities in the translated queries are greatly reduced.


1     Introduction

Knowledge graphs (KGs) adopt a graph-based abstraction of knowledge, where
nodes represent entities and edges represent relations [13]. Many open knowl-
edge graphs provide public query services that can receive upwards of millions
of SPARQL queries per day over the Web [18]. In order to query such services,
clients must be able to formulate their queries in SPARQL, which may be unfa-
miliar to many users. Ideally a user could instead receive answers to questions
written in a natural language familiar to them. Addressing this task automati-
cally is known as Question Answering over Knowledge Graphs (KGQA).
    Recent approaches for KGQA have moved towards neural networks [4]. Using
neural machine translation (NMT) to translate the natural language question of
the user into a structured query (e.g., [16,20,24,26]) has recently shown promising
results, particularly for simple “factoid” questions [3] (such as “Where was Marie

    Copyright © 2022 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).

Curie born? ”), that can be answered by a single edge in the knowledge graph
(Marie Curie −[born in]→ Poland). Handling complex questions that require joining
multiple edges (such as “What was the profession of Marie Curie’s father? ”),
and/or involve query features such as counting, ordering, etc. (such as “How
many daughters did Marie Curie have? ”) remains very challenging.
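    To make the contrast concrete, the following is a minimal sketch (ours, not
part of any benchmark) of the simple and complex questions above as SPARQL
over the public Wikidata endpoint, where wd:Q7186 is Marie Curie, wdt:P19 is
place of birth, wdt:P22 is father, and wdt:P106 is occupation; the simple question
needs one edge, while the complex one joins two.

    import requests

    SIMPLE = "SELECT ?place WHERE { wd:Q7186 wdt:P19 ?place . }"   # one edge
    COMPLEX = """SELECT ?occupation WHERE {
      wd:Q7186 wdt:P22 ?father .           # Marie Curie's father
      ?father wdt:P106 ?occupation .       # his occupation
    }"""                                   # a join of two edges

    for query in (SIMPLE, COMPLEX):
        response = requests.get("https://query.wikidata.org/sparql",
                                params={"query": query, "format": "json"})
        for binding in response.json()["results"]["bindings"]:
            print(binding)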
    In this paper, we focus on advancing the state-of-the-art for NMT-based
KGQA. After discussing related works, we analyse the performance of state-
of-the-art “pure NMT” approaches, finding that their performance leaves con-
siderable room for improvement. We identify that the most problematic issue
relates to the out-of-vocabulary problem, particularly for entities. Based on this
observation, we investigate the specific claim that an approach combining entity
linking (EL) with NMT can improve performance versus a pure NMT approach
(without EL) by designing, implementing and performing experiments to com-
pare such an approach with analogous pure NMT baselines.
    Specifically, in our approach, NMT is used to generate an abstract query
template with generic placeholders for entities, while a specialised ensemble of
entity linking systems (EL) is used to extract entities from the question and
match them with the knowledge graph. This approach still leaves us with a key
problem that we call entity filling (EF) – which can be seen as a particular form
of slot filling – where we need to decide which placeholders in the query template
should be replaced with which entities returned from the EL process. To address
this problem, we propose to (1) first use a supervised sequence labelling approach
(SL) to tag elements of the text with roles corresponding to the query template,
and (2) thereafter apply a heuristic-based entity assignment (EA) that decides
which placeholders in the query template to replace with which entities from
EL based on the labels produced by SL. This approach then generates the final
query predicted to represent the user’s question.
    Our results show that our approach greatly reduces the number of errors due
to entities when compared with state-of-the-art pure NMT baselines, confirming
our claim. However, even with these advances, the results for complex questions
remain relatively poor, where we identify some open challenges for state-of-the-
art methods when faced with such questions, as well as potential solutions.


2   Related work

Related tasks A number of tasks are closely related to KGQA. Question Answer-
ing over Linked Data (QALD) [22] can be seen as a type of KGQA targeting
KGs specifically published as Linked Data. Text-to-SQL [10] relates to KGQA
approaches that construct a query from the natural language question, such as
NMT-based approaches; however, queries are answered over relational databases
that have fixed schemas and do not exhibit the level of diversity present in large
knowledge graphs. We thus focus on works for QALD and KGQA.

Non-Neural Approaches Early works on QALD avoided having to deal with the
full complexity of natural language by defining control languages (e.g., GiN-

SENG [2]) or by using fixed templates (e.g., TBSL [21]). More flexible ap-
proaches involve using the question to guide a graph exploration (e.g., Treo [11],
NLQ-KBQA [14]), where given a question such as “What was the profession
of Marie Curie’s father? ”, such an approach may first find candidate nodes in
the knowledge graph referring to Marie Curie (e.g., db:Marie_Curie in DBpedia,
wd:Q7186 in Wikidata), thereafter navigating through relations matching “fa-
ther ” (dbo:father, wdt:P22) and “profession” (dbo:occupation, wdt:P106) to-
wards a candidate answer. Further approaches (e.g., FREyA [5]) propose human-
in-the-loop interaction, incorporating multiple sources on the Web (e.g., Power-
Aqua [15]), etc. For details on earlier systems, we refer to Unger et al. [22].

Neural Approaches A number of recent KGQA approaches leverage modern
neural architectures. While neural networks have been used to classify questions,
or to rank candidate queries, here we focus on approaches using neural machine
translation (NMT), referring to Chakraborty et al. [4] for other uses. These
NMT approaches rephrase the KGQA task into a translation task from natural
language into a structured query language such as SPARQL. NMT approaches
can thus be equivalently viewed as performing neural semantic parsing (NSP),
mapping a natural language question into its structured query form.
    Ferreira Luz and Finger [17] propose to compute a vector representation for
questions and queries that are then fed into an LSTM encoder–decoder model
with an attention layer in order to perform the translation/parsing; the out-
of-vocabulary problem is explicitly left open. Soru et al. [20] propose “Neural
Semantic Machines”, which accept as input a set of templates, composed of pairs
of questions and queries with entities replaced by placeholders (e.g., “What was
the profession of <entity>’s father? ”, with the corresponding query), and a knowl-
edge graph, from which valid entity replacements of placeholders are computed,
generating concrete instances (e.g., “What was the profession of Marie Curie’s
father? ” with the corresponding query) that are used to train an LSTM encoder-
decoder model. While this approach partially addresses the out-of-vocabulary
problem, tens of millions of (highly repetitive) instances would need to be used
for training to cover all entities in a knowledge graph such as Wikidata. Yin et
al. [26] compare a variety of NMT models in the context of KGQA, including
RNN-, CNN- and transformer-based models, finding that the best model was the
Convolutional Sequence-to-Sequence (ConvS2S) model. However, Yin et al. [26]
acknowledge that the models encounter issues with larger vocabularies.
    Tackling the out-of-vocabulary issue, Lukovnikov et al. [16] merge word-level
and character-level representations for both the questions and the entity labels
in the knowledge graph; however, the approach is specifically designed to deal
with simple questions (e.g., “Who is Marie Curie’s father? ”), and would not work
with knowledge graph identifiers not based on a natural language (as is the case
for Wikidata). Wang et al. [24] propose gAnswer: a system that extends NMT
with entity linking (EL) and relation linking (RL) in order to better handle the
out-of-vocabulary problem, finding that entity linking in particular improves
performance over NMT. However, only one EL system is explored, and there is
no discussion of how entities are matched with their placeholders.

                     Table 1. Hyperparameter settings used

Model        Layers  Hidden Units  Attention  Optimiser  Learning Rate  Dropout  Size
ConvS2S          15     512–2,048        yes        SGD            0.5      0.2   500
LSTM              4         1,000        yes       Adam          0.001      0.3   500
Transformer       6         1,024        yes       Adam         0.0005      0.3   500



QALD/KGQA Datasets Various datasets have been proposed for QALD/KGQA,
including a series of QALD challenges [23], which provide a variety of small sets
(in the order of hundreds) of training and test examples over knowledge graphs
such as DBpedia and Wikidata. For NMT-based approaches, larger datasets
are needed for training. SimpleQuestions [3] is a larger alternative, containing
108,442 question–answer pairs, but focuses on simple factoid questions. The
Monument dataset [20] is based on 38 unique query templates about monu-
ments in DBpedia, where 300–600 question–query pairs are created for each
template; however, these instances focus on the data for one class in DBpedia.
The DBNQA dataset [12] includes 894 thousand question–query pairs produced
by taking 215 queries from a QALD training dataset, replacing the entities with
placeholders, and generating other valid replacements from the DBpedia knowl-
edge graph; there is thus a high level of repetition as the 894 thousand exam-
ples are based on 215 unique queries. Finally, LC-QuAD 2.0 [7] consists of 30
thousand question–query pairs over DBpedia and Wikidata generated from 22
templates, where the questions are paraphrased through crowdsourcing.


3   The Problem with Entities

Our goal is to advance NMT-based approaches for KGQA. We first establish a
baseline for a state-of-the-art “pure NMT” approach by selecting three of the
best-performing models for KGQA in the comparison of Yin et al. [26].
These three models constitute the state-of-the-art for NMT in the context of
natural language, involving three diverse architectures based on Recurrent Neu-
ral Networks (RNNs), Convolutional Neural Networks (CNNs) and attention
without recurrence or convolutions. More specifically, the three models are as
follows. ConvS2S (Convolutional Sequence-to-Sequence): A CNN-based archi-
tecture, featuring gated linear units, residual connections, and attention. LSTM
(Long Short-Term Memory): An RNN-based architecture that stacks four
LSTM layers, with attention. Transformer: A lightweight archi-
tecture that interleaves multi-head attention mechanisms with fully-connected
layers. We use Fairseq, Google Colab, and the hyperparameters listed in Table 1
(per Yin et al. [26]). For ConvS2S, the first 9 layers had 512 hidden units and
kernel width 3; the next 4 layers had 1,024 hidden units and kernel width 3; the
final 2 layers had 2,048 hidden units with kernel width 1.
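    As an illustration, the following is a minimal sketch (not the authors' exact
scripts) of training the ConvS2S baseline with the Fairseq command-line tools
under the Table 1 settings; the data path and the auxiliary flags are assumptions,
and fairseq-preprocess must first be run on tokenised question–query pairs.

    import subprocess

    # 9 layers of 512 units (width 3), 4 of 1,024 (width 3), 2 of 2,048 (width 1)
    conv_spec = "[(512, 3)] * 9 + [(1024, 3)] * 4 + [(2048, 1)] * 2"

    subprocess.run([
        "fairseq-train", "data-bin/lcquad2",   # binarised data (assumed path)
        "--arch", "fconv",                     # ConvS2S
        "--encoder-layers", conv_spec,
        "--decoder-layers", conv_spec,
        "--optimizer", "sgd", "--lr", "0.5",   # per Table 1
        "--dropout", "0.2",
        "--max-tokens", "4000",                # batch capacity (assumed)
    ], check=True)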
    With respect to the datasets, we first aim to replicate the experiments of Yin
et al. [26] over the LC-QuAD 2.0 dataset, which was the most complex, diverse

and challenging KGQA dataset they tested. We use the original training/test
splits of LC-QuAD 2.0. However, we encountered some quality issues with the
LC-QuAD 2.0 dataset, which may relate to its use of crowdsourcing platforms to
generate and diversify the question texts. Specifically, participants were asked to
rephrase synthetic questions such as “What is [occupation] of [father] of [Marie
Curie]? ” into a more natural form, such as “What was the occupation of Marie
Curie’s father? ”, or to paraphrase the question, such as “What did Marie Curie’s
father do for a living? ”. Upon revising the benchmark, we found that while in
many cases the process yielded meaningful question–query pairs, not all did. In
particular, we identified issues including: (1) the question text being null, empty
or marked not applicable (“NA”); (2) questions containing syntactic features
from the synthetic question, or copying the synthetic question in a verbatim
manner; (3) the text specifying the answer rather than the question; (4) the
question being too long or too short to be a reasonable rephrasing
of the synthetic question; (5) the query containing invalid tokens. Given the
number of instances in the dataset, we applied automatic heuristics to detect
such cases, where we discarded 2,502 (8.2%) of the 30,226 instances due to such
quality issues. We call the resulting cleaned dataset LC-QuAD 2.0*.
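    The following is a minimal sketch of such heuristics under our own assump-
tions about field names and thresholds; issue (3) requires comparing the question
against the query answers and is omitted here.

    import json

    def is_clean(instance: dict) -> bool:
        question = (instance.get("question") or "").strip()
        synthetic = (instance.get("synthetic") or "").strip()  # template-generated text
        query = (instance.get("query") or "").strip()
        if question.lower() in ("", "na", "n/a", "null"):      # issue (1)
            return False
        if "[" in question or question == synthetic:           # issue (2): leftover
            return False                                       # brackets / verbatim copy
        words = len(question.split())
        target = max(len(synthetic.split()), 1)
        if words < 3 or words > 3 * target:                    # issue (4): length bounds
            return False                                       # (thresholds assumed)
        if not query or "{" not in query:                      # issue (5): malformed query
            return False                                       # (check assumed)
        return True

    with open("lcquad2_train.json") as f:                      # assumed file name
        dataset = json.load(f)
    cleaned = [inst for inst in dataset if is_clean(inst)]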
    Given the quality issues with LC-QuAD 2.0 dataset, and the fact that its
training and test datasets feature question–query pairs based on common tem-
plates, we decided to create a novel, independent test dataset with 100 question–
query pairs answerable over Wikidata. This dataset does not follow standard
templates, and contains queries with a range of complexities and algebraic fea-
tures, ranging from simple questions such as “Who is the president of Poland? ”, to
complex questions such as “Which country that does not border Germany has
German as an official language? ”.¹ We call this novel dataset WikidataQA; it
contains 132 entities, 86 properties, and 208 triple/path patterns.
    In Table 2, we present the results for the three models trained on the LC-
QuAD 2.0* dataset, applied to the LC-QuAD 2.0* test set and the full Wiki-
dataQA set. These results consider a number of different metrics. First we present
the BLEU score, which is a modified version of precision used to compare a given
translation against a set of reference translations. This measure was reported by
Yin et al. [26]. However, a translated query may have an excellent BLEU score
but may be invalid or distinct from the reference query. We thus also present
accuracy, which indicates the percentage of questions for which the exact refer-
ence query was generated. This measure offers a lower-bound for performance
since two queries that are syntactically different may be semantically equivalent,
returning the same results on any knowledge graph. Unfortunately, determining
the equivalence of SPARQL queries (or queries in any language that encapsu-
lates first-order logic, including SQL) is undecidable. A complementary measure
is thus to compare the answers generated by the reference query and the com-
puted query in terms of precision and recall; we then finally present the macro
precision, recall and F1-score (averaged over each question, which is given equal weight).
¹ While QALD datasets could have been used, LC-QuAD 2.0 may use templates in-
  spired by QALD, creating a possible leak between training and test datasets.

Table 2. KGQA performance of baseline models in terms of BLEU (B), Accuracy (A),
Precision (P ), Recall (R), and F1 score (F1 ) for three models and two datasets

      Model         Dataset               B       A       P        R       F1
      ConvS2S       LC-QuAD 2.0*      51.5%    3.3%   16.4%    16.6%    16.4%
      LSTM          LC-QuAD 2.0*      45.6%    0.2%   12.9%    13.0%    12.9%
      Transformer   LC-QuAD 2.0*      48.2%    2.1%   15.0%    15.2%    14.9%
      ConvS2S       WikidataQA        18.8%     0%     5.5%     6.0%     5.3%
      LSTM          WikidataQA        18.3%     0%     7.9%     7.8%     7.9%
      Transformer   WikidataQA        18.3%     0%     7.2%     8.9%     7.3%



Again, however, this measure is somewhat flawed in that – in particular
for ASK queries that return a boolean result – a query that is completely unre-
lated may return a correct result. While no measure perfectly captures KGQA
performance, each provides a different insight.
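    As a concrete reference, the following sketch computes exact-match accuracy
over query strings and macro precision, recall and F1 over answer sets; retrieving
the answer sets (by running both queries over the knowledge graph) is omitted.

    def prf(gold: set, pred: set):
        if not gold and not pred:
            return 1.0, 1.0, 1.0            # convention for two empty answer sets
        p = len(gold & pred) / len(pred) if pred else 0.0
        r = len(gold & pred) / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    def evaluate(pairs):
        """pairs: (gold_query, pred_query, gold_answers, pred_answers) tuples."""
        acc = sum(g.strip() == p.strip() for g, p, _, _ in pairs) / len(pairs)
        scores = [prf(set(ga), set(pa)) for _, _, ga, pa in pairs]
        macro = [sum(col) / len(scores) for col in zip(*scores)]  # per-question average
        return acc, macro                   # macro = [P, R, F1]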
    The presented results indicate that while the BLEU score for LC-QuAD 2.0*
is mediocre, accuracy is very poor, while precision and recall of answers are some-
what better than accuracy. We highlight that over the LC-QuAD 2.0 dataset, Yin
et al. [26] report BLEU scores of 59.5%, 51.1% and 57.4% for ConvS2S, LSTM
and Transformer, respectively; although BLEU scores drop over our cleaned
dataset for the corresponding models – perhaps due to the removal of trivial
cases – the relative ordering of models remains the same. It is important to note
that Yin et al. [26] base their F1 scores on correct terms in the output query,
rather than solutions; thus their scores are higher than seen here. We can also
compare these results with those reported for gAnswer [24] over the QALD-9 and
Monument datasets, where Transformer and ConvS2S outperformed LSTM.
    From these baseline results, we can conclude that while NMT models do
sometimes find correct answers, overall the performance leaves much room for
improvement. Furthermore, the results for WikidataQA are consistently worse
than those for LC-QuAD 2.0*, indicating that when presented with questions
not following the patterns seen in training, the models struggle to generalise.
    Comparing the models, we see that ConvS2S provides the best results for
all metrics in the case of LC-QuAD 2.0*. However, while ConvS2S provides the
best BLEU score in the case of WikidataQA, it is outperformed by LSTM and
Transformer in terms of precision, recall and F1 .
    Although the NMT-based gAnswer system won the recent QALD-9 challenge
by a clear margin, these results highlight the difficulty of KGQA for complex
questions. Analysing the errors, we found that in 91.1% of the cases, the set of
entities in the output query did not correspond with that of the reference query,
thus constituting the primary source of error found. This is not surprising as the
specific entities of a question–query pair may not have been seen during training,
causing more frequent out-of-vocabulary errors than in the case of properties,
which are fewer in number and easier to cover in the training set. In the following,
we propose to combine Entity Linking with NMT to address this issue.

      “How many PhD students of Marie Curie were nominated for the Nobel Prize in Chemistry? ”

    Entity Linking:
      1. “Marie Curie”              → wd:Q7186
      2. “Nobel . . . Chemistry”    → wd:Q44585
      3. “Nobel Prize”              → wd:Q7191
      4. “PhD students”             → wd:Q12764792

    Sequence Labelling:
      “Marie Curie”             → obj2
      “Nobel Prize”             → obj1
      “Nobel . . . Chemistry”   → obj1

    Neural Machine Translation:
      SELECT (COUNT(*) as ?ans)
      WHERE {
        ?subj wdt:P1411 <obj1> .
        ?subj wdt:P184 <obj2> .
      }

    Entity Assignment:
      SELECT (COUNT(*) as ?ans)
      WHERE {
        ?subj wdt:P1411 wd:Q44585 . # nominated for the Nobel Prize in Chemistry
        ?subj wdt:P184 wd:Q7186 .   # doctoral supervisor is Marie Curie
      }

      Fig. 1. Overview of ElNeuKGQA approach for an example input question


4     Proposal: ElNeuKGQA

In order to reduce entity-related errors, we propose to combine NMT with En-
tity Linking (EL). EL identifies mentions of entities in a text and links them
to knowledge graph identifiers [25]. Given a text “What did Marie Curie’s fa-
ther do for a living? ”, an EL tool selecting DBpedia or Wikidata as a target
knowledge graph would be expected to link “Marie Curie” to db:Marie_Curie
or wd:Q7186, respectively. Some EL systems may further target common entities
such as “father ”, linking them to dbr:Father and wd:Q7565, respectively.
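    For instance, the following is a minimal sketch of querying one such system,
DBpedia Spotlight, through its public annotation API (response fields per the
Spotlight JSON format); the other systems used later offer similar endpoints.

    import requests

    response = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": "What did Marie Curie's father do for a living?"},
        headers={"Accept": "application/json"},
    )
    for resource in response.json().get("Resources", []):
        # e.g., "Marie Curie" -> http://dbpedia.org/resource/Marie_Curie
        print(resource["@surfaceForm"], "->", resource["@URI"])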
    Numerous EL techniques have been proposed over the years, where
a common theme is to use the context of the text to disambiguate an entity. For
example, in a question “Who is the lead singer of Boston? ”, while the mention
“Boston” has a higher prior probability of referring to the city, in this case
it clearly refers to the rock band Boston, based on contextual clues such as
“singer ”. A natural approach is then to delegate the translation of entities in
KGQA to external EL systems. This would avoid the need to train the NMT
model with examples for every entity, and by taking into account the context,
it should improve results versus the approach of applying disambiguation using
only prior probabilities, as was proposed for Neural Semantic Machines [20].
    The idea of using EL in the context of KGQA is far from new [11,8,4,24],
but to the best of our knowledge only Wang et al. [24] have recently looked
at combining EL with NMT. Their system, called gAnswer, won the QALD-9
challenge, and is thus clearly a promising approach. However, their approach is
based on an individual EL system, and details are missing in terms of how the
entity filling task is addressed, i.e., how cases where the EL system generates
multiple entities, or where the query involves multiple entities, are handled.

ElNeuKGQA Overview We propose a novel approach, called ElNeuKGQA, that
combines EL with NMT. Specifically, an EL system is used to identify entity
mentions in the question and link them to the knowledge graph. We combine

this with an NMT model that is trained and used to generate template queries
with placeholders for entities. The results of the EL system can then be used
to fill in these placeholders in the template query. In initial experiments, we
noted some key design factors underlying such an approach: (1) the EL system
should not produce fewer entity links than there are placeholders in the template
query, as this will lead to an invalid query; (2) the EL system may sometimes
identify more entity links than placeholders, where the correct entities must be
chosen; (3) for questions/queries with multiple entities, we must choose which
placeholder to replace with which entity, which is a non-trivial problem.
    We thus propose a custom ensemble EL approach, which incorporates mul-
tiple individual EL systems in order to boost recall (addressing (1)), and is
combined with a voting method to rank individual entities (addressing (2)).
Though various ensemble EL systems have already been proposed in the litera-
ture, we opted to build a task-specific ensemble system for two main reasons: (i)
the nature of the scoring function changes per the aforementioned requirements
where, rather than voting for entities to link a given mention in the text as in a
typical EL scenario, we need to score entities to fill “placeholders” in the
query; (ii) existing ensemble systems do not target Wikidata (as targeted by the
LC-QuAD 2.0 dataset), where although Wikidata contains a large intersection
of interlinked entities with knowledge bases such as DBpedia, Wikipedia, and
YAGO that are targeted by the existing systems, it also contains a much broader
set of unique entities, where our ensemble includes an EL system for Wikidata.
    We further propose to combine EL with an entity filling (EF) technique that
chooses which placeholders in the query to replace with which entities (address-
ing (3)). This entity filling component consists of two phases: (SL) a sequence
labeller is trained on the question–query pairs in order to identify the role of
entities in the question, matching them with their role in the query; and (EA)
an entity assigner uses these roles and specific heuristics to map each query
placeholder to a specific entity produced by EL.
    We provide an example of our proposed ElNeuKGQA system in Figure 1.
Entity linking (EL) provides a ranked list of entities mentioned in the question
text. Neural machine translation (NMT) is used to generate a query template
with placeholders, such as <obj1> and <obj2>, to be filled later. Entity filling
(EF) requires two steps: sequence labelling (SL) annotates phrases in the ques-
tion with roles corresponding to placeholder labels, while entity assignment (EA)
decides which entity to use to replace which placeholder in the query template.
   There exist a number of ways in which this process can give an incorrect
query: EL may fail to identify all of the relevant entities; sequence labelling may
produce labels that do not match the template; NMT may produce an incorrect
template; etc. Later we will look at the most common forms of errors in practice.
   In order to train the system, the mentions of entities found in the output
query must be annotated in the question with their knowledge graph identifiers,
which is the case for LC-QuAD 2.0* and our novel WikidataQA dataset.
    We now describe the individual components in more detail.

Neural Machine Translation The NMT component follows the baseline ap-
proach, with the sole exception that the model is trained to produce query tem-
plates with generic placeholders replacing the specific knowledge graph nodes
referring to a particular entity mention (where specific nodes are rather identi-
fied by the EL component). In particular, all constant subjects and objects in
the query are replaced, in a one-to-one manner, with placeholders P of the form
sbjk/objk for subject/object IRIs, and numk/strk for numbers/strings, where k
indicates the k-th placeholder of that form in the query. Formally, let Q denote the
set of all finite-length sequences of (Unicode) characters of the form ⟨c1, ..., cn⟩.
Let S denote the set of all finite-length sequences of the form ⟨t1, ..., tn⟩, where
each ti (1 ≤ i ≤ n) is a query token that may be a syntactic terminal (e.g., “{”),
an operator (e.g., COUNT), a query variable, an RDF term, or a placeholder. Then
NMT : Q → S is a (learnt) function that maps an input question q ∈ Q to an
output sequence of query tokens NMT(q) ∈ S. Though most elements of S are
not valid queries, the goal is to learn a function NMT that generates a valid
query representing the intent of the input question.
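    The following is a minimal sketch, under our own simplifying assumptions (a
whitespace-tokenised query and a crude subject-position test), of how a training
query can be rewritten into such a template.

    def to_template(query: str) -> tuple[str, dict]:
        mapping, counters = {}, {"sbj": 0, "obj": 0}
        tokens, out = query.split(), []
        for i, tok in enumerate(tokens):
            if tok.startswith("wd:"):                  # a constant Wikidata entity
                if tok not in mapping:
                    # IRIs right after "{" or "." are in subject position (crude test)
                    role = "sbj" if i > 0 and tokens[i - 1] in ("{", ".") else "obj"
                    counters[role] += 1
                    mapping[tok] = f"<{role}{counters[role]}>"
                out.append(mapping[tok])
            else:
                out.append(tok)
        return " ".join(out), mapping

    template, entity_to_placeholder = to_template(
        "SELECT (COUNT(*) as ?ans) WHERE "
        "{ ?subj wdt:P1411 wd:Q44585 . ?subj wdt:P184 wd:Q7186 . }"
    )
    # template: ... { ?subj wdt:P1411 <obj1> . ?subj wdt:P184 <obj2> . }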

Entity Linking Entity linking (EL) identifies entity mentions in text, and links
them to entity identifiers in a knowledge base. In our scenario, given an input
question q ∈ Q, and an RDF knowledge graph G referring to its entities with a
set of IRIs E, the function EL(q, G) = M provides us with a set of entity men-
tions M , which are quads of the form (e, k1 , k2 , z), where k1 < k2 indicate the
start and end offsets of a mention of entity e in the text q, and z provides a score.
ElNeuKGQA is instantiated with an ensemble of four EL systems: AIDA [27]
targeting YAGO, DBpedia Spotlight [19] targeting DBpedia, OpenTapioca [6]
targeting Wikidata, and TagME [9] targeting Wikipedia. Though AIDA, DB-
pedia Spotlight and TagME target other knowledge bases, we can use existing
mappings to convert DBpedia, Wikipedia and YAGO links to Wikidata links;
however, Wikidata has entities appearing in neither DBpedia, Wikipedia, nor
YAGO, making OpenTapioca important for targeting these Wikidata-specific en-
tities. We primarily choose these four EL systems as they provide online APIs,
where other systems could be added to the ensemble system in future. It is im-
portant for later entity assignment that the EL component rank the entity links
that are found. Along these lines, we establish a voting system for z where, for
a given mention, the output of each EL system gives a vote to the top-scored
entity for that mention based on the order (not the score) in which the results
are output. In the case of a tie, we boost z for individual EL systems that pro-
vide more precise results for the training dataset; if still tied, we use the scores
produced by individual systems to boost z and break the tie.
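    The following is a minimal sketch of this voting scheme under simplifying
assumptions (candidates for a single mention, and per-system training precision
given as a prior): each system votes for its top-ranked candidate by order, with
the prior and then the systems' own scores used to break ties.

    from collections import defaultdict

    def rank_candidates(per_system: dict, prior: dict) -> list:
        """per_system: system name -> ordered [(entity, score), ...] for one mention;
           prior: system name -> precision on the training set (assumed given)."""
        tally = defaultdict(lambda: [0, 0.0, 0.0])   # entity -> [votes, prior, score]
        for name, candidates in per_system.items():
            for rank, (entity, score) in enumerate(candidates):
                if rank == 0:                        # vote by order, not by score
                    tally[entity][0] += 1
                    tally[entity][1] += prior.get(name, 0.0)
                tally[entity][2] = max(tally[entity][2], score)
        return sorted(tally, key=lambda e: tuple(tally[e]), reverse=True)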

Sequence Labelling We are then left to perform entity filling, which consists
of two phases, starting with sequence labelling (SL). The SL phase takes the
question q as input, and produces a set of roles SL(q) = R, i.e., a set of triples
of the form (p, k1 , k2 ), where p ∈ P is a placeholder (indicating a role, e.g.,
obj2), and k1 < k2 indicate the start and end position for the placeholder tag.
We assume that the dataset provides annotations to enable training of SL (as

provided, e.g., by LC-QuAD 2.0). In order to implement SL in ElNeuKGQA, an
embedding layer is used, composed of a stacked embedding that includes GloVe
embeddings and Flair contextual embeddings. We use the model proposed by
Akbik et al. [1]: GloVe and Flair embeddings are concatenated and passed to a
BiLSTM with a hidden size of 256, with a final conditional random field (CRF) layer.
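    The following is a minimal sketch of this labeller with the flair library, as-
suming a CoNLL-style column corpus whose tags are our placeholder roles (sbj1,
obj1, etc.); hyperparameters beyond the hidden size are assumptions.

    from flair.data import Corpus
    from flair.datasets import ColumnCorpus
    from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    corpus: Corpus = ColumnCorpus("data/", {0: "text", 1: "role"},
                                  train_file="train.txt", test_file="test.txt")
    tag_dict = corpus.make_label_dictionary(label_type="role")

    embeddings = StackedEmbeddings([
        WordEmbeddings("glove"),           # classic GloVe word embeddings
        FlairEmbeddings("news-forward"),   # contextual character-level LM
        FlairEmbeddings("news-backward"),
    ])
    tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                            tag_dictionary=tag_dict, tag_type="role", use_crf=True)
    ModelTrainer(tagger, corpus).train("models/sl", max_epochs=50)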

Entity Assignment The second phase of entity filling, and the last step overall,
involves entity assignment (EA). The EA component accepts the output of the
previous three components – (s, M, R) where s = NMT(q), M = EL(q, G), R =
SL(q) – and produces the final query EA(s, M, R) ∈ S. Specifically, EA com-
putes an injective mapping γ : Ps → EM ∪D from the placeholders Ps mentioned
in s to the entities EM mentioned in M and a set of datatype literals D, returning
the image of s under γ (i.e., replacing each placeholder p in s with γ(p)).
    In Algorithm 1, function mapP2E describes how γ is computed. For brevity,
we focus on how entities are processed; datatype literals are generated directly
from R for sub-strings labelled with strk or numk. We use domain(γ) to de-
note the set of placeholders for which γ is defined, range(γ) to denote the set
of entities returned for some placeholder by γ, and role(p) to indicate the role
of a placeholder p (e.g., role(obj3) = obj). The algorithm throws an error if
there are more unique entity placeholders appearing in s than unique entities
appearing in M (|Ps | > |EM |).² We apply three phases of assignment. In the
first phase, each placeholder is mapped to the highest ranking entity (from EL)
whose mention in the question text is labelled with the same placeholder (by SL),
if any; if no exactly matching positions in the question text are found, overlap-
ping positions are considered. In the second phase, each unmapped placeholder
is mapped to the highest ranking entity whose mention is labelled with a place-
holder of the same role (e.g., any obj* for obj2); if no exactly matching positions
are found, overlapping positions are considered. Finally, each unmapped place-
holder is paired with an entity based on the order in which they appear in
the query template and question text, respectively. Throughout, the algorithm
enforces that γ maps each placeholder to a distinct entity. The algorithmic com-
plexity is O(pe + p log p + e log e) for p = |Ps | and e = |EM |.


5      Experiments

Our hypothesis is that combining entity linking (and entity filling) with neural
machine translation can improve the quality of results for the KGQA task versus
pure neural machine translation approaches. The following experiments explore
this hypothesis by extending the best three NMT models in the recent results
by Yin et al. [26] and the three models used by gAnswer [24] – namely ConvS2S,
LSTM, and Transformer – with the proposed EL and EF methods. Experiments
comparing NMT and non-NMT approaches are not relevant for our hypothesis,
where relevant results can rather be found in QALD-9 [24].
² This could optionally be relaxed to replace extra placeholders with variables, or to
  map multiple placeholders to a given entity.

Algorithm 1 Mapping placeholders to entities
 1: function mapP2E(s, M, R)                  ▷ i.e., mapP2E(NMT(q), EL(q, G), SL(q))
 2:    Ps ← {p ∈ P | p appears in s and role(p) ∈ {sbj, obj}}
 3:    EM ← {e | there exists k1, k2, z : (e, k1, k2, z) ∈ M}
 4:    PR ← {p ∈ P | there exists k1, k2 : (p, k1, k2) ∈ R}
 5:    if |Ps | > |EM | then
 6:        throw error "more placeholders than entities"
 7:    initialise γ                                                ▷ an empty mapping
 8:    Ps ← sortRole(Ps )                      ▷ by sbj1, . . . , sbja, obj1, . . . , objb
 9:    for i ∈ 1 : |Ps | do                                         ▷ iterate in order
10:        assignByRole(e, M, R, {Ps [i]}, γ)
11:    for i ∈ 1 : |Ps | such that Ps [i] ∉ domain(γ) do   ▷ skip assigned placeholders
12:        P′ ← {p′ ∈ Ps | role(Ps [i]) = role(p′) and Ps [i] ≠ p′}
13:        assignByRole(e, M, R, P′, γ)
14:    P′s ← sortPos(Ps \ domain(γ), s)    ▷ asc. by start position of first appearance in s
15:    E′M ← sortPos(EM \ range(γ), M )     ▷ asc. by start position of first mention in q
16:    for i ∈ 1 : |P′s | do
17:        γ ← γ ∪ {P′s [i] ↦ E′M [i]}
18:    return γ
19: function assignByRole(e, M, R, P′, γ)
20:    C ← {(e, z) | there exists k1, k2, p ∈ P′ : (e, k1, k2, z) ∈ M and (p, k1, k2) ∈ R}
21:    C′ ← {(e, z) ∈ C | e ∉ range(γ)}         ▷ ensure entities mapped to at most once
22:    if C′ is not empty then
23:        γ ← γ ∪ {p ↦ arg maxe C′}      ▷ map p to entity e with highest score z in C′
24:    else
25:        D ← {(e, z) | ∃ k1, k2, k3, k4, x, p ∈ P′ : (e, k1, k2, z) ∈ M, (p, k3, k4) ∈ R,
                            k1 ≤ x ≤ k2 and k3 ≤ x ≤ k4}            ▷ has overlapping text
26:        D′ ← {(e, z) ∈ D | e ∉ range(γ)}
27:        if D′ is not empty then
28:            γ ← γ ∪ {p ↦ arg maxe D′}

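    Once γ is computed, the final substitution is straightforward; the following
is a minimal sketch of taking the image of the template under γ, reusing the
example of Figure 1.

    def apply_mapping(template: str, gamma: dict) -> str:
        for placeholder, entity in gamma.items():             # replace each placeholder p
            template = template.replace(placeholder, entity)  # ... with gamma(p)
        return template

    query = apply_mapping(
        "SELECT (COUNT(*) as ?ans) WHERE "
        "{ ?subj wdt:P1411 <obj1> . ?subj wdt:P184 <obj2> . }",
        {"<obj1>": "wd:Q44585", "<obj2>": "wd:Q7186"},
    )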


Setting As per the baselines of Section 3, the NMT and SL models are built
with the Fairseq framework. We use the same hyperparameters (Table 1). We
will again use the cleaned LC-QuAD 2.0* for training and testing (following the
original splits), and the WikidataQA dataset for testing only.


Query Generation and Question Answering Table 3 presents high-level results
that compare the baseline results for “pure NMT” (shown previously in Table 2)
with the results for our proposed ElNeuKGQA system using the same metrics.
We see that ElNeuKGQA consistently – and in most cases considerably – out-
performs the baseline approaches: ElNeuKGQA variants achieve equal or better
results for all models, metrics and datasets compared to their baseline pure
NMT counterparts. These results over two KGQA datasets and three sequence-
to-sequence models thus support our hypothesis that combining entity linking
and filling with NMT (the ElNeuKGQA variants) can improve the quality of re-

              Table 3. KGQA performance of baseline and ElNeuKGQA

     System                  Dataset             B      A      P      R     F1
     ConvS2S               LC-QuAD 2.0* 51.5% 3.3% 16.4% 16.6% 16.4%
     ElNeuKGQA-ConvS2S     LC-QuAD 2.0* 59.3% 14.0% 26.9% 27.0% 26.9%
     LSTM                  LC-QuAD 2.0* 45.6% 0.2% 12.9% 13.0% 12.9%
     ElNeuKGQA-LSTM        LC-QuAD 2.0* 52.9% 7.9% 20.2% 20.3% 20.2%
     Transformer           LC-QuAD 2.0* 48.2% 2.1% 15.0% 15.2% 14.9%
     ElNeuKGQA-Transformer LC-QuAD 2.0* 51.8% 7.1% 19.9% 20.1% 19.8%
     ConvS2S               WikidataQA        18.8%    0% 5.5% 6.0% 5.3%
     ElNeuKGQA-ConvS2S     WikidataQA        20.5%    0% 12.9% 12.9% 12.9%
     LSTM                  WikidataQA        18.3%    0% 7.9% 7.8% 7.9%
     ElNeuKGQA-LSTM        WikidataQA        18.8%    0% 8.9% 8.9% 8.9%
     Transformer           WikidataQA        18.3%    0% 7.2% 8.9% 7.3%
     ElNeuKGQA-Transformer WikidataQA        19.8%    0% 11.9% 11.9% 11.9%



sults for the KGQA task (versus pure NMT approaches). As an auxiliary result,
in terms of models, we see that ConvS2S clearly outperforms the other two.
    Although we see clear relative gains over the baseline, even in the case of
ElNeuKGQA, the results leave a lot of room for improvement. For example,
although correct answers are generated for some WikidataQA queries, no query
generated corresponded (precisely) to the gold standard query for any model or
variant (see A: accuracy). KGQA for complex questions remains a challenging
task, and in what follows we characterise the performance, errors and limitations
of ElNeuKGQA by each component, in order to better understand the origin of
these relative gains, and also ways in which the approach could be improved.

NMT Template Generation We first look at the results for generating query tem-
plates using NMT. Given that the query templates are not expected to produce
answers to the question, we present only BLEU and accuracy. These results are
shown in Table 4. Looking first at the results for LC-QuAD 2.0*, we observe
a moderate increase in the BLEU score versus the baseline for generating the
full query in Table 2 (51.5% → 65.2%, 45.6% → 58.2%, 48.2% → 56.9%, resp.).
We also see a marked improvement in accuracy (3.3% → 34.3%, 0.2% → 19.3%,
2.1% → 16.4%, resp.), indicating that the models can generate query templates
much more accurately than full queries mentioning specific entities (further high-
lighting the out-of-vocabulary issues encountered by pure NMT models when
processing entities). On the other hand, the BLEU score for WikidataQA sees
only a slight increase (18.8% → 20.1%, 18.3% → 19.0%, 18.3% → 20.2%) and
accuracy remains at zero: not a single WikidataQA template is generated in
a syntactically identical manner to the gold standard template by any model.
This highlights that the models trained on the LC-QuAD 2.0* dataset do not
generalise to WikidataQA. However, it is still possible for queries that are not
syntactically identical to return (some) correct answers, as seen in Table 3.

                   Table 4. Performance of template generation

                Model           Dataset                 B          A
                ConvS2S         LC-QuAD 2.0*        65.2%     34.3%
                LSTM            LC-QuAD 2.0*        58.2%     19.3%
                Transformer     LC-QuAD 2.0*        56.9%     16.4%
                ConvS2S         WikidataQA          20.1%        0%
                LSTM            WikidataQA          19.0%        0%
                Transformer     WikidataQA          20.2%        0%

 Table 5. Performance of individual and ensemble EL systems for LC-QuAD 2.0*

                                    Micro             Macro
            EL System
                                  P     R    F1     P     R    F1
            AIDA              72.5% 31.3% 43.7% 38.5% 30.5% 33.1%
            DBpedia Spotlight 20.8% 52.7% 29.9% 23.3% 52.5% 30.8%
            OpenTapioca       61.8% 32.0% 42.2% 34.9% 30.2% 31.4%
            TagME             25.0% 59.5% 35.2% 29.5% 59.4% 37.4%
            Ensemble           19.5% 68.6% 30.3% 23.9% 68.4% 33.6%



EL Systems Next we look at the performance of the four individual EL systems
that we use for our ensemble, and of the voting-based ensemble itself. The results
for micro and macro precision, recall and F1 -score are presented in Table 5
over the LC-QuAD 2.0* dataset. These results consider the entity mentions in
the question text that correspond to entities in the reference query as being
“correct”; it may be the case that an EL system detects entities that would
otherwise be considered valid, but are not used in the query. Individual systems
find different trade-offs between precision and recall, where AIDA offers the
highest precision, while TagME offers the highest recall. Our ensemble offers the
best recall, but has (almost) the worst precision. This is a result of its design: we
return all entities with a vote-based ranking so as to offer subsequent steps as
many (ranked) options as possible, ensuring that all slots can be filled.
The results of Table 5 do not consider that ensemble results are ranked.

Sequence Labelling We now look at the results for sequence labelling, where Ta-
ble 6 again presents the micro and macro precision, recall and F1 -scores. Preci-
sion and recall tend to be balanced. However, we note that there is a significant
drop in performance when considering the WikidataQA dataset; the model is
trained on LC-QuAD 2.0*, and only partially generalises to the other dataset.

Entity Filling We specifically analyse the micro and macro precision, recall and
F1 -scores for EF over the pairs (p, e) that are generated, where p is a placeholder,
and e is an entity or literal. These results are presented in Table 7. We see that
the results are the lowest thus far, which is to be expected as EF depends on
the EL and SL components and thus accumulates their errors.

                   Table 6. Performance of sequence labelling

                                 Micro             Macro
              Dataset
                               P     R    F1     P     R    F1
               LC-QuAD 2.0* 62.0% 66.9% 64.4% 63.8% 66.9% 64.8%
              WikidataQA 22.7% 29.8% 25.8% 30.9% 31.8% 31.0%

                        Table 7. Performance of entity filling

                                 Micro             Macro
              Dataset
                               P     R    F1     P     R    F1
               LC-QuAD 2.0* 47.2% 45.6% 46.4% 49.6% 47.4% 47.4%
              WikidataQA 18.0% 17.0% 17.5% 18.3% 21.5% 18.8%



Sources of error We finally summarise and compare the sources of error found
for ElNeuKGQA and baseline pure NMT approaches. Since the accuracy for
WikidataQA was 0%, in Table 8 we present the error rates for the LC-QuAD 2.0*
dataset, where T refers to a correct template, E refers to the query having the
correct set of entities, and F refers to correct entity filling. We use ! to denote
an error in the corresponding step and omit both T!F and !!F since an error in
E implies an error in F. Of note is that all three pure NMT baselines produce a
much lower percentage of queries containing correct entities versus ElNeuKGQA;
for example, comparing ConvS2S with ElNeuKGQA-ConvS2S, only 10.0% (TEF
+ TE! + !EF + !E!) of the queries produced by the former contain correct
entities, while 58.2% of the latter contain correct entities. This indicates that
ElNeuKGQA primarily gains over pure NMT approaches due to its handling of
entities. Focusing on ElNeuKGQA, template errors were the most common.


6    Conclusions

Question Answering over Knowledge Graphs (KGQA) has seen significant ad-
vances in recent years, but more complex questions remain a major challenge. Al-
though recent approaches using neural machine translation (NMT) yield promis-
ing results, NMT suffers from the out-of-vocabulary problem, particularly for en-
tities. We propose ElNeuKGQA: an approach that combines entity linking (EL)
with NMT through a novel entity filling approach. Our experiments show that
combining EL with NMT outperforms a pure NMT approach for all models and
datasets considered. These results – along with those of Wang et al. [24] – highlight
the potential for NMT and EL to complement each other.
     Regarding limitations, the results for more complex questions – particularly
those having unseen templates – remain poor. KGQA over such questions is a
challenging task, but also an important one. Regarding future work, we iden-
tified some quality issues with the LC-QuAD 2.0 dataset, where NMT-based
approaches may perform better given larger, higher-quality datasets for train-

              Table 8. Sources of error in the LC-QuAD 2.0* dataset

        System                       TEF    TE!     T!!    !EF     !E!     !!!
        ConvS2S                     3.3% 1.3% 25.3% 2.8% 2.6% 64.9%
        ElNeuKGQA-ConvS2S          14.0% 4.0% 13.5% 15.7% 24.5% 28.5%
        LSTM                         0.2% 0.1% 20.3% 0.7% 0.6% 78.1%
        ElNeuKGQA-LSTM               7.9% 2.5% 8.2% 21.4% 27.3% 33.7%
        Transformer                  2.1% 1.0% 22.1% 4.0% 4.3% 66.6%
        ElNeuKGQA-Transformer        7.1% 1.7% 6.8% 17.6% 31.6% 35.1%



ing; however, creating such datasets appears costly. Another avenue to explore
is to extend NMT with relation linking (RL), as proposed by Wang et al. [24];
however, sometimes a query relation is not explicitly mentioned in the question, and
while large knowledge graphs typically contain millions of entities, they only
contain thousands of relations (or properties), meaning that it would be more
feasible to cover all relations in training. It may also be of interest to leverage
the knowledge graph in order to avoid generating queries with empty results;
however, user questions may sometimes yield empty results, so it would seem
best to avoid always “forcing” a query with results.
Supplementary material, including code and queries, is available from the repos-
itory https://github.com/thesemanticwebhero/ElNeuKGQA.

Acknowledgements Hogan was supported by Fondecyt Grant No. 1221926 and
by ANID – Millennium Science Initiative Program – Code ICN17_002.

References
 1. Akbik, A., Blythe, D., Vollgraf, R.: Contextual String Embeddings for Sequence
    Labeling. In: COLING. pp. 1638–1649. ACL (2018)
 2. Bernstein, A., Kaufmann, E., Göhring, A., Kiefer, C.: Querying Ontologies: A
    Controlled English Interface for End-Users. In: ISWC. LNCS, vol. 3729, pp. 112–
    126. Springer (2005)
 3. Bordes, A., Usunier, N., Chopra, S., Weston, J.: Large-scale Simple Question An-
    swering with Memory Networks. CoRR abs/1506.02075 (2015)
 4. Chakraborty, N., Lukovnikov, D., Maheshwari, G., Trivedi, P., Lehmann, J., Fis-
    cher, A.: Introduction to Neural Network based Approaches for Question Answer-
    ing over Knowledge Graphs. CoRR abs/1907.09361 (2019)
 5. Damljanovic, D., Agatonovic, M., Cunningham, H., Bontcheva, K.: Improving hab-
    itability of natural language interfaces for querying ontologies with feedback and
    clarification dialogues. J. Web Semant. 19, 1–21 (2013)
 6. Delpeuch, A.: OpenTapioca: Lightweight Entity Linking for Wikidata. In: Wikidata
    Workshop. CEUR, vol. 2773 (2020)
 7. Dubey, M., Banerjee, D., Abdelkawi, A., Lehmann, J.: LC-QuAD 2.0: A Large
    Dataset for Complex Question Answering over Wikidata and DBpedia. In: ISWC.
    LNCS, vol. 11779, pp. 69–78. Springer (2019)

 8. Dubey, M., Banerjee, D., Chaudhuri, D., Lehmann, J.: EARL: Joint Entity and Re-
    lation Linking for Question Answering over Knowledge Graphs. In: ISWC. LNCS,
    vol. 11136, pp. 108–126. Springer (2018)
 9. Ferragina, P., Scaiella, U.: TAGME: on-the-fly annotation of short text fragments
    (by wikipedia entities). In: CIKM. pp. 1625–1628. ACM (2010)
10. Finegan-Dollak, C., Kummerfeld, J.K., Zhang, L., Ramanathan, K., Sadasivam,
    S., Zhang, R., Radev, D.R.: Improving Text-to-SQL Evaluation Methodology. In:
    ACL. pp. 351–360. ACL (2018)
11. Freitas, A., Oliveira, J.G., O’Riain, S., Curry, E., da Silva, J.C.P.: Treo: Best-
    Effort Natural Language Queries over Linked Data. In: NLDB. LNCS, vol. 6716,
    pp. 286–289. Springer (2011)
12. Hartmann, A.K., Soru, T., Marx, E.: Generating a Large Dataset for Neural Ques-
    tion Answering over the DBpedia Knowledge Base. In: LD Management (2018)
13. Hogan, A., et al.: Knowledge graphs. ACM Comput. Surv. 54(4), 71 (2021)
14. Jung, H., Kim, W.: Automated conversion from natural language query to
    SPARQL query. J. Intell. Inf. Syst. 55(3), 501–520 (2020)
15. López, V., Fernández, M., Motta, E., Stieler, N.: PowerAqua: Supporting users in
    querying and exploring the semantic web. Semantic Web 3(3), 249–265 (2012)
16. Lukovnikov, D., Fischer, A., Lehmann, J., Auer, S.: Neural network-based question
    answering over knowledge graphs on word and character level. In: WWW. pp.
    1211–1220. ACM (2017)
17. Luz, F.F., Finger, M.: Semantic Parsing Natural Language into SPARQL:
    Improving Target Language Representation with Neural Attention. CoRR
    abs/1803.04329 (2018)
18. Malyshev, S., Krötzsch, M., González, L., Gonsior, J., Bielefeldt, A.: Getting the
    Most Out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge
    Graph. In: ISWC. LNCS, vol. 11137, pp. 376–394. Springer (2018)
19. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia spotlight: shedding
    light on the web of documents. In: I-SEMANTICS. pp. 1–8. ACM (2011)
20. Soru, T., Marx, E., Moussallem, D., Publio, G., Valdestilhas, A., Esteves, D., Neto,
    C.B.: SPARQL as a foreign language. In: SEMANTiCS P&D. CEUR, vol. 2044.
    CEUR-WS.org (2017)
21. Unger, C., Bühmann, L., Lehmann, J., Ngomo, A.N., Gerber, D., Cimiano, P.:
    Template-based question answering over RDF data. In: WWW. pp. 639–648 (2012)
22. Unger, C., Freitas, A., Cimiano, P.: An Introduction to Question Answering over
    Linked Data. In: Reasoning Web. LNCS, vol. 8714, pp. 100–140. Springer (2014)
23. Usbeck, R., Gusmita, R.H., Ngomo, A.N., Saleem, M.: 9th Challenge on Ques-
    tion Answering over Linked Data (QALD-9) (invited paper). In: NLIWoD. CEUR,
    vol. 2241, pp. 58–64. CEUR-WS.org (2018)
24. Wang, S., Jiao, J., Li, Y., Zhang, X., Feng, Z.: Answering Questions over RDF
    by Neural Machine Translating. In: ISWC P&D. CEUR, vol. 2721, pp. 189–194.
    CEUR-WS.org (2020)
25. Wu, G., He, Y., Hu, X.: Entity Linking: An Issue to Extract Corresponding Entity
    With Knowledge Base. IEEE Access 6, 6220–6231 (2018)
26. Yin, X., Gromann, D., Rudolph, S.: Neural machine translating from natural lan-
    guage to SPARQL. Future Gener. Comput. Syst. 117, 510–519 (2021)
27. Yosef, M.A., Hoffart, J., Bordino, I., Spaniol, M., Weikum, G.: AIDA: An Online
    Tool for Accurate Disambiguation of Named Entities in Text and Tables. Proc.
    VLDB Endow. 4(12), 1450–1453 (2011)