SPARQL-QA enters the QALD challenge

    Manuel Borroto, Francesco Ricca, Bernardo Cuteri, and Vito Barbara

                University of Calabria, Rende (CS) 87036, Italy
 {manuel.borroto,francesco.ricca,bernardo.cuteri,barbara.vito}@unical.it


        Abstract. The large public knowledge bases (KBs) available today provide
        the community with an invaluable amount of information covering multiple
        domains. Unfortunately, accessing these data sources can be difficult
        without proper knowledge of the KB structure and of a query language
        such as SPARQL. Question Answering over Knowledge Bases (KBQA)
        addresses this problem by allowing users to pose questions in natural
        language to extract the desired information from the KBs, shielding
        them from any technical complexity. In this context, the QALD
        challenges aim to evaluate and encourage the development of KBQA
        systems. In this paper, we describe sparql-qa, a system for question
        answering over KBs that has been submitted to the QALD-10 challenge.
        Our method uses Neural Machine Translation and Named Entity Recognition
        tasks that complement each other to create SPARQL queries. The
        experiments were performed using the Wikidata QALD-10 benchmark.


1     Introduction

Today, searching and retrieving the rich information stored in large knowledge
bases is still a challenging task for lay users who do not know the structure
of the knowledge base and the appropriate query languages, such as SPARQL.
As a result, AI techniques for natural language Question Answering (QA) have
taken a central role in the area of the Semantic Web to address such issues.
Indeed, systems able to translate questions posed in natural language into SPARQL
queries have the potential to overcome this problem because they can hide
all technical complexity from end users.
    In this paper, we describe sparql-qa [3], an AI system for QA over knowledge
bases. The core of our architecture is a Neural Machine Translation
(NMT) [1] module, built on Bidirectional Recurrent Neural
Networks [11] and trained side by side with a Named Entity Recognition (NER)
module implementing a BiLSTM-CRF network [7]. The NMT module translates
the input NL question into a SPARQL template, whereas the NER module
extracts the entities from the question. Combining the outputs of the
two modules yields a SPARQL query ready to be executed.
    In the following, we provide an overview of the system that participated in
the 10th edition of the Question Answering over Linked Data (QALD) challenge (footnote 1).

Footnote 1: http://qald.aksw.org/

2    System Architecture


The potential of exploiting knowledge bases can be increased by allowing any
user to query the ontology by posing questions in natural language. In this paper,
this problem is seen as the following Natural Language Processing task: Given an
RDF knowledge base \(O\) and a question \(Q_{nat}\) in natural language (to be answered
using \(O\)), translate \(Q_{nat}\) into a SPARQL query \(SQ_{nat}\) such that the answer to \(Q_{nat}\)
is obtained by running \(SQ_{nat}\) on \(O\).
    The starting point is a training set containing a number of pairs \(\langle Q_{nat}, GQ_{nat} \rangle\),
where \(Q_{nat}\) is a natural language question and \(GQ_{nat}\) is a SPARQL query
called the gold query. The gold query models (i.e., allows retrieving from \(O\))
the answers to \(Q_{nat}\). The training set is used to learn how to answer questions
posed in natural language using \(O\), so that, given a question \(Q_{nat}\), the QA
system can generate a query \(SQ'_{nat}\) that is equivalent to the gold query
\(GQ_{nat}\) for \(Q_{nat}\), i.e., such that
\(\mathit{answers}(SQ'_{nat}) = \mathit{answers}(GQ_{nat})\). In other words, we compare the answers
and are not interested in reproducing the gold query syntactically. We approach
this problem as a machine translation task, that is, we compute \(SQ'_{nat}\)
as \(SQ'_{nat} = \mathit{Translate}(Q_{nat})\), where \(\mathit{Translate}\) is the translation implemented
by our QA system, called sparql-qa. Most of the solutions proposed so far
to convert natural language to SPARQL rely either on patterns or on deep
neural networks.
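
    The equivalence criterion above compares answer sets rather than query strings. A minimal Python sketch of that check follows, assuming the answers of both queries have already been retrieved as lists of strings. The function name same_answers and the example values are illustrative and not part of sparql-qa.

```python
def same_answers(system_answers, gold_answers):
    """Return True when both queries yield the same set of answers.

    Ignoring order and duplicates is an assumption of this sketch.
    """
    return set(system_answers) == set(gold_answers)


# Two syntactically different queries count as equivalent
# as long as their result sets coincide (placeholder answer values).
gold = ["answer-1", "answer-2"]
system = ["answer-2", "answer-1"]
print(same_answers(system, gold))
```
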
    To reduce the impact of words out of the vocabulary (WOOV) and improve
the training time of the entire process, we introduce in sparql-qa some
remedies, including a new format to represent an NL-to-SPARQL dataset. In
particular, sparql-qa implements a neural-network-based architecture
for question answering that achieves this objective through a novel
combination of tools. The architecture is composed of three main modules:
Input preparation, Translation, and Assembling, as shown in Figure 1 (left). Each
module is described in the following.


Fig. 1. System architecture (left) and model for joint training of NMT and NER (right).

2.1   Input preparation
In this phase, the input sentence is processed in such a way that it is polished to
attenuate linguistic noise (e.g., shifts in spelling, grammar, punctuation, entity
identification) and also recast to be used as input for the subsequent phase.
Acronyms normalization. In this step, acronyms are converted to the corresponding
names. For example, the acronym UK becomes United Kingdom. Acronyms
regularly refer to KB resources inside the SPARQL query, and we need to replace
them with the full name, which is closer to the KB resource name. In our
approach, this is particularly useful for handling acronyms of countries. To perform
this task, we rely on two libraries. The first one is spaCy [6], a well-known
tool for NLP tasks, which helps us identify the acronyms thanks to
its NER mechanism. We then use the Country Converter (COCO)
library to obtain the full name.
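
    The effect of this step can be sketched in a few lines of Python. The sketch is a simplified stand-in: it uses a small hand-written lookup table (ACRONYMS, hypothetical) and a regular expression instead of the spaCy and Country Converter libraries that sparql-qa relies on.

```python
import re

# Hypothetical lookup table; sparql-qa uses spaCy and Country Converter instead.
ACRONYMS = {
    "UK": "United Kingdom",
    "USA": "United States",
}


def normalize_acronyms(question, table=ACRONYMS):
    """Replace whole-word, upper-case acronyms found in the table."""
    def expand(match):
        token = match.group(0)
        # Unknown acronyms are left untouched.
        return table.get(token, token)

    return re.sub(r"\b[A-Z]{2,}\b", expand, question)


print(normalize_acronyms("What is the capital of the UK?"))
# What is the capital of the United Kingdom?
```
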
Fixing entities with NEL. We apply this preprocessing step to identify the entities
in \(Q_{nat}\) and replace them with the labels used in the KB, because our approach
relies heavily on the correct spelling of the entities. For example, in the question
Where was the president Kennedy born?, we aim to transform the entity
"Kennedy" into "John F. Kennedy", which is the label used to identify the
resource. In our implementation, we address this problem with Named Entity
Linking (NEL), also referred to as Named Entity Disambiguation, which is the
task of linking entities mentioned in the text with their corresponding entities
in a target KB [10]. To this end, we employ DBpedia Spotlight [4].
Tokenization. Tokenization is a fundamental step in almost all NLP
methods. It is the task of chopping a text up into pieces called tokens, which
are usually words. During this process, \(Q_{nat}\) is cleaned by filtering out undesired
characters, such as punctuation marks, and converted into a sequence of words \(S\).
The sequence is then converted into a sequence of word embeddings. To mitigate
the dependency on the input vocabulary (English) and reduce the impact of the
WOOV, we use the pre-trained word embeddings of FastText [2]. We chose
FastText because it also provides embeddings for words outside
the training vocabulary.
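
    The two ideas in this step can be illustrated with a short Python sketch: punctuation filtering followed by splitting into words, and the character n-grams that subword models such as FastText [2] build word vectors from, which is why unseen words still receive an embedding. The exact cleaning rules of sparql-qa are not given in the text, so the regular expression and both function names are assumptions.

```python
import re


def tokenize(question):
    """Drop punctuation and split the question into words."""
    cleaned = re.sub(r"[^\w\s]", " ", question)
    return cleaned.split()


def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, as used by subword models."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(tokenize("Who painted the Mona Lisa?"))
# ['Who', 'painted', 'the', 'Mona', 'Lisa']
print(char_ngrams("Lisa"))
# ['<Li', 'Lis', 'isa', 'sa>']
```
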

2.2   Translation
In this stage, the pre-processed \(Q_{nat}\) is analyzed to recognize both its structure
and the named entities that serve to build \(SQ'_{nat}\) in the subsequent assembling
phase. To perform this recognition, we employ two neural networks performing two
key NLP tasks: (i) a Neural Machine Translation (NMT) task, and (ii) a Named
Entity Recognition (NER) task.


             Table 1. \(\langle Q_{nat}, Q_{sparql} \rangle\) pair for Who painted the Mona Lisa?

Question                        Query
Who painted the Mona Lisa?      select ?a where { wd:Q12418 wdt:P170 ?a }

    In the NMT task, \(Q_{nat}\) is translated into a query template. A query template
is the skeleton of a SPARQL query where some of the KB resources are replaced
by placeholders. In general, in an NMT task, given a source sequence
\(X = (x_1, x_2, \ldots, x_n)\) and a target sequence \(Y = (y_1, y_2, \ldots, y_m)\), the aim is to model the
conditional probability of the target words given the source sequence [1]. In our
approach, an NMT neural network takes as input the preprocessed question and
translates it into a SPARQL query template.
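
    For reference, the conditional probability mentioned above is usually factorized token by token in sequence-to-sequence models [1, 11]. The formula below is this standard formulation, written with the symbols \(X\) and \(Y\) introduced above; it is not a detail specific to sparql-qa.

```latex
p(Y \mid X) = \prod_{t=1}^{m} p(y_t \mid y_1, \ldots, y_{t-1}, X)
```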
    In the NER task, the entities present in Qnat are identified and classified using
a dedicated neural network. In a NER task (also known as Entity Extraction),
named entities present in a text are associated with predefined categories, such
as individuals, companies, places, etc. This additional semantic knowledge helps
to understand the role of words in a given text [5]. As in most of the literature,
we adopt the BIO notation [9] to tag a text, which
differentiates the beginning “B” and the interior “I” of the entities, while “O” is
used for non-entity tokens. The named entities identified by NER are combined
with the output of NMT in the assembling phase to build \(SQ'_{nat}\).
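
    To make the BIO notation concrete, the following Python sketch groups tagged tokens back into entity strings. It assumes the plain B/I/O tags described above, without entity-type suffixes; extract_entities is an illustrative name, not a function of sparql-qa.

```python
def extract_entities(tokens, tags):
    """Group tokens tagged B/I into entity strings, in order of appearance."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new entity starts; close the previous one if still open.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O" (or a stray "I") closes any open entity.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


tokens = ["Who", "painted", "the", "Mona", "Lisa", "?"]
tags = ["O", "O", "O", "B", "I", "O"]
print(extract_entities(tokens, tags))  # ['Mona Lisa']
```
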
Training set format. The NMT and NER networks are trained together using
the same input, obtained by converting the original training set into a novel
format called QQT format. This pre-processing step has a two-fold objective.
On the one hand, it aligns the inputs of both NMT and NER tasks; and, on the
other hand, it reduces the size of the output vocabulary and helps to mitigate
the impact of the WOOV during the translation. Translating entities to URIs
can be hard to learn from mere examples. A system can fail if the NER task is
not simple and there are a lot of words that are out of the vocabulary of the
training set.
    A dataset in QQT is composed of a set of triples of the form
\(\langle Question, QueryTemplate, Tagging \rangle\), where Question is a natural language question,
Tagging marks which parts of Question are entities, and QueryTemplate is a
SPARQL query template modified as follows: (i) the KB resources are replaced
by one or more variables; (ii) a new triple of the form
"?var rdfs:label placeholder" is added for each variable. Placeholders are meant to be replaced by
substrings of Question depending on Tagging.
    In Table 1, we show an example of a \(\langle Q_{nat}, Q_{sparql} \rangle\) pair, while Table 2 shows
the corresponding \(\langle Question, QueryTemplate, Tagging \rangle\) triple in the QQT format.
In Table 2, the term $1 denotes a placeholder, where the index 1
means that $1 has to be replaced by the first entity occurring in the question,
Mona Lisa in this case, as indicated by the BIO tagging. Note that in
the QQT format, the query template does not contain any KB resource, so the
learning model does not need to learn that Mona Lisa stands for a specific KB
resource identifier. With this representation, the output vocabulary of the NMT
model is reduced, and the network is more tolerant to the WOOV problem.


               Table 2. QQT triple for Who painted the Mona Lisa?

Question                        QueryTemplate                              Tagging
Who painted the Mona Lisa?      select ?a where { ?w wdt:P170 ?a .         O O O B I O
                                ?w rdfs:label $1 }

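
    The conversion from the pair in Table 1 to the triple in Table 2 can be sketched as follows. This is a simplified illustration under several assumptions: the token spans of the entity mentions are given as input, the gold query has a single graph pattern ending with a closing brace, and resources are replaced by naive string substitution. The function to_qqt and the variable naming scheme ?v1, ?v2, ... are hypothetical.

```python
def to_qqt(tokens, query, mentions):
    """Build a (Question, QueryTemplate, Tagging) triple.

    tokens   -- question tokens
    query    -- gold SPARQL query, assumed to end with a single '}'
    mentions -- list of (resource, start, end) token spans, in question order
    """
    tags = ["O"] * len(tokens)
    body = query.rstrip().rstrip("}").rstrip()
    for index, (resource, start, end) in enumerate(mentions, start=1):
        var = f"?v{index}"
        # (i) replace the KB resource by a variable (naive string replacement)
        body = body.replace(resource, var)
        # (ii) add a label triple whose object is the placeholder $index
        body += f" . {var} rdfs:label ${index}"
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return " ".join(tokens), body + " }", " ".join(tags)


triple = to_qqt(
    ["Who", "painted", "the", "Mona", "Lisa", "?"],
    "select ?a where { wd:Q12418 wdt:P170 ?a }",
    [("wd:Q12418", 3, 5)],
)
print(triple[1])  # select ?a where { ?v1 wdt:P170 ?a . ?v1 rdfs:label $1 }
print(triple[2])  # O O O B I O
```
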
The networks. As we have seen, the architecture has two key components,
NMT and NER. To develop the NMT neural network, we use
the standard Encoder-Decoder approach with BiLSTM and Luong attention
[8], which has shown high performance in the literature. To
develop the NER network, we use the BiLSTM-CRF approach proposed by
Huang et al. [7], which assigns a tag to each token in the input sequence. Both models
rely on a BiLSTM-based encoder to obtain the semantic
information from the input sequence. For this reason, we train the two models
jointly, aiming to reduce training time and to let the
two networks help each other.
    The proposed approach, depicted in Figure 1 (right), has a single encoder
composed of one BiLSTM layer with two branches connected to it. The first
branch uses a CRF layer responsible for determining \(p(l \mid x)\), the
probability of a tagging sequence \(l\) given an input sequence \(x\). The
second one is an NMT decoder, composed of one LSTM layer and the attention
mechanism, followed by a fully connected network computing the probability
\(p(y \mid x)\) that the sequence \(y\) correctly translates \(x\) into a QueryTemplate.
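
    One plausible way to train both branches on the shared encoder is to minimize the sum of the two negative log-likelihoods. The text does not state how the two losses are combined, so the unweighted sum below is an assumption made only for illustration.

```latex
\mathcal{L}(x, l, y) = -\log p(l \mid x) - \log p(y \mid x)
```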

2.3   Assembling

The last step of the proposed architecture is the creation of \(Q'_{sparql}\). Here the
placeholders in \(Q'_{temp}\) are replaced by the corresponding named entities. For
example, suppose we have the question-query pair:

<Who developed Skype?, SELECT ?uri WHERE { wd:Q40984 wdt:P178 ?uri } >

in the form \(\langle Q_{nat}, Q_{sparql} \rangle\). Our system will translate \(Q_{nat}\) to the following query
template:

SELECT ?uri WHERE { ?v wdt:P178 ?uri . ?v rdfs:label $1 }

and the NER task will identify that Skype is the named entity. So the query
instantiation step will produce:

SELECT ?uri WHERE { ?v wdt:P178 ?uri . ?v rdfs:label "Skype"@en }

which is a SPARQL query equivalent to \(Q_{sparql}\).
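
    The instantiation step in this example can be sketched in Python as a placeholder substitution. The sketch assumes that entities are listed in order of appearance in the question and that labels are emitted as English language-tagged literals, as in the query above; the function name assemble is illustrative.

```python
import re


def assemble(template, entities, lang="en"):
    """Replace each $k placeholder with the k-th entity as a tagged literal."""
    def fill(match):
        index = int(match.group(1)) - 1
        # Escape double quotes so the literal stays well-formed.
        label = entities[index].replace('"', '\\"')
        return f'"{label}"@{lang}'

    return re.sub(r"\$(\d+)", fill, template)


template = "SELECT ?uri WHERE { ?v wdt:P178 ?uri . ?v rdfs:label $1 }"
print(assemble(template, ["Skype"]))
# SELECT ?uri WHERE { ?v wdt:P178 ?uri . ?v rdfs:label "Skype"@en }
```
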
    At this step, we validate the entities identified by the NER model to produce a
more accurate query. The system checks that the rdfs:label-based triples created
with the entities produce a non-empty result and, if so, considers them valid. If a triple
does not have a match in the KB, the system attempts to fix it using the entities
identified with the NEL tool. In this case, we pick the entity identified by NEL
whose surface form in the question overlaps with the entity we are trying to fix. If
a non-valid triple cannot be fixed, the system will probably respond with
an incorrect answer.
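
    The validation and fallback logic described above can be sketched as follows. The KB lookup is abstracted as a callable has_match, which in practice could run a query on the rdfs:label triple. That callable, the span representation, and the function name validate_entities are assumptions of this sketch rather than details given in the text.

```python
def validate_entities(entities, nel_entities, has_match):
    """Keep entities whose label matches the KB, else use an overlapping NEL entity.

    entities     -- list of (surface, start, end) token spans found by NER
    nel_entities -- list of (label, start, end) token spans found by the NEL tool
    has_match    -- callable(label) -> bool, True if the label triple has results
    """
    fixed = []
    for surface, start, end in entities:
        if has_match(surface):
            fixed.append(surface)
            continue
        replacement = surface  # left unchanged when no fix is found
        for label, n_start, n_end in nel_entities:
            # Half-open spans overlap when each starts before the other ends.
            if n_start < end and start < n_end:
                replacement = label
                break
        fixed.append(replacement)
    return fixed


# Toy KB containing a single label (illustrative only).
kb_labels = {"John F. Kennedy"}
ner = [("Kennedy", 4, 5)]
nel = [("John F. Kennedy", 4, 5)]
print(validate_entities(ner, nel, lambda label: label in kb_labels))
# ['John F. Kennedy']
```
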

3     Results
We have implemented our models using Keras, a well-known framework for machine
learning, on top of TensorFlow. The training process was carried out on
Google Colaboratory, a Jupyter notebook environment running entirely in the
cloud. Colaboratory provides an environment with 12 GB of RAM and the
possibility to run the code using a GPU configuration.
    To train and fine-tune the system, we used the dataset released for the tenth
edition of QALD (footnote 2). Table 3 shows the results obtained by sparql-qa in the
challenge, demonstrating that the system performed very well. The reported scores
were calculated using the GERBIL-QA (footnote 3) web service. A live demo of the system
is available at: https://www.mat.unical.it/ricca/sparqlqa.


                             Table 3. QALD-10 final scores.

                   Macro Precision   Macro Recall   Macro F1   Macro F1 QALD
       sparql-qa        0.4538          0.4574       0.4538        0.5947
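
    As a reading aid for Table 3, the sketch below shows how macro-averaged precision, recall, and F1 are commonly computed: each measure is calculated per question on the answer sets and then averaged over all questions. It is not the GERBIL-QA implementation. The treatment of empty answer sets is an assumption, and the Macro F1 QALD column follows GERBIL's own conventions, which are not reproduced here.

```python
def prf(system, gold):
    """Precision, recall and F1 for one question, computed on answer sets."""
    system, gold = set(system), set(gold)
    if not system and not gold:
        return 1.0, 1.0, 1.0  # assumption: empty vs. empty counts as correct
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


def macro_scores(pairs):
    """Average the per-question scores over all (system, gold) pairs."""
    scores = [prf(s, g) for s, g in pairs]
    n = len(scores)
    return tuple(sum(column) / n for column in zip(*scores))


# Two toy questions: one answered exactly, one with an extra wrong answer.
pairs = [(["a"], ["a"]), (["a", "b"], ["a"])]
print(macro_scores(pairs))
```
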


Conclusion
The paper presents sparql-qa, an approach for querying knowledge bases
using natural language. The system relies on the combination of Neural Machine
Translation and Named Entity Recognition while focusing on attenuating the
impact of out-of-vocabulary words. sparql-qa proved to be a valid approach in the field
of KBQA, as the results obtained in QALD-10 demonstrate.
   In future work, we plan to improve our system by integrating other NLP
tools, such as BERT word embeddings, Transformers, and Relation Linking.


References
 1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning
    to align and translate. arXiv preprint arXiv:1409.0473 (2014)
 2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with
    subword information. TACL 5, 135–146 (2017)
 3. Borroto, M.A., Ricca, F., Cuteri, B.: Reducing the impact of out of vocabulary
    words in the translation of natural language questions into sparql queries. arXiv
    preprint arXiv:2111.03000 (2021)
 4. Daiber, J., Jakob, M., Hokamp, C., Mendes, P.N.: Improving efficiency and
    accuracy in multilingual entity extraction. In: ICSS (I-Semantics) (2013)
 5. Grishman, R., Sundheim, B.M.: Message understanding conference-6: A brief
    history. In: COLING 1996 Volume 1: The 16th Int. Conf. on Comp. Linguistics (1996)
Footnote 2: https://github.com/KGQA/QALD_10
Footnote 3: https://gerbil-qa.aksw.org/gerbil/experiment?id=202205200035

 6. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy:
    Industrial-strength Natural Language Processing in Python (2020)
 7. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging.
    CoRR abs/1508.01991 (2015)
 8. Luong, M., Pham, H., Manning, C.D.: Effective approaches to attention-based
    neural machine translation. arXiv preprint arXiv:1508.04025 (2015)
 9. Ramshaw, L.A., Marcus, M.P.: Text chunking using transformation-based learning.
    In: Nat. lang. proc. using very large corpora, pp. 157–176. Springer (1999)
10. Shen, W., Wang, J., Han, J.: Entity linking with a knowledge base: Issues,
    techniques, and solutions. IEEE TKDE 27(2), 443–460 (2014)
11. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural
    networks. In: NIPS. pp. 3104–3112 (2014)