A system for translating natural language questions into SPARQL queries with neural networks: Preliminary results

Manuel Alejandro Borroto manuel.borroto@unical.it University of Calabria

Via Pietro Bucci 87036 Rende Cosenza Italy

Francesco Ricca francesco.ricca@unical.it University of Calabria

Via Pietro Bucci 87036 Rende Cosenza Italy

Bernardo Cuteri bernardo.cuteri@unical.it University of Calabria

Via Pietro Bucci 87036 Rende Cosenza Italy

ISSN 1613-0073

Keywords: Knowledge base, Question Answering, Neural network

Knowledge bases now gather large volumes of information concerning multiple domains. Unfortunately, access to this information is difficult for users unfamiliar with the SPARQL query language and the knowledge base definition. In this paper, we present preliminary results on a system for automatic translation of natural language questions into SPARQL queries. Our method uses Neural Machine Translation and Named Entity Recognition tasks that complement each other to obtain a final query ready to be executed. We demonstrate the potential of our approach by presenting its results on the Monument dataset, which is a recently released dataset for Question Answering on the well-known DBpedia knowledge base.

Introduction

Today we live in what is known as the Digital Age. The way knowledge is generated and shared has changed dramatically: digital formats and the internet have made information much more accessible than traditional non-digital formats. As evidence of this, we now have vast and complex knowledge bases that gather large volumes of information by interlinking thousands of datasets covering various domains, in what is known as Linked Data. People thus have access to an unprecedented amount of information; the DBpedia [1] project, one of the most popular knowledge bases today, is a concrete example.

The problem is that searching and retrieving information stored in this way can be hard for lay users, because it requires knowing the structure of the knowledge base and an appropriate query language, such as SPARQL [2]. As a result, natural language Question Answering (QA) has taken a central role in the area of the Semantic Web to address such issues. A group of QA approaches, especially the most recent, have begun to take advantage of the progress achieved by Deep Learning and use deep neural networks to tackle the problem, proposing systems for the automatic translation of natural language questions into SPARQL queries and hiding all technical complexity from the final users.

In this context, we propose a system for the automatic translation of natural language questions into SPARQL queries. More specifically, we employ LSTM [3] neural networks because of their proven effectiveness in Natural Language Processing. The system consists of two parts. The first one translates the question in natural language into a SPARQL template using an LSTM encoder-decoder model, which is the state of the art for these types of tasks [4]. The second part is a model for Named Entity Recognition [5], also based on LSTM networks, which is responsible for extracting the entities from the question; the two results are finally combined to create a SPARQL query ready to be executed. In addition, we introduce a formal definition of a dataset format that greatly reduces the output space, is essential for the proper functioning of the system, and allows us to tackle the problem of words that are out of the vocabulary (OOV) of the training set, a major weakness of most related approaches today.

We demonstrate the potential of our approach by presenting its results on the Monument dataset, which is a recently released dataset for Question Answering on the well-known DBpedia knowledge base.

The remainder of the paper is structured as follows. In Section 2, we discuss related work. In Section 3, we describe our approach in detail. Section 4 focuses on the discussion of experiments and results. Finally, we provide some conclusions and directions for future work.

Related Work

Pattern-based. The idea of employing query patterns for mapping questions to SPARQL queries was already exploited in the literature [6,7]. The approach presented by Pradel et al. [6] also adopts named entity recognition but applies a set of predefined rules to obtain all the query elements and their relationships. The approach by Steinmetz et al. [7] has four phases: the question is parsed, the main focus is extracted, general queries are generated from the natural language phrases according to predefined patterns, and finally a subject-predicate-object mapping of the general question to RDF triples is made. Although both of the above-mentioned approaches performed well in selected benchmarks, they rely on patterns and rules defined manually for all existing types of questions, a limitation that is not present in our proposal.

Deep Learning-based. In the Seq2SQL approach [8], an LSTM Seq2Seq model is used to translate natural language into SQL queries. An interesting aspect of this approach is the use of Reinforcement Learning to guide the learning. The use of an LSTM-based Encoder-Decoder model with an attention mechanism to learn a vocabulary mapping between natural language and SPARQL was also proposed in the literature [9], obtaining good results.

The Neural SPARQL Machines (NSpM) [10] approach is based on the idea of modifying the SPARQL queries to treat them as a foreign language. To achieve this, the authors encoded the brackets, URIs, operators, and other symbols, making the tokenization process easier. The resulting dataset was fed to a Seq2Seq model responsible for performing the question-query mapping. The same authors created the DBNQA dataset [11], and their model was tested on a subdomain referring to monuments and evaluated using the purely syntactic BLEU score [10]. As a consequence, it performs well in reproducing the syntax of the gold query but is less able to generalize to unseen natural language questions and OOV words when compared with our approach.

The query building approach by Chen et al. [12] features two stages. The first stage predicts the query structure of the question and leverages that structure to constrain the generation of the candidate queries. The second stage ranks the candidate queries. As in our approach, Chen et al. [12] use BiLSTM networks, but their query representation is based on abstract query graphs.

Yin and colleagues [13] compared eight different models based on RNNs and CNNs. In this large experiment, the ConvS2S [14] model proved to be the best.

For completeness, we studied another related line of work that aims to translate natural language questions into SQL queries. The work proposed by Yu et al. [15] introduces a large-scale, complex, and cross-domain semantic parsing and text-to-SQL dataset. To validate the contribution, they used the proposed dataset to train different models that convert text to SQL queries. Most of the models were based on a Seq2Seq architecture with attention and demonstrated adequate performance. Another interesting case study is the editing-based approach for text-to-SQL generation introduced by Zhang et al. [16]. They implement a Seq2Seq model with Luong's attention, using BiLSTMs and BERT embeddings. The approach performs well on the SParC and Spider datasets, outperforming the related work in some cases.

Our architecture addresses the issues connected with the translation by resorting to specialized tools (NMT for the query template and NER for the entities), an aspect that is not present in the mentioned works. Moreover, existing approaches based on NMT do nothing special to deal with OOV words.

Translating Natural Language Questions to SPARQL

Knowledge bases (KB) are a rich source of information related to a great variety of domains, which can be accessed by experts of formal query languages. The potential of exploiting knowledge bases can be greatly increased by allowing any user to query the ontology by posing questions in natural language.

In this paper, this problem is seen as the following Natural Language Processing task: given an RDF knowledge base $O$ and a question $Q_{nat}$ in natural language (to be answered using $O$), translate $Q_{nat}$ into a SPARQL query $S_{Q_{nat}}$ such that the answer to $Q_{nat}$ can be obtained by running $S_{Q_{nat}}$ on the underlying ontology $O$.

The starting point is a training set containing a number of pairs $\langle Q_{nat}, G_{Q_{nat}} \rangle$, where $Q_{nat}$ is a natural language question, and $G_{Q_{nat}}$ is a SPARQL query, called the gold query. The gold query is a SPARQL query that models (i.e., allows one to retrieve from $O$) the answers to $Q_{nat}$. The training set has to be used to learn how to answer questions posed in natural language using $O$, so that, given a question in natural language $Q_{nat}$, the QA system can generate a query $S'_{Q_{nat}}$ that is equivalent to the gold query $G_{Q_{nat}}$ for $Q_{nat}$, i.e., such that $\mathit{answers}(S'_{Q_{nat}}) = \mathit{answers}(G_{Q_{nat}})$.¹

In particular, we approach this problem as a machine translation task, that is, we compute $S'_{Q_{nat}}$ as $S'_{Q_{nat}} = \mathit{Translate}(Q_{nat})$, where $\mathit{Translate}$ is the translation function implemented by our QA system, called sparql-qa.

In the remainder, we first present an intermediate format conceived to reduce the training time of the entire process and to limit the impact of words that reference individuals not mentioned in the training set. We then describe our translation modules, which take as input the dataset in the new format.

A new data set format

In general, NL to SPARQL datasets are composed of a set of pairs $\langle Q_{nat}, G_{Q_{nat}} \rangle$. In such a common type of representation, the named entities found in the question are typically represented directly by their URIs in the SPARQL query, but this transformation is hard to learn from mere examples, and the trained system would fail if the transformation cannot be described by simple rules. This is an issue, especially in large ontologies, where there is a huge number of resources.

To overcome this issue, we propose a new format, called QQT. A dataset in QQT is composed of a set of triples of the form ⟨𝑄𝑢𝑒𝑠𝑡𝑖𝑜𝑛, 𝑄𝑢𝑒𝑟𝑦𝑇𝑒𝑚𝑝𝑙𝑎𝑡𝑒, 𝑇𝑎𝑔𝑔𝑖𝑛𝑔⟩, where 𝑄𝑢𝑒𝑠𝑡𝑖𝑜𝑛 is a natural language question, 𝑇𝑎𝑔𝑔𝑖𝑛𝑔 marks which parts of 𝑄𝑢𝑒𝑠𝑡𝑖𝑜𝑛 are entities, and 𝑄𝑢𝑒𝑟𝑦𝑇𝑒𝑚𝑝𝑙𝑎𝑡𝑒 is a SPARQL query template with the following modifications: (𝑖) the KB resources are replaced by one or more variables; (𝑖𝑖) a new triple of the form "?var rdfs:label placeholder" is added for each variable. 𝑃𝑙𝑎𝑐𝑒ℎ𝑜𝑙𝑑𝑒𝑟𝑠 are meant to be replaced by substrings of 𝑄𝑢𝑒𝑠𝑡𝑖𝑜𝑛 depending on 𝑇𝑎𝑔𝑔𝑖𝑛𝑔.

In Table 1 we show an example of a $\langle Q_{nat}, Q_{sparql} \rangle$ pair for the question Who painted the Mona Lisa?, while Table 2 shows the corresponding ⟨𝑄𝑢𝑒𝑠𝑡𝑖𝑜𝑛, 𝑄𝑢𝑒𝑟𝑦𝑇𝑒𝑚𝑝𝑙𝑎𝑡𝑒, 𝑇𝑎𝑔𝑔𝑖𝑛𝑔⟩ triple in the QQT format.

In Table 2 the term $1 denotes a placeholder, where 1 means that it has to be replaced by the first entity occurring in the question, that is, Mona Lisa, as marked by 𝐵 and 𝐼 in 𝑇𝑎𝑔𝑔𝑖𝑛𝑔. Note that, in the QQT format, the query template does not contain any DBpedia resource. Thus, the learning model (the neural network in our case) does not need to understand that Mona Lisa stands for the dbr:Mona_Lisa resource, and the 𝑄𝑢𝑒𝑟𝑦𝑇𝑒𝑚𝑝𝑙𝑎𝑡𝑒 is exactly the same for all questions asking for the author of a given artwork.
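The two modifications that define a QQT query template can be illustrated with a minimal Python sketch. It is not the authors' conversion tool: the names `QQT` and `to_qqt`, the variable naming scheme `?w1, ?w2, ...`, and the assumption that entities are given as (resource, label) pairs in their order of occurrence in the question are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QQT:
    question: str
    query_template: str
    tagging: str


def to_qqt(question, query, entities):
    """Build a QQT triple from a question/query pair.

    `entities` lists (resource, label) pairs in the order in which the
    labels occur in the question (an assumption of this sketch).
    """
    # Separate the final question mark so that it becomes its own token.
    tokens = question.replace("?", " ?").split()
    tags = ["O"] * len(tokens)
    template = query
    label_triples = []
    for k, (resource, label) in enumerate(entities, start=1):
        var = f"?w{k}"
        # (i) replace the KB resource by a variable
        template = template.replace(resource, var)
        # (ii) add a label triple that holds the k-th placeholder
        label_triples.append(f"{var} rdfs:label ${k}")
        # mark the label span with B (begin) and I (inside)
        words = label.split()
        for i in range(len(tokens) - len(words) + 1):
            if tokens[i:i + len(words)] == words:
                tags[i:i + len(words)] = ["B"] + ["I"] * (len(words) - 1)
                break
    # Insert the label triples just before the last closing brace.
    head, _, tail = template.rpartition("}")
    head = head.rstrip()
    if not head.endswith("."):
        head += " ."
    template = head + " " + " . ".join(label_triples) + " }" + tail
    return QQT(question, template, " ".join(tags))


example = to_qqt(
    "Who painted the Mona Lisa?",
    "select ?a where { dbr:Mona_Lisa dbo:author ?a . }",
    [("dbr:Mona_Lisa", "Mona Lisa")],
)
print(example.query_template)
print(example.tagging)
```

Applied to the Mona Lisa pair, the sketch yields a template with no DBpedia resource and the tagging O O O B I O, matching the structure of Table 2 up to the variable name.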

The translation modules

Our approach consists of two deep neural networks, the first one specialized in Neural Machine Translation (NMT) based on the well-known Seq2Seq [4] model and the second one used for extracting the entities from the question using the Named Entity Recognition (NER) technique.

Neural Machine Translation

The network focused on NMT is used to translate the question into a SPARQL 𝑄𝑢𝑒𝑟𝑦𝑇𝑒𝑚𝑝𝑙𝑎𝑡𝑒. The network is based on an Encoder-Decoder model with Luong's attention [17], in which the Encoder extracts semantic content from the question in natural language and encodes it into a fixed-dimensional vector representation $V$. The Decoder, in turn, decodes $V$ into a sequence in the output language (𝑄𝑢𝑒𝑟𝑦𝑇𝑒𝑚𝑝𝑙𝑎𝑡𝑒).

The Encoder is composed of an input layer that receives a question in natural language converted into a sequence of word-embeddings obtained by means of FastText [18], in the form $\{x_1, x_2, \dots, x_t\}$, where $x_t$ is the vector representation of the word $t$ in the sentence. Next, we use a Bidirectional LSTM (BiLSTM) to summarize $\{x_1, x_2, \dots, x_t\}$ into $V$, in forward and reverse orders. $V$ is formed by concatenating the last hidden states in the two directions.

On the other hand, during the training process, the Decoder is responsible for calculating the word-embeddings of the output language tokens (SPARQL), which are used together with the vector $V$, provided by the Encoder, as input to a Luong-Decoder layer. This layer is responsible for decoding the sentence supported by the attention mechanism. Finally, the values are fed to a Fully Connected Network with a Softmax activation function that predicts the output sequence by calculating the conditional probability over the output vocabulary. Figure 1 shows the described network architecture.
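To make the attention step more concrete, the following sketch computes attention weights and a context vector for a single decoding step in plain Python. The text does not state which of Luong's score functions is used, so the dot-product variant here is an assumption, and the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def luong_dot_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step."""
    # score(h_t, h_s) = h_t . h_s (the "dot" variant, assumed here)
    scores = [sum(d * e for d, e in zip(decoder_state, h_s))
              for h_s in encoder_states]
    # attention weights: one probability per source position
    weights = softmax(scores)
    # context vector: weighted average of the encoder states
    dim = len(decoder_state)
    context = [sum(w * h_s[i] for w, h_s in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Toy example with two encoder states of dimension 2.
weights, context = luong_dot_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, context)
```

In a full model the context vector would be combined with the decoder state before the final Softmax layer; that part is omitted here.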

Named Entity Recognition

To perform the entity recognition, we created a BiLSTM-CRF [5] network, which constitutes the state of the art for this type of task. In this case, we again used FastText to obtain the word-embeddings and deal with OOV words. The model is composed of an input layer that receives the sequences of embeddings, followed by a BiLSTM connected to a Fully Connected layer. Finally, the information flows through a CRF layer that predicts the final sequence of tags. Figure 2 shows the described network architecture.
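The paper does not detail how the CRF layer selects the final tag sequence. A common choice for linear-chain CRFs is Viterbi decoding, sketched below under that assumption. The emission scores (standing in for the Fully Connected layer output) and the transition scores are invented toy values, not learned parameters.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: one {tag: score} dict per token
    transitions: {(previous_tag, current_tag): score}
    """
    best = [{t: emissions[0][t] for t in tags}]
    back = []
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for cur in tags:
            # best previous tag for reaching `cur` at this position
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + em[cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["B", "I", "O"]
# Toy transitions: an I tag may not directly follow an O tag.
TRANSITIONS = {(p, c): 0.0 for p in TAGS for c in TAGS}
TRANSITIONS[("O", "I")] = -10.0
# Toy emissions for: Where is Washington Monument located
EMISSIONS = [
    {"B": 0.0, "I": 0.0, "O": 2.0},
    {"B": 0.0, "I": 0.0, "O": 2.0},
    {"B": 2.0, "I": 2.1, "O": 0.0},  # emissions alone would prefer I here
    {"B": 0.5, "I": 2.0, "O": 0.0},
    {"B": 0.0, "I": 0.0, "O": 2.0},
]
print(viterbi(EMISSIONS, TRANSITIONS, TAGS))
```

The third token shows why a CRF layer helps: its emission scores slightly favour I, but the transition penalty for O followed by I makes B the globally best choice.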

Finally, we combine the results of both networks to obtain the final query $S'_{Q_{nat}}$. Here, the placeholders in the 𝑄𝑢𝑒𝑟𝑦𝑇𝑒𝑚𝑝𝑙𝑎𝑡𝑒 are replaced by the corresponding entities obtained with the NER network.

To better understand how sparql-qa works, consider, for example, the question: Where is Washington Monument located? First, the question is cleaned, split into tokens, and converted into a sequence of FastText word-embeddings. Then, the sequence is processed by the two networks used by our system. The NMT network translates the question into the corresponding QueryTemplate:

    SELECT DISTINCT ?a WHERE { ?w dbo:location ?a . ?w rdfs:label $1 }

Next, the NER network computes the tagging sequence O O B I O, which indicates that the entity to be considered is Washington Monument (positions 2 and 3 of the tagging sequence, counting from 0). Finally, in the composition phase, the results are combined, obtaining the final query:

    SELECT DISTINCT ?a WHERE { ?w dbo:location ?a . ?w rdfs:label "Washington Monument"@en }

It is important to note that the previous example describes the translation operation in general terms and does not go into the more complicated details of the process.
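The composition phase of the example can be sketched in a few lines of Python. This is an illustrative reconstruction, not the sparql-qa implementation: the function names, the fixed `en` language tag default, and the lack of escaping for quotes inside labels are simplifying assumptions.

```python
import re


def extract_entities(tokens, tags):
    """Collect entity strings from B/I/O tags, in order of occurrence."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


def fill_template(template, entities, lang="en"):
    """Replace each placeholder $k by the k-th entity as a tagged literal."""
    def repl(match):
        label = entities[int(match.group(1)) - 1]
        return f'"{label}"@{lang}'
    return re.sub(r"\$(\d+)", repl, template)


tokens = "Where is Washington Monument located".split()
tags = "O O B I O".split()
template = "SELECT DISTINCT ?a WHERE { ?w dbo:location ?a . ?w rdfs:label $1 }"
final_query = fill_template(template, extract_entities(tokens, tags))
print(final_query)
```

Keeping entity extraction separate from template filling mirrors the two-network design: either step can be replaced (for instance by an entity linker) without touching the other.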

Experiments on Monument dataset

Setup. The Monument dataset was proposed as part of the Neural SPARQL Machines (NSpM) [10] research. It contains 14,778 question-query pairs about the instances of type monument present in DBpedia. To compare our system with the state of the art, we trained the Learner Module of NSpM as was done in [10], where the authors proposed two instances of the Monument dataset, which we denote by Monument300 and Monument600, containing 8,544 and 14,788 pairs, respectively. In both cases, the dataset split fixes 100 pairs for both the validation and the test set and keeps the rest for the training set. All the data is publicly available in the NSpM GitHub project.²

We have implemented our system, called sparql-qa, by using Keras, a well-known framework for machine learning, on top of TensorFlow. We trained the networks by using Google Colaboratory, a virtual machine environment hosted in the cloud and based on Jupyter Notebooks. The environment provides 12GB of RAM and connects to Google Drive. To train our system, we first performed hyperparameter tuning focused on three hyperparameters: embedding size of the target language, batch size, and LSTM hidden units. The tuning was performed with a grid search. We set the number of epochs to 5, shuffling the dataset at the end of each one. After tuning, we set the hyperparameters of the two networks as follows: embedding size is set to 300, LSTM hidden units are set to 96, and batch size is set to 64.
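A grid search over the three tuned hyperparameters can be sketched as follows. The candidate values in `GRID` are hypothetical (the paper reports only the selected configuration), and `toy_score` is a stand-in for training a model and measuring its validation score; it is not derived from the paper's experiments.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Return the best configuration and its score over the full grid."""
    names = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical candidate values; only the selected ones appear in the paper.
GRID = {
    "embedding_size": [100, 200, 300],
    "lstm_units": [64, 96, 128],
    "batch_size": [32, 64],
}


def toy_score(config):
    """Placeholder for 'train for 5 epochs, return the validation score'."""
    return -(abs(config["embedding_size"] - 300)
             + abs(config["lstm_units"] - 96)
             + abs(config["batch_size"] - 64))


best_config, best_score = grid_search(GRID, toy_score)
print(best_config)
```

In practice `evaluate` would build and train the network for each configuration, which is why the grid is usually kept small.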

For comparing performance, we adopted the macro precision, recall, and F1-score measures, which are the most used ones to assess this kind of system.

Results. Results of the execution reported in Table 3 show that sparql-qa performs reasonably well, reaching F1-score values greater than 0.7. On the other hand, NSpM achieves better results.
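As a reference for how macro-averaged measures over answer sets are typically computed, here is a minimal sketch. The paper does not spell out its exact conventions, so two choices below are assumptions: a question whose predicted and gold answer sets are both empty counts as fully correct, and macro F1 is the mean of the per-question F1 values.

```python
def prf(predicted, gold):
    """Precision, recall, and F1 for the answer sets of one question."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0, 1.0, 1.0  # assumed convention for empty answers
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def macro_prf(pairs):
    """Average the per-question measures over (predicted, gold) pairs."""
    scores = [prf(predicted, gold) for predicted, gold in pairs]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Toy answer sets, not taken from the Monument dataset.
pairs = [
    ({"a"}, {"a"}),       # exact match
    ({"a", "b"}, {"a"}),  # one spurious answer
    (set(), {"c"}),       # no answer returned
]
macro_p, macro_r, macro_f1 = macro_prf(pairs)
print(macro_p, macro_r, macro_f1)
```

Because the measures compare answer sets and not query strings, two syntactically different queries that return the same answers receive a perfect score.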

We have investigated why our system could not provide an optimal answer for some questions. This analysis showed that the performance of our approach is mainly affected by problems in the dataset. Indeed, there is a set of questions that lack the context needed to determine the specific expected URIs. For example, for the question "What is Washington Monument related to?" our system uses "Washington Monument", but the gold query uses the specific URI Washington_Monument_(Baltimore). Note that there is no reference to Baltimore in the question text, and there are Washington Monuments also in Milwaukee and Philadelphia, according to DBpedia. Surprisingly, the compared system can often use the specific URI of the gold query even without context. Thus, we ran another experiment to better outline the issue. We created a new test set of 200 pairs by using the templates provided by NSpM and a randomly selected set of unseen monument entities extracted from DBpedia. Table 4 shows that our approach maintains the same good performance (F1-score greater than 0.78) and performs much better than NSpM, which is not able to generalize to OOV entities (F1-score of at most 0.11).

Another factor that affects our approach is how accurately the named entities are written. Sometimes the entities mentioned in the question do not match the rdfs:label property value, and sometimes they are referenced using acronyms. In these cases, our system will not give the expected answers because it cannot reference the right DBpedia resources. To address these issues, we plan to use Named Entity Linking (NEL) [19], which would allow us to determine accurately which DBpedia resources are mentioned in the question.

Finally, for completeness, we report that the intermediate format allows us to save 40% of training time.

Conclusions and Future Work

The paper presents preliminary results on an approach for querying SPARQL knowledge bases by using natural language. We combine in our system both neural machine translation and named entity recognition modules and focus on attenuating the impact of the OOV words, an important issue that is not well considered in existing approaches. Our system showed good preliminary results on the Monument dataset and demonstrated a more general and robust behavior than state-of-the-art approaches.

In future work, we plan to extend our system to improve translation performance by integrating other NLP tools, such as Named Entity Linking and BERT contextual word embeddings. We also plan to extend our experiments by considering other well-known QA benchmarks.

Figure 1: NMT neural network architecture

Figure 2: NER neural network architecture

Table 1: ⟨𝑄𝑢𝑒𝑠𝑡𝑖𝑜𝑛, 𝑄𝑢𝑒𝑟𝑦⟩ pair for Who painted the Mona Lisa?

Question                      Query
Who painted the Mona Lisa?    select ?a where { dbr:Mona_Lisa dbo:author ?a. }

Table 2: ⟨𝑄𝑢𝑒𝑠𝑡𝑖𝑜𝑛, 𝑄𝑢𝑒𝑟𝑦𝑇𝑒𝑚𝑝𝑙𝑎𝑡𝑒, 𝑇𝑎𝑔𝑔𝑖𝑛𝑔⟩ triple for Who painted the Mona Lisa?

Question                      QueryTemplate                                             Tagging
Who painted the Mona Lisa?    select ?a where { ?w dbo:author ?a. ?w rdfs:label $1 }    O O O B I O

Table 3: Comparison on Monument datasets.

             Mon300                   Mon600
             P       R       F1       P       R       F1
NSpM         0.860   0.861   0.852    0.929   0.945   0.932
sparql-qa    0.78    0.78    0.78     0.791   0.791   0.791

Table 4: Comparison with OOV entities on Monument datasets.

             Mon300                   Mon600
             P       R       F1       P       R       F1
NSpM         0.097   0.123   0.101    0.11    0.11    0.11
sparql-qa    0.795   0.795   0.795    0.785   0.785   0.785

¹ Note that we are interested in computing the answers and not in syntactically reproducing the gold query.

² https://github.com/LiberAI/NSpM/tree/master/data

Acknowledgments

This work was partially supported by the Italian Ministry of Economic Development (MISE) under the project "MAP4ID - Multipurpose Analytics Platform 4 Industrial Data", N. F/190138/01-03/X44.

References

[1] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web 6 (2015).

[2] W3C, Semantic web standards, 2014.

[3] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997).

[4] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: NIPS, 2014.

[5] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, CoRR abs/1508.01991 (2015).

[6] C. Pradel, O. Haemmerlé, N. Hernandez, Natural language query interpretation into SPARQL using patterns, 2013.

[7] N. Steinmetz, A. Arning, K. Sattler, From natural language questions to SPARQL queries: A pattern-based approach, in: BTW, volume P-289 of LNI, Gesellschaft für Informatik, Bonn, 2019.

[8] V. Zhong, C. Xiong, R. Socher, Seq2SQL: Generating structured queries from natural language using reinforcement learning, CoRR abs/1709.00103 (2017).

[9] F. F. Luz, M. Finger, Semantic parsing natural language into SPARQL: improving target language representation with neural attention, CoRR abs/1803.04329 (2018).

[10] T. Soru, E. Marx, D. Moussallem, G. Publio, A. Valdestilhas, D. Esteves, C. B. Neto, SPARQL as a foreign language, in: SEMANTiCS 2017 - Posters and Demos, 2017.

[11] A. Hartmann, E. Marx, T. Soru, Generating a large dataset for neural question answering over the DBpedia knowledge base, 2018.

[12] Y. Chen, H. Li, Y. Hua, G. Qi, Formal query building with query structure prediction for complex question answering over knowledge base, in: IJCAI, 2020.

[13] X. Yin, D. Gromann, S. Rudolph, Neural machine translating from natural language to SPARQL, CoRR abs/1906.09302 (2019).

[14] J. Gehring, M. Auli, D. Grangier, D. Yarats, Y. N. Dauphin, Convolutional sequence to sequence learning, in: ICML, volume 70 of Proc. of ML Research, PMLR, 2017.

[15] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task, arXiv preprint arXiv:1809.08887 (2018).

[16] R. Zhang, T. Yu, H. Y. Er, S. Shim, E. Xue, X. V. Lin, T. Shi, C. Xiong, R. Socher, D. Radev, Editing-based SQL query generation for cross-domain context-dependent questions, arXiv preprint arXiv:1909.00786 (2019).

[17] M. Luong, H. Pham, C. D. Manning, Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025 (2015).

[18] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, TACL 5 (2017).

[19] W. Shen, J. Wang, J. Han, Entity linking with a knowledge base: Issues, techniques, and solutions, IEEE Transactions on Knowledge and Data Engineering 27 (2014).