Answering Questions over RDF by Neural Machine Translation

Shujun Wang, Jie Jiao, Yuhan Li, Xiaowang Zhang*, and Zhiyong Feng

College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China
* Corresponding author: {xiaowangzhang}@tju.edu.cn

Abstract. Question Answering over Knowledge Bases (KBQA) is the task of accurately answering a natural language question over a knowledge base. Unlike previous KBQA methods, which follow a pipelined approach focused on entity linking and relation path ranking, this paper presents a translation-based approach that translates natural language questions into SPARQL queries. Specifically, we fill the gap between natural language questions and SPARQL by exploiting multiple Neural Machine Translation (NMT) models, such as RNN, CNN, and Transformer architectures. More importantly, we bridge the gap between NMT models and existing KBQA by combining the entity linking and relation linking technologies of KBQA with the NMT model. On this basis, we design four novel question translation approaches applicable to any NMT model: "Pure NMT", "NMT + Entity Linking", "NMT + Relation Linking", and "NMT + Entity Linking + Relation Linking". Compared to traditional KBQA systems using state-of-the-art semantic parsers, our method achieves an accuracy of 67.9% on the QALD-9 dataset and ranks first.

1 Introduction

Knowledge base question answering (KBQA) is an important task in NLP with many real-world applications, for example in search engines and decision support systems. Most existing methods for KBQA use a pipelined approach: first, given a question q, an entity linking step finds the KB entities mentioned in q; next, the relations or relation paths in the KB linked to the topic entities are ranked, and the relation or relation path best matching q is selected as the one that leads to the answer entities.

In view of the success of Neural Machine Translation (NMT), it comes as a surprise that very few such models have been used to address the question translation challenge (Question → SPARQL) in KBQA. Although some NMT-based approaches have been proposed for answering questions over RDF, these methods did not make use of the more recent Transformer model; more importantly, they did not try to connect the NMT model with the key technologies of traditional KBQA.

To bring NMT models to the KBQA area, this paper presents a large-scale comparison of three distinct neural network architectures: Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and the Transformer. Further, we bridge the gap between NMT models and traditional KBQA by combining NMT models with the key technologies of traditional KBQA (entity linking and relation linking) to form four NMT-based KBQA approaches.
2 Overview

Fig. 1. The four NMT-based KBQA models: a question such as "Who developed Skype?" is preprocessed ("who developed skype") and handled by one of four variants (NMT with no marking; NMT-E with entity recognition and markers; NMT-R with relation recognition and markers; NMT-ER with both). Each variant feeds an NMT model whose output sequence, e.g. "select distinct variable where skype developer variable", is mapped by the SPARQL Encoding Module to the final query SELECT DISTINCT ?uri WHERE { res:Skype dbo:developer ?uri . }

As shown in Figure 1, we divide these four models into two categories, namely Pure Translation (NMT) and Template-based Translation (NMT-E, NMT-R, and NMT-ER):

– Pure Translation: the NMT model is trained directly on pairs of questions and SPARQL query sequences.
– Template-based Translation: the NMT model is trained on pairs of question templates and SPARQL template sequences.

The sketch below illustrates the source/target pairs each variant is trained on.
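To make the two categories concrete, the following Python sketch (ours, purely illustrative, not the authors' code) builds the pairs each variant would see for the running example of Figure 1; the simple string replacements stand in for the entity and relation recognition components:

    # Question and encoded SPARQL after preprocessing (running example from Fig. 1).
    question = "who developed skype"
    sparql = "select distinct variable where skype developer variable"

    def mask_entities(q, s):
        # NMT-E: linked entity mentions become markers <e1>, <e2>, ...
        return q.replace("skype", "<e1>"), s.replace("skype", "<e1>")

    def mask_relations(q, s):
        # NMT-R: relation mentions become markers <p1>, <p2>, ...
        return q.replace("developed", "<p1>"), s.replace("developer", "<p1>")

    pairs = {
        "NMT": (question, sparql),                                   # pure translation
        "NMT-E": mask_entities(question, sparql),                    # entity templates
        "NMT-R": mask_relations(question, sparql),                   # relation templates
        "NMT-ER": mask_relations(*mask_entities(question, sparql)),  # both masked
    }
    for name, (src, tgt) in pairs.items():
        print(f"{name:7s} {src} -> {tgt}")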
3 Methodology

3.1 SPARQL Encoding

Unlike natural language, which can be easily tokenized, SPARQL queries are internally structured, combining elements of the query language with elements from the KBs and variables. Thus, a SPARQL Encoding Module is first employed to encode each query as a sequence. Specifically, we ignore the prefixes of URIs; brackets, wildcards, and dots are replaced by their verbal descriptions; and SPARQL operators are lower-cased and represented by a specified number of tokens. These operations can be implemented as a set of replacements, and applying them turns an original SPARQL query into a final sequence containing only plain word-like tokens. An example is shown in Figure 1; a minimal sketch of such an encoder is given below.
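Since the paper does not list the exact replacement rules, the following Python sketch is only one plausible realization; the token names (brack_open, sep_dot, variable) and the prefix-stripping pattern are our assumptions:

    import re

    # One plausible ordered set of replacements; token names are assumptions,
    # not the paper's exact inventory.
    REPLACEMENTS = [
        (r"\bSELECT\b", "select"),      # lower-case SPARQL operators
        (r"\bDISTINCT\b", "distinct"),
        (r"\bWHERE\b", "where"),
        (r"\b\w+:", ""),                # ignore URI prefixes such as res: or dbo:
        (r"\{", " brack_open "),        # verbalize brackets
        (r"\}", " brack_close "),
        (r"\?\w+", " variable "),       # verbalize variables/wildcards
        (r"\s\.\s", " sep_dot "),       # verbalize triple-pattern dots
    ]

    def encode_sparql(query: str) -> str:
        for pattern, repl in REPLACEMENTS:
            query = re.sub(pattern, repl, query)
        return " ".join(query.split())  # normalize whitespace

    print(encode_sparql("SELECT DISTINCT ?uri WHERE { res:Skype dbo:developer ?uri . }"))
    # -> select distinct variable where brack_open Skype developer variable sep_dot brack_close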
3.2 Tested NMT Models

Neural Machine Translation (NMT) models are widely used in machine translation and achieve excellent performance there. We use NMT models to translate English questions into SPARQL. First, we encode the English questions (or question templates) and SPARQL queries (or query templates) into embedding representations. Then, we feed them to the NMT models for training. Finally, the trained model converts any English input question into its corresponding SPARQL query.

In this poster, we compare three types of network architectures (RNN-based, CNN-based, and self-attention models), since these represented the best performing NMT architectures at the time of the experiments, leaving aside hybrid and ensemble methods. Encoded SPARQL queries and natural language questions are fed to the networks at the word level; the sketch below shows what such a word-level model looks like for the self-attention case.
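For concreteness, here is a minimal PyTorch sketch of a self-attention (Transformer) translator; the vocabulary sizes and hyperparameters are illustrative placeholders, not the settings of our experiments, and positional encodings are omitted for brevity:

    import torch
    import torch.nn as nn

    SRC_VOCAB, TGT_VOCAB, D_MODEL = 8000, 2000, 256  # placeholder sizes

    class Seq2SeqTransformer(nn.Module):
        """Word-level encoder-decoder: question tokens in, encoded SPARQL tokens out."""
        def __init__(self):
            super().__init__()
            self.src_emb = nn.Embedding(SRC_VOCAB, D_MODEL)
            self.tgt_emb = nn.Embedding(TGT_VOCAB, D_MODEL)
            self.transformer = nn.Transformer(
                d_model=D_MODEL, nhead=8,
                num_encoder_layers=3, num_decoder_layers=3,
                batch_first=True)
            self.out = nn.Linear(D_MODEL, TGT_VOCAB)

        def forward(self, src_ids, tgt_ids):
            # Causal mask so each SPARQL token only attends to earlier ones.
            tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
            hidden = self.transformer(self.src_emb(src_ids), self.tgt_emb(tgt_ids),
                                      tgt_mask=tgt_mask)
            return self.out(hidden)  # per-position logits over the SPARQL vocabulary

    model = Seq2SeqTransformer()
    logits = model(torch.randint(0, SRC_VOCAB, (2, 12)),  # a batch of 2 questions
                   torch.randint(0, TGT_VOCAB, (2, 9)))   # shifted target sequences
    # Training minimizes cross-entropy between logits and the gold SPARQL tokens.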
3.3 Template-based Translation

Treating SPARQL as a foreign language is a novel and direct approach to the KBQA task: a question is turned into a SPARQL query by machine translation alone. However, such a model fails to translate the entities and predicates of a question accurately when their mentions have not occurred in the training set.

We therefore consider learning the structural and local semantic information of questions and SPARQL queries without entities and predicates, that is, translating question templates into SPARQL query templates; we call this Template-based Translation. Since no specific entity is involved and only positional information is learned, this yields better generality and performance.

Template Construction: There are three main ways to preprocess the data when constructing templates: substituting entities, substituting predicates, and substituting both entities and predicates, as shown in Figure 1. In this step, we rely on an existing entity linking tool [6] to recognize entities in the question and mask them with markers <e_i>. For relation mentions, we directly treat the verbs and adjectives in the question as relations and replace them with markers <p_i>. A sketch of this construction is given below.
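As a rough illustration, the following Python sketch uses spaCy's part-of-speech tagger as a stand-in for the verb/adjective recognition, and link_entities is a placeholder for the entity linking tool [6], with an assumed output for the running example:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English pipeline with a POS tagger

    def link_entities(question):
        # Placeholder for the entity linking tool [6]; returns recognized mentions.
        return ["Skype"]  # assumed output for the running example

    def build_template(question):
        # Treat verbs and adjectives as relation mentions and mask them with <p_i>.
        out, p = [], 0
        for tok in nlp(question):
            if tok.pos_ in {"VERB", "ADJ"}:
                p += 1
                out.append(f"<p{p}>")
            else:
                out.append(tok.text)
        template = " ".join(out)
        # Mask linked entity mentions with <e_i>.
        for i, mention in enumerate(link_entities(question), start=1):
            template = template.replace(mention, f"<e{i}>")
        return template

    print(build_template("Who developed Skype"))  # -> Who <p1> <e1>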
4 Experiment and Evaluation

4.1 Datasets and Metrics

Our method is evaluated on two well-known public datasets, the Monument dataset and QALD-9. Each dataset is split randomly into training, validation, and test sets in an 8:1:1 ratio.

Accuracy (Acc). Acc is a metric for evaluating the query results, computed as follows:

\mathrm{Acc} = \frac{\text{the number of right answers}}{\text{the number of query answers}} \quad (1)

4.2 Evaluation

Table 1. Results on QALD-9

              NMT             NMT-E           NMT-R           NMT-ER
              Dev     Test    Dev     Test    Dev     Test    Dev     Test
CNN-based     0.6607  0.5536  0.7679  0.6429  0.5582  0.5921  0.7500  0.6071
LSTM          0.2500  0.1607  0.3214  0.4821  0.5668  0.6169  0.3036  0.2679
Transformer   0.6786  0.5000  0.7143  0.6786  0.6051  0.6255  0.6071  0.5714

As shown in Table 1, "Transformer + NMT-E" beats all other combinations and takes first place with a test accuracy of 0.6786, while the best result in the QALD-9 competition was Acc = 0.293, obtained by gAnswer.

Table 2. Results on Monument

              NMT             NMT-E           NMT-R           NMT-ER
              Dev     Test    Dev     Test    Dev     Test    Dev     Test
CNN-based     0.9851  0.9876  0.8830  0.8736  0.9531  0.9659  0.9675  0.9723
LSTM          0.9703  0.9655  0.9155  0.9175  0.9766  0.9703  0.9804  0.9872
Transformer   0.9642  0.9757  0.8830  0.8736  0.9631  0.9652  0.9675  0.9723

As shown in Table 2, the pure translation model "CNN-based + NMT" beats all other combinations and takes first place with a test accuracy of 0.9876. The results on the two datasets show that it is feasible to translate questions into SPARQL queries by NMT alone; however, accuracy can be further improved by combining entity recognition and relation recognition.

5 Conclusion

Querying knowledge graphs with natural language questions provides an easy and natural way for common users to acquire useful knowledge. Most traditional approaches perform semantic parsing by recognizing the entities and relations of the question and assembling them into a semantic query graph; however, this is very time-consuming. Thus, in this poster, we propose a translation-based method that translates natural language questions into SPARQL queries. Extensive empirical evaluations over several benchmarks demonstrate that the proposed approach is useful and promising.

Acknowledgments

This work is supported by the National Key Research and Development Program of China (2017YFC0908401) and the National Natural Science Foundation of China (61972455, 61672377). Xiaowang Zhang is supported by the Peiyang Young Scholars program of Tianjin University (2019XRX-0032).

References

1. R. Cai, B. Xu, Z. Zhang, X. Yang, Z. Li, Z. Liang: An Encoder-Decoder Framework Translating Natural Language to Database Queries. In Proc. of IJCAI 2018, pp. 3977–3983.
2. L. Dong, M. Lapata: Language to Logical Form with Neural Attention. In Proc. of ACL 2016, pp. 33–43.
3. J. Gehring, M. Auli, D. Grangier, D. Yarats, Y.N. Dauphin: Convolutional Sequence to Sequence Learning. In Proc. of ICML 2017, pp. 1243–1252.
4. M.T. Luong, H. Pham, C.D. Manning: Effective Approaches to Attention-based Neural Machine Translation. In Proc. of EMNLP 2015, pp. 1412–1421.
5. T. Soru, E. Marx, D. Moussallem, G. Publio, A. Valdestilhas, D. Esteves, C.B. Neto: SPARQL as a Foreign Language. SEMANTiCS Posters & Demos 2017.
6. Y. Yang, M. Chang: S-MART: Novel Tree-based Structured Learning Algorithms Applied to Tweet Entity Linking. In Proc. of ACL 2015, pp. 504–513.