<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning-based Translation Performance Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shujun Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jie Jiao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mingyu Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaowang Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiyong Feng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Intelligence and Computing, Tianjin University</institution>
          ,
          <addr-line>Tianjin 300350</addr-line>
          ,
          <country country="CN">China</country>
          ;
          <institution>Tianjin Key Laboratory of Cognitive Computing and Application</institution>
          ,
          <addr-line>Tianjin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>RDF question answering (Q/A) translates natural language questions into SPARQL queries by employing question translation. One of the challenges of RDF Q/A is predicting the performance of questions before they are translated. Performance characteristics, such as the translation time, can help data consumers identify unexpectedly long-running questions before they start and estimate the system workload for scheduling. In this paper, we adopt machine learning techniques to predict the performance of question translation in RDF Q/A. Our work focuses on modeling the features of a question as a vector representation. Our feature modeling method does not depend on knowledge of the underlying systems or the structure of the underlying data, but only on the nature of the questions. We then use these features to train prediction models. Finally, based on this model, we design a single-machine parallel-batching RDF Q/A application. Evaluations are performed on real-world questions whose translation times range from milliseconds to minutes. The results demonstrate that our approach can effectively predict question translation performance.</p>
      </abstract>
      <kwd-group>
        <kwd>RDF</kwd>
        <kwd>Question Answering</kwd>
        <kwd>Performance Prediction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>RDF Q/A allows users to ask questions in natural language over a knowledge
base represented in RDF. Hence, it has received extensive attention in both the
natural language processing and database communities. The core task of RDF Q/A is
to translate natural language questions into SPARQL queries. Predicting the cost of question
translation can benefit many system management decisions. The challenge in
our work centers on capturing the characteristics of questions and representing those
characteristics as features for the application of machine learning techniques.</p>
      <p>The main contributions of this work are summarized as follows:
- We propose four ways to model the features of a question. The lexical features,
part of speech features, and dependency relation features can be acquired
from the question's dependency tree. The hybrid features can be derived from
the part of speech features and dependency relation features. All features can be easily
obtained without any information provided by the underlying systems.
- The RDF Q/A system we use is one of the most widely used systems in the
Semantic Web community. Thus our work will benefit a large population
of users.</p>
      <p>With the decline of computer hardware costs, the parallelism of computers
has gradually increased. Based on the prediction algorithm proposed above, we
design a single-machine, highly parallel RDF Q/A application to implement the
specific query transformation process.</p>
    </sec>
    <sec id="sec-2">
      <title>Feature Modeling</title>
      <p>We formulate the problem as follows: let N = (W, P, T) denote a question,
where W is the set of words contained in N, P is the set of posTags, and T is
the dependency tree of N. Feature modeling is the transformer that maps N to a
feature vector N ∈ R<sup>m</sup>, where m is the number of features.</p>
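      <p>As one concrete, purely illustrative reading of this formulation, the question tuple and the mapping into R<sup>m</sup> can be sketched in Python; the names Question and to_vector are ours, not the paper's:</p>

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Question:
    """N = (W, P, T): the words, posTags, and dependency tree of a question."""
    words: List[str]
    pos_tags: List[str]
    tree: List[Tuple[int, str, int]]  # (head, relation, dependent) edges

def to_vector(n: Question,
              extractors: List[Callable[[Question], List[float]]]) -> List[float]:
    """Feature modeling: map N to a vector in R^m by concatenating
    the feature groups of Sections 2.1-2.4."""
    vec: List[float] = []
    for extract in extractors:
        vec.extend(extract(n))
    return vec
```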
      <p>Lexical Features
We rst focus on each word's characteristics in the question, such as their lengths,
and the number of special words. More speci cally,
{ Word Length: the number of words whose length belongs to [1; 15], and the
number of words whose length is 16.
{ Special Words : We detect the number of three kinds of special wordsfall
upper-case, contains a hyphen and stop wordg</p>
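      <p>A minimal sketch of this lexical feature extraction, assuming 15 length buckets plus one bucket for length ≥ 16 and an illustrative stop-word list:</p>

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "do", "does"}  # illustrative subset

def lexical_features(words):
    """Lexical feature vector: 16 word-length buckets (lengths 1..15,
    plus one bucket for length >= 16) followed by counts of the three
    kinds of special words."""
    length_buckets = [0] * 16
    for w in words:
        length_buckets[min(len(w), 16) - 1] += 1
    all_upper = sum(1 for w in words if w.isupper())
    hyphenated = sum(1 for w in words if "-" in w)
    stops = sum(1 for w in words if w.lower() in STOP_WORDS)
    return length_buckets + [all_upper, hyphenated, stops]
```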
      <p>Most importantly, we use information entropy I(N) to measure the
uncertainty of a question:</p>
      <p>I(N) = −∑<sub>i=1</sub><sup>n</sup> p(w<sub>i</sub>) log<sub>2</sub> p(w<sub>i</sub>)   (1)</p>
      <p>where w<sub>i</sub> ∈ N and p(w<sub>i</sub>) refers to the probability of w<sub>i</sub> appearing in the corpus.</p>
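      <p>Formula (1) can be computed directly from corpus frequencies; the corpus argument below is a stand-in for whatever corpus the probabilities p(w<sub>i</sub>) are estimated from:</p>

```python
import math
from collections import Counter

def question_entropy(words, corpus):
    """Information entropy I(N) = -sum_i p(w_i) * log2 p(w_i),
    where p(w_i) is the probability of w_i appearing in the corpus."""
    freq = Counter(corpus)
    total = sum(freq.values())
    entropy = 0.0
    for w in set(words):
        p = freq[w] / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy
```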
      <p>Part of Speech Features.
In the process of translating natural language questions into SPARQL queries in
the RDF Q/A system, the part of speech of a word can determine whether the
word participates in the construction of the SPARQL query graph. For
example, nouns, verbs, and adjectives in questions are important components of the
SPARQL query graph. Therefore, in our work, we apply the Stanford POS tagger to
obtain the part of speech of each word contained in N. We collect the number of
occurrences of each different part of speech as the part of speech features of a given
question. Besides, we further insert the number of words at the beginning of the vector.</p>
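      <p>The part of speech feature vector can be sketched as follows; the tag inventory shown is an illustrative Penn Treebank subset, not the full Stanford tagset:</p>

```python
from collections import Counter

# Illustrative Penn Treebank subset; the paper uses the full Stanford tagset.
POS_TAGS = ["NN", "NNP", "NNS", "VB", "VBD", "VBZ", "JJ", "IN", "DT", "WP", "CC"]

def pos_features(pos_tags):
    """Count each part of speech over the question; the number of words
    is inserted at the beginning of the vector."""
    counts = Counter(pos_tags)
    return [len(pos_tags)] + [counts[t] for t in POS_TAGS]
```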
      <p>Dependency Relation Features.
The above two kinds of features mainly express the characteristics of the individual
words in a question. In this subsection, we emphasize the relationships between
different words.</p>
      <p>In our work, we collect the number of occurrences of each different dependency
relation as the dependency relation features. Note that we further insert the height
of the dependency tree at the beginning of the vector.</p>
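      <p>A sketch of the dependency relation features, assuming the tree is given as (head, relation, dependent) edges rooted at an artificial ROOT node at index 0; the relation list is an illustrative subset:</p>

```python
from collections import Counter

DEP_RELATIONS = ["nsubj", "dobj", "nmod", "det", "compound", "cc", "conj", "case"]  # illustrative subset

def dep_features(edges):
    """edges: (head_index, relation, dependent_index) triples of the
    dependency tree. Counts each relation; the tree height is inserted
    at the beginning of the vector."""
    counts = Counter(rel for _, rel, _ in edges)
    children = {}
    for head, _, dep in edges:
        children.setdefault(head, []).append(dep)

    def height(node):
        return 1 + max((height(c) for c in children.get(node, [])), default=0)

    root = 0  # assume index 0 is the artificial ROOT node
    return [height(root)] + [counts[r] for r in DEP_RELATIONS]
```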
      <p>Hybrid Features.
We build hybrid features by selecting the most predictive features based on the
part of speech features and dependency relation features.</p>
      <p>Definition 1 (Triple). Let T = ⟨p<sub>i</sub>, d, p<sub>j</sub>⟩, where p<sub>i</sub> and p<sub>j</sub> are part of speech
features, and d is a dependency relation feature between p<sub>i</sub> and p<sub>j</sub>. For
example, there is a triple ⟨WP, nsubj, VBD⟩ in Figure 1. T describes the structural
characteristics of questions.</p>
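      <p>Extracting triple counts from a parsed question can be sketched as follows; the `vocabulary` parameter stands in for the set of most predictive triples kept by feature selection:</p>

```python
from collections import Counter

def hybrid_features(pos_tags, edges, vocabulary):
    """Count <pos_head, relation, pos_dependent> triples over the
    dependency edges; `vocabulary` is the list of triples retained
    after feature selection."""
    triples = Counter(
        (pos_tags[head], rel, pos_tags[dep]) for head, rel, dep in edges
    )
    return [triples[t] for t in vocabulary]
```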
      <p>We use T as our hybrid features, which represent the structural
characteristics of questions. A synthetic feature vector example is shown below.</p>
      <p>[Figure 1: dependency tree of the question "Who acted in The Green Mile and Forrest Gump", with posTags (WP, VBD, IN, DT, NNP, CC) and dependency relations (nsubj, nmod, case, det, compound, cc, conj).]</p>
      <p>An advantage of SVR is its insensitivity to outliers.</p>
    </sec>
    <sec id="sec-3">
      <title>Parallel RDF Q/A</title>
      <p>[Figure 2: architecture of the parallel RDF Q/A application. A set of questions N1, N2, ..., Nn is fed to the overhead prediction model, which dispatches them to a server whose maximum parallelism is m.]</p>
      <sec id="sec-3-4">
        <title>RDF Q/A System</title>
        <p>Our system's task is to predict the overhead of translating N questions into N
SPARQL queries and then distribute the N questions across m processors. To
achieve this goal, we design Algorithm 1 to minimize the loss function in
Formula 3.</p>
        <p>Loss = min(max(M<sub>1</sub>, M<sub>2</sub>, ..., M<sub>m</sub>))   (3)</p>
        <p>where M<sub>i</sub> is the total overhead of all questions assigned to the i-th processor.</p>
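        <p>Algorithm 1 is not reproduced here; a standard greedy heuristic for this min-max assignment, placing the question with the longest predicted translation time onto the currently least-loaded processor, can serve as a sketch (an assumption about the algorithm's shape, not the paper's exact procedure):</p>

```python
import heapq

def assign_questions(predicted_costs, m):
    """Greedily assign questions to m processors: longest predicted
    translation time first, each onto the least-loaded processor.
    Approximately minimizes max(M_1, ..., M_m)."""
    loads = [(0.0, i, []) for i in range(m)]  # (M_i, processor index, questions)
    heapq.heapify(loads)
    for q, cost in sorted(enumerate(predicted_costs), key=lambda x: -x[1]):
        load, i, qs = heapq.heappop(loads)
        qs.append(q)
        heapq.heappush(loads, (load + cost, i, qs))
    return loads
```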
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>We use QALD (http://qald.aksw.org/) to verify the effectiveness of our parallel-batching
RDF Q/A system. Four experiments with a parallelism of 2, 4, 6, and 8 are shown in the
following four figures. Each experiment includes ten groups (10 questions in each group)
of question translation tests.</p>
      <p>In each experiment, we compare the performance of our approach with three other
methods: dividing the ten questions among the m processors by the number of
questions, dividing them by the number of words, and serial execution.</p>
      <p>The experiments show that our prediction model is accurate and that our parallel
RDF Q/A system achieves highly parallel question translation on a single server.</p>
      <p>[Figures: translation time per group for the four methods (Our System, Number of questions, Number of words, Serial) over groups 1-10.]</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Key Research and Development Program
of China (2017YFC0908401) and the National Natural Science Foundation of
China (61972455,61672377). Xiaowang Zhang is supported by the Peiyang Young
Scholars in Tianjin University (2019XRX-0032).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Zhang</surname>
            <given-names>W.</given-names>
          </string-name>
          ,
          ,
          <string-name>
            <surname>Qin</surname>
            <given-names>Y.</given-names>
          </string-name>
          , et al:
          <article-title>Learning-based SPARQL query performance modeling and prediction</article-title>
          .
          <source>In Proc. of WWW</source>
          <year>2018</year>
          , pp.
          <fpage>1015</fpage>
          -
          <lpage>1035</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chifu</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laporte</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            <given-names>J.</given-names>
          </string-name>
          , et al:
          <article-title>Query Performance Prediction Focused on Summarized Letor Features</article-title>
          .
          <source>In Proc. of SIGIR</source>
          <year>2018</year>
          , pp.
          <fpage>1177</fpage>
          -
          <lpage>1180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zou</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>H.</given-names>
          </string-name>
          , et al:
          <article-title>Natural language question answering over RDF: a graph data driven approach</article-title>
          .
          <source>In Proc. of SIGMOD</source>
          <year>2014</year>
          , pp.
          <fpage>313</fpage>
          -
          <lpage>324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hu</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zou</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            <given-names>J.</given-names>
          </string-name>
          , et al:
          <article-title>Answering Natural Language Questions by Subgraph Matching over Knowledge Graphs</article-title>
          .
          <source>In Proc. of ICDE</source>
          <year>2018</year>
          , pp.
          <fpage>1815</fpage>
          -
          <lpage>1816</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jiao</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>X.</given-names>
          </string-name>
          , et al:
          <article-title>Multi-Query Optimization in RDF Q/A System</article-title>
          .
          <source>In Proc. of ISWC</source>
          <year>2019</year>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>