Learning-based Translation Performance Prediction

Shujun Wang, Jie Jiao, Mingyu Yang, Xiaowang Zhang*, and Zhiyong Feng

College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China
* Corresponding author: xiaowangzhang@tju.edu.cn

Abstract. RDF question answering (Q/A) systems answer natural language questions by translating them into SPARQL queries. One of the challenges of RDF Q/A is predicting the performance of questions before they are translated. Performance characteristics, such as the translation time, can help data consumers identify unexpectedly long-running questions before they start and estimate the system workload for scheduling. In this paper, we adopt machine learning techniques to predict the performance of question translation in RDF Q/A. Our work focuses on modeling the features of a question as a vector representation. Our feature modeling method does not depend on knowledge of the underlying systems or on the structure of the underlying data, but only on the nature of the questions. We then use these features to train prediction models. Finally, based on this model, we design a single-machine parallel-batching RDF Q/A application. Evaluations are performed on real-world questions whose translation time ranges from milliseconds to minutes. The results demonstrate that our approach can effectively predict question translation performance.

Keywords: RDF · Question Answering · Performance Prediction

1 Introduction

RDF Q/A allows users to ask questions in natural language over a knowledge base represented in RDF. Hence, it has received extensive attention in both the natural language processing and database areas. The core task of RDF Q/A is to translate natural language questions into SPARQL queries. Predicting the performance of question translation can benefit many system management decisions.
The challenge in our work centers on capturing the characteristics of questions and representing them as features to which machine learning techniques can be applied. The main contributions of this work are summarized as follows:

– We adopt machine learning techniques to effectively predict the performance of questions before their execution.
– We propose four ways to model the features of a question. The lexical features, part of speech features, and dependency relation features can be acquired from the question's dependency tree. The hybrid features can be derived from the part of speech features and dependency relation features. All features can be obtained easily, without any information provided by the underlying systems.
– The RDF Q/A system we use is one of the most widely used systems in the Semantic Web community. Thus our work will benefit a large population of users.

As hardware costs decline, the degree of parallelism available on a single machine keeps increasing. Based on the prediction approach described above, we design a single-machine, highly parallel RDF Q/A application that carries out the question translation process.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Feature Modeling

We formulate the problem as follows. Let N = (W, P, T) denote a question, where W is the set of words contained in N, P is the set of POS tags, and T is the dependency tree of N. Feature modeling is the transformation that maps the question N to a feature vector $\mathbf{N} \in \mathbb{R}^m$, where m is the number of features.

2.1 Lexical Features

We first focus on the characteristics of each word in the question, such as word lengths and the number of special words. More specifically:

– Word Length: the number of words of each length in [1, 15], and the number of words whose length is ≥ 16.
– Special Words: the number of special words of three kinds: all upper-case words, words containing a hyphen, and stop words.

Most importantly, we use the information entropy I(N) to measure the uncertainty of a question:

$$I(N) = -\sum_{i=1}^{n} p(w_i) \log_2 p(w_i) \qquad (1)$$

where $w_i \in N$ and $p(w_i)$ is the probability of $w_i$ appearing in the corpus.

2.2 Part of Speech Features

When the RDF Q/A system translates a natural language question into a SPARQL query, the part of speech of a word can determine whether the word participates in the construction of the SPARQL query graph. For example, nouns, verbs, and adjectives in questions are important components of the SPARQL query graph. Therefore, we apply the Stanford POS tagger to obtain the part of speech of each word contained in N. We collect the number of occurrences of each part of speech as the part of speech features of a given question. In addition, we insert the number of words at the beginning of the vector.

2.3 Dependency Relation Features

The above two kinds of features mainly express the characteristics of the individual words in a question. In this subsection, we emphasize the relationships between different words. We collect the number of occurrences of each dependency relation as the dependency relation features. Note that we also insert the height of the dependency tree at the beginning of the vector.

2.4 Hybrid Features

We build hybrid features by selecting the most predictive features from the part of speech features and dependency relation features.

Definition 1 (Triple). Let $T = \langle p_i, d, p_j \rangle$, where $p_i$ and $p_j$ are part of speech features and $d$ is a dependency relation feature between $p_i$ and $p_j$. For example, there is a triple $\langle \mathrm{WP}, \mathrm{nsubj}, \mathrm{VBD} \rangle$ in Figure 1.

We use T as our hybrid features, which represent the structural characteristics of questions. A synthetic feature vector example is shown below.
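As a minimal sketch, the lexical features of Section 2.1 could be computed as follows in Python. The function names, the `STOP_WORDS` set, and the `corpus_prob` dictionary are hypothetical and not taken from the paper; the paper does not state which stop-word list or corpus it uses. The entropy is summed over the distinct words of the question, which is an assumption based on W being described as a set.

```python
import math

# Hypothetical stop-word list; the paper does not say which list it uses.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "who"}


def word_length_counts(words, max_len=15):
    """Return max_len + 1 counts: words of length 1..max_len, then longer words."""
    counts = [0] * (max_len + 1)
    for w in words:
        if not w:
            continue  # ignore empty tokens
        # Lengths above max_len all share the last bucket.
        counts[min(len(w), max_len + 1) - 1] += 1
    return counts


def special_word_counts(words):
    """Count all-upper-case words, words containing a hyphen, and stop words."""
    upper = sum(1 for w in words if w.isupper())
    hyphen = sum(1 for w in words if "-" in w)
    stop = sum(1 for w in words if w.lower() in STOP_WORDS)
    return [upper, hyphen, stop]


def entropy(words, corpus_prob):
    """Equation (1) over distinct words; words with unknown probability are skipped."""
    total = 0.0
    for w in {w.lower() for w in words}:
        p = corpus_prob.get(w, 0.0)
        if p > 0.0:
            total -= p * math.log2(p)
    return total


def lexical_features(words, corpus_prob):
    """Concatenate entropy, word-length counts, and special-word counts."""
    return [entropy(words, corpus_prob)] + word_length_counts(words) + special_word_counts(words)


question = "Who acted in The Green Mile and Forrest Gump".split()
print(word_length_counts(question)[:4])
```

For the question of Figure 1, the first four word-length counts come out as 0, 1, 3, 2, which agrees with the word-length columns of the paper's example vector. The entropy value depends entirely on the corpus probabilities supplied, so this sketch does not attempt to reproduce the paper's value.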
[Figure: dependency tree of the question "Who acted in The Green Mile and Forrest Gump", with POS tags Who/WP acted/VBD in/IN The/DT Green/NNP Mile/NNP and/CC Forrest/NNP Gump/NNP and dependency labels nsubj, nmod, case, det, compound, compound, cc, conj.]
Fig. 1. A Dependency Tree of the Question

Example feature vector for the question in Fig. 1:

Lexical Features:              I(N) = 0.64; word-length counts 1: 0, 2: 1, 3: 3, 4: 2, ..., 15: 0; SW: 0
Part of Speech Features:       Num: 9; WP: 1; VBD: 1; NNP: 4; JJ: 0; ...
Dependency Relation Features:  Height: 3; nsubj: 1; nmod: 1; det: 1; conj: 1; ...

3 Prediction Model

Support vector regression (SVR) finds the best regression function by selecting the hyperplane that maximizes the margin. The problem is formulated as an optimization problem:

$$\min\; w^T w, \quad \text{s.t.}\;\; y_i (w^T x_i + b) \ge 1 - \xi,\;\; \xi \ge 0 \qquad (2)$$

An advantage of SVR is its insensitivity to outliers.

4 Parallel RDF Q/A

[Figure: a set of questions N1, N2, ..., Nn is passed through the overhead prediction model and distributed over processors 1, 2, ..., m of a server whose maximum parallelism is m; the processors feed the RDF Q/A system.]
Fig. 2. Framework of Parallel-Batching

Our system's task is to predict the overhead of translating N questions into N SPARQL queries, and then to distribute the N questions over M processors according to their predicted overhead. To achieve this goal, we design Algorithm 1 to minimize the loss function in Formula 3:

$$\mathrm{Loss} = \min(\max(M_1, M_2, \ldots, M_M)) \qquad (3)$$

where $M_i$ is the total overhead of all questions assigned to the i-th processor.

5 Experiments

We use QALD to verify the effectiveness of our parallel-batching RDF Q/A system. Four experiments, with a parallelism of 2, 4, 6, and 8, are shown in the following four figures. Each experiment includes ten groups of question translation tests, with 10 questions in each group. In each experiment, we compare the performance of our approach with three other methods: dividing the ten questions among the M processors according to the number of questions, dividing them according to the number of words, and serial execution. The experiments show that our prediction model is accurate and that our parallel RDF Q/A system achieves highly parallel question translation on a single server.
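The load-balancing objective of Section 4 (Formula 3) can be illustrated with a short Python sketch. The paper's Algorithm 1 is not reproduced in this text, so the code below swaps in a standard greedy heuristic, longest-processing-time-first, which assigns each question in decreasing order of predicted overhead to the currently least-loaded processor. The function name `assign_batches` and the sample overheads are hypothetical.

```python
import heapq


def assign_batches(predicted_ms, num_processors):
    """Greedy longest-processing-time-first assignment (not the paper's Algorithm 1).

    predicted_ms[q] is the predicted translation overhead of question q.
    Returns (batches, loads): batches[i] lists the question indices given to
    processor i, and loads[i] is their total predicted overhead (M_i in Formula 3).
    """
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")

    batches = [[] for _ in range(num_processors)]
    # Min-heap of (current load, processor index): the root is the least-loaded processor.
    heap = [(0.0, i) for i in range(num_processors)]
    heapq.heapify(heap)

    # Place the most expensive questions first.
    order = sorted(range(len(predicted_ms)), key=lambda q: predicted_ms[q], reverse=True)
    for q in order:
        load, i = heapq.heappop(heap)
        batches[i].append(q)
        heapq.heappush(heap, (load + predicted_ms[q], i))

    loads = [sum(predicted_ms[q] for q in batch) for batch in batches]
    return batches, loads


# Hypothetical predicted overheads in milliseconds for six questions.
predicted = [900, 50, 400, 300, 250, 100]
batches, loads = assign_batches(predicted, num_processors=2)
print(batches, loads, max(loads))
```

The quantity `max(loads)` corresponds to the inner term of Formula 3. The greedy heuristic does not guarantee the minimum in general, but it is a common baseline for this kind of makespan minimization and shows why accurate per-question predictions matter: the assignment is only as good as the overhead estimates it sorts by.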
QALD: http://qald.aksw.org/

[Figure: four plots of Total Execution Time (ms) for question groups 1 to 10, comparing Our Approach, Number of Questions, Number of Words, and Serial.]
Fig. 3. Efficiency Evaluation

Acknowledgments

This work is supported by the National Key Research and Development Program of China (2017YFC0908401) and the National Natural Science Foundation of China (61972455, 61672377). Xiaowang Zhang is supported by the Peiyang Young Scholars in Tianjin University (2019XRX-0032).