<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning-based Translation Performance Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shujun Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jie Jiao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mingyu Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaowang Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiyong Feng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Intelligence and Computing, Tianjin University</institution>
          ,
          <addr-line>Tianjin 300350</addr-line>
          ,
          <country country="CN">China</country>
          ;
          <institution>Tianjin Key Laboratory of Cognitive Computing and Application</institution>
          ,
          <addr-line>Tianjin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>RDF question answering (Q/A) translates natural language questions into SPARQL queries by employing question translation. One of the challenges of RDF Q/A is predicting the performance of questions before they are translated. Performance characteristics, such as the translation time, can help data consumers identify unexpectedly long-running questions before they start and estimate the system workload for scheduling. In this paper, we adopt machine learning techniques to predict the performance of question translation in RDF Q/A. Our work focuses on modeling the features of a question as a vector representation. Our feature modeling method does not depend on knowledge of the underlying systems or the structure of the underlying data, but only on the nature of the questions. We then use these features to train prediction models. Finally, based on this model, we design a single-machine parallel-batching RDF Q/A application. Evaluations are performed on real-world questions whose translation times range from milliseconds to minutes. The results demonstrate that our approach can effectively predict question translation performance.</p>
      </abstract>
      <kwd-group>
        <kwd>RDF</kwd>
        <kwd>Question Answering</kwd>
        <kwd>Performance Prediction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>RDF Q/A allows users to ask questions in natural language over a knowledge
base represented in RDF. Hence, it has received extensive attention in both the
natural language processing and database communities. The core task of RDF Q/A is
to translate natural language questions into SPARQL queries. Predicting the cost of question
translation can benefit many system management decisions. The challenge in
our work centers on capturing the characteristics of questions and representing those
characteristics as features for the application of machine learning techniques.</p>
      <p>The main contributions of this work are summarized as follows:
- We propose four ways to model the features of a question. The lexical features,
part of speech features, and dependency relation features can be acquired
from the question's dependency tree. The hybrid features can be derived from
the part of speech features and dependency relation features. All features can be easily
obtained without any information provided by the underlying systems.
- The RDF Q/A system we use is one of the most widely used systems in the
Semantic Web community. Thus our work will benefit a large population
of users.</p>
      <p>With the decline of computer hardware costs, the parallelism of computers
has gradually increased. Based on the prediction algorithm proposed above, we
design a single-machine, highly parallel RDF Q/A application to implement the
specific query transformation process.</p>
    </sec>
    <sec id="sec-2">
      <title>Feature Modeling</title>
      <p>We formulate the problem as follows: let N = (W, P, T) denote a question,
where W is the set of words contained in N, P is the set of posTags, and T is
the dependency tree of N. Feature modeling is the transformer that maps N to a
feature vector N ∈ R<sup>m</sup>, where m is the number of features.</p>
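      <p>As one concrete, purely illustrative reading of this formulation, the question tuple and the mapping into R<sup>m</sup> can be sketched in Python; the names Question and to_vector are ours, not the paper's:</p>

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Question:
    """N = (W, P, T): the words, posTags, and dependency tree of a question."""
    words: List[str]
    pos_tags: List[str]
    tree: List[Tuple[int, str, int]]  # (head, relation, dependent) edges

def to_vector(n: Question,
              extractors: List[Callable[[Question], List[float]]]) -> List[float]:
    """Feature modeling: map N to a vector in R^m by concatenating
    the feature groups of Sections 2.1-2.4."""
    vec: List[float] = []
    for extract in extractors:
        vec.extend(extract(n))
    return vec
```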
      <p>Lexical Features
We rst focus on each word's characteristics in the question, such as their lengths,
and the number of special words. More speci cally,
{ Word Length: the number of words whose length belongs to [1; 15], and the
number of words whose length is 16.
{ Special Words : We detect the number of three kinds of special wordsfall
upper-case, contains a hyphen and stop wordg</p>
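      <p>A minimal sketch of this lexical feature extraction, assuming 15 length buckets plus one bucket for length ≥ 16 and an illustrative stop-word list:</p>

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "do", "does"}  # illustrative subset

def lexical_features(words):
    """Lexical feature vector: 16 word-length buckets (lengths 1..15,
    plus one bucket for length >= 16) followed by counts of the three
    kinds of special words."""
    length_buckets = [0] * 16
    for w in words:
        length_buckets[min(len(w), 16) - 1] += 1
    all_upper = sum(1 for w in words if w.isupper())
    hyphenated = sum(1 for w in words if "-" in w)
    stops = sum(1 for w in words if w.lower() in STOP_WORDS)
    return length_buckets + [all_upper, hyphenated, stops]
```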
      <p>Most importantly, we use information entropy I(N) to measure the
uncertainty of a question:</p>
      <p>I(N) = −∑<sub>i=1</sub><sup>n</sup> p(w<sub>i</sub>) log<sub>2</sub> p(w<sub>i</sub>)   (1)</p>
      <p>where w<sub>i</sub> ∈ N and p(w<sub>i</sub>) refers to the probability of w<sub>i</sub> appearing in the corpus.</p>
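      <p>Formula (1) can be computed directly from corpus frequencies; the corpus argument below is a stand-in for whatever corpus the probabilities p(w<sub>i</sub>) are estimated from:</p>

```python
import math
from collections import Counter

def question_entropy(words, corpus):
    """Information entropy I(N) = -sum_i p(w_i) * log2 p(w_i),
    where p(w_i) is the probability of w_i appearing in the corpus."""
    freq = Counter(corpus)
    total = sum(freq.values())
    entropy = 0.0
    for w in set(words):
        p = freq[w] / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy
```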
      <p>Part of Speech Features.
In the process of translating natural language questions into SPARQL queries in
the RDF Q/A system, the part of speech of a word can determine whether the
word participates in the construction of the SPARQL query graph. For
example, nouns, verbs, and adjectives in questions are important components of the
SPARQL query graph. Therefore, in our work, we apply the Stanford POS tagger to
obtain the part of speech of each word contained in N. We collect the number of
occurrences of each different part of speech as the part of speech features of a given
question. Besides, we further insert the number of words at the beginning of the vector.</p>
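      <p>The part of speech feature vector can be sketched as follows; the tag inventory shown is an illustrative Penn Treebank subset, not the full Stanford tagset:</p>

```python
from collections import Counter

# Illustrative Penn Treebank subset; the paper uses the full Stanford tagset.
POS_TAGS = ["NN", "NNP", "NNS", "VB", "VBD", "VBZ", "JJ", "IN", "DT", "WP", "CC"]

def pos_features(pos_tags):
    """Count each part of speech over the question; the number of words
    is inserted at the beginning of the vector."""
    counts = Counter(pos_tags)
    return [len(pos_tags)] + [counts[t] for t in POS_TAGS]
```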
      <p>Dependency Relation Features.
The above two kinds of features mainly express the characteristics of the individual
words in a question. In this subsection, we emphasize the relationships between
different words.</p>
      <p>In our work, we collect the number of occurrences of each different dependency
relation as the dependency relation features. Note that we further insert the height
of the dependency tree at the beginning of the vector.</p>
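      <p>A sketch of the dependency relation features, assuming the tree is given as (head, relation, dependent) edges rooted at an artificial ROOT node at index 0; the relation list is an illustrative subset:</p>

```python
from collections import Counter

DEP_RELATIONS = ["nsubj", "dobj", "nmod", "det", "compound", "cc", "conj", "case"]  # illustrative subset

def dep_features(edges):
    """edges: (head_index, relation, dependent_index) triples of the
    dependency tree. Counts each relation; the tree height is inserted
    at the beginning of the vector."""
    counts = Counter(rel for _, rel, _ in edges)
    children = {}
    for head, _, dep in edges:
        children.setdefault(head, []).append(dep)

    def height(node):
        return 1 + max((height(c) for c in children.get(node, [])), default=0)

    root = 0  # assume index 0 is the artificial ROOT node
    return [height(root)] + [counts[r] for r in DEP_RELATIONS]
```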
      <p>Hybrid Features.
We build hybrid features by selecting the most predictive features based on the
part of speech features and dependency relation features.</p>
      <p>Definition 1 (Triple). Let T = ⟨p<sub>i</sub>, d, p<sub>j</sub>⟩, where p<sub>i</sub> and p<sub>j</sub> are part of speech
features, and d is a dependency relation feature between p<sub>i</sub> and p<sub>j</sub>. For
example, there is a triple ⟨WP, nsubj, VBD⟩ in Figure 1. T describes the structural
characteristics of questions.</p>
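      <p>Extracting triple counts from a parsed question can be sketched as follows; the `vocabulary` parameter stands in for the set of most predictive triples kept by feature selection:</p>

```python
from collections import Counter

def hybrid_features(pos_tags, edges, vocabulary):
    """Count <pos_head, relation, pos_dependent> triples over the
    dependency edges; `vocabulary` is the list of triples retained
    after feature selection."""
    triples = Counter(
        (pos_tags[head], rel, pos_tags[dep]) for head, rel, dep in edges
    )
    return [triples[t] for t in vocabulary]
```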
      <p>We use T as our hybrid features, which represent the structural
characteristics of questions. A synthetic feature vector example is shown below.</p>
      <p>[Figure 1: dependency tree of the question "Who acted in The Green Mile and Forrest Gump", with posTags (WP, VBD, IN, DT, NNP, CC) and dependency relations (nsubj, nmod, case, det, compound, cc, conj).]</p>
      <p>An advantage of SVR is its insensitivity to outliers.</p>
    </sec>
    <sec id="sec-3">
      <title>Parallel RDF Q/A</title>
      <p>[Figure 2: architecture of the parallel RDF Q/A application. A set of questions N1, N2, ..., Nn is fed to the overhead prediction model, which dispatches them to a server whose maximum parallelism is m.]</p>
      <sec id="sec-3-4">
        <title>RDF Q/A System</title>
        <p>Our system's task is to predict the overhead of translating N questions into N
SPARQL queries and then distribute the N questions across m processors. To
achieve this goal, we design Algorithm 1 to minimize the loss function in
Formula 3.</p>
        <p>Loss = min(max(M<sub>1</sub>, M<sub>2</sub>, ..., M<sub>m</sub>))   (3)</p>
        <p>where M<sub>i</sub> is the total overhead of all questions assigned to the i-th processor.</p>
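        <p>Algorithm 1 is not reproduced here; a standard greedy heuristic for this min-max assignment, placing the question with the longest predicted translation time onto the currently least-loaded processor, can serve as a sketch (an assumption about the algorithm's shape, not the paper's exact procedure):</p>

```python
import heapq

def assign_questions(predicted_costs, m):
    """Greedily assign questions to m processors: longest predicted
    translation time first, each onto the least-loaded processor.
    Approximately minimizes max(M_1, ..., M_m)."""
    loads = [(0.0, i, []) for i in range(m)]  # (M_i, processor index, questions)
    heapq.heapify(loads)
    for q, cost in sorted(enumerate(predicted_costs), key=lambda x: -x[1]):
        load, i, qs = heapq.heappop(loads)
        qs.append(q)
        heapq.heappush(loads, (load + cost, i, qs))
    return loads
```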
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>We use QALD (http://qald.aksw.org/) to verify the effectiveness of our parallel-batching
RDF Q/A system. Four experiments with a parallelism of 2, 4, 6, and 8 are shown in the
following four figures. Each experiment includes ten groups (10 questions in each group)
of question translation tests.</p>
      <p>In each experiment, we compare the performance of our approach with three other
methods: dividing the ten questions among the m processors by the number of
questions, dividing them by the number of words, and serial execution.</p>
      <p>The experiments show that our prediction model is accurate and that our parallel
RDF Q/A system achieves highly parallel question translation on a single server.</p>
      <p>[Figures: translation time per group for the four methods (Our System, Number of questions, Number of words, Serial) over groups 1-10.]</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Key Research and Development Program
of China (2017YFC0908401) and the National Natural Science Foundation of
China (61972455,61672377). Xiaowang Zhang is supported by the Peiyang Young
Scholars in Tianjin University (2019XRX-0032).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Zhang</surname>
            <given-names>W.</given-names>
          </string-name>
          ,
          ,
          <string-name>
            <surname>Qin</surname>
            <given-names>Y.</given-names>
          </string-name>
          , et al:
          <article-title>Learning-based SPARQL query performance modeling and prediction</article-title>
          .
          <source>In Proc. of WWW</source>
          <year>2018</year>
          , pp.
          <fpage>1015</fpage>
          -
          <lpage>1035</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chifu</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laporte</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            <given-names>J.</given-names>
          </string-name>
          , et al:
          <article-title>Query Performance Prediction Focused on Summarized Letor Features</article-title>
          .
          <source>In Proc. of SIGIR</source>
          <year>2018</year>
          , pp.
          <fpage>1177</fpage>
          -
          <lpage>1180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zou</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>H.</given-names>
          </string-name>
          , et al:
          <article-title>Natural language question answering over RDF: a graph data driven approach</article-title>
          .
          <source>In Proc. of SIGMOD</source>
          <year>2014</year>
          , pp.
          <fpage>313</fpage>
          -
          <lpage>324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hu</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zou</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            <given-names>J.</given-names>
          </string-name>
          , et al:
          <article-title>Answering Natural Language Questions by Subgraph Matching over Knowledge Graphs</article-title>
          .
          <source>In Proc. of ICDE</source>
          <year>2018</year>
          , pp.
          <fpage>1815</fpage>
          -
          <lpage>1816</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jiao</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>X.</given-names>
          </string-name>
          , et al:
          <article-title>Multi-Query Optimization in RDF Q/A System</article-title>
          .
          <source>In Proc. of ISWC</source>
          <year>2019</year>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>