<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Joint Model of Entity Linking and Predicate Recognition for Knowledge Base Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yang Li</string-name>
          <email>liyang54@lenovo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qingliang Miao</string-name>
          <email>miaoql1@lenovo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>ChenXin Yin</string-name>
          <email>yincx1@lenovo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chao Huo</string-name>
          <email>huochao2@lenovo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenxiang Mao</string-name>
          <email>maowenxaing612@163.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Changjian Hu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Feiyu Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Building H</institution>
          ,
          <addr-line>No.6, West Shangdi Road, Haidian District Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we build a QA system that automatically finds answers to questions in a Chinese knowledge base. We first identify all possible topic entities in the knowledge base for a question. Predicate features are then used to pre-rank all candidate triple paths of the topic entities with a logistic regression model. Next, a jointly trained entity linking and predicate recognition model re-ranks the candidate triple paths for the question. Finally, we select the answer component from the matched triple path using heuristic rules. Our approach achieved an average F1-score of 57.67% on the test data and took second place in the CCKS 2018 COQA task.</p>
      </abstract>
      <kwd-group>
        <kwd>KBQA</kwd>
        <kwd>Entity Linking</kwd>
        <kwd>Predicate Recognition</kwd>
        <kwd>Semantic Matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>In this paper, we introduce a system that automatically answers open-domain factoid questions in
Chinese. Our method first recognizes topic entities. One-hop and two-hop triple paths are then
selected and pre-ranked for these topic entities. Next, we use the semantic matching model
BiMPM [1] to train a joint model for entity linking and predicate recognition that re-ranks the
candidate triple paths. Finally, the answer component is selected from the matched candidate
triple path using heuristic rules. Our analysis of the pre-processed training data shows that
questions with only one-hop or two-hop candidate triple paths account for 90.02% of the data, so
this paper focuses on these questions.</p>
      <p>Open-domain KBQA is an important task in the field of natural language processing.
There are two mainstream approaches: semantic parsing based and retrieval based.</p>
      <p>The semantic parsing based method first parses the question into a logical form, a
semantic tree that explicitly represents the meaning of the question in a compositional
manner, and then executes the logical form against the knowledge base to obtain the
answer [2]. The logical form helps to capture the semantic structure of a question, but
constructing it also increases the difficulty of the task. Among retrieval based methods,
Lai et al. [3] use word embedding based features to search for the best subject-predicate pair
and, with additional rules, obtained first place in the NLPCC 2016 KBQA task. Lai et al. [4]
proposed a method based on deep CNNs to re-rank the entity-predicate pairs generated by
shallow features; this approach obtained first place in the NLPCC 2017 KBQA task. Hu et al.
[5] proposed a dynamic query graph matching method that handles the disambiguation of
entities and relations from a data-driven perspective. In our work, we also use a
retrieval based approach.</p>
    </sec>
    <sec id="sec-2">
      <title>The Proposed System</title>
      <p>The architecture of our system is shown in Fig. 1. The first step is pre-processing,
which performs word segmentation on the input question. Based on the pre-processing results,
the Topic Entity Recognition module first recognizes topic entity mentions and then links
them to the knowledge base. In the Predicate Recognition module, we pre-rank candidate triple
paths with several features. BiMPM is then used to select the matched triple paths.
Finally, we select the answer from the matched candidate triple path.</p>
      <p>Entity Mention Recognition: If a segmented word of the question appears in the
segmentation dictionary, the word is regarded as an entity mention. The recognized entity
mentions have different probabilities of being a topic mention, which we estimate with the
following features:</p>
      <p>F1: The Length of Entity Mention. An entity mention with a longer string is more
likely to be a topic entity than a shorter one.</p>
      <p>F2: The TF Value of Entity Mention. An entity mention with a high term
frequency (TF) value is less likely to be a topic entity than one with a lower TF value.</p>
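      <p>As a minimal sketch of the dictionary lookup and of features F1 and F2, assuming the question is already segmented and that term frequencies come from some reference corpus: the names recognize_mentions, mention_features, segmentation_dict and term_freq are hypothetical, the data are made up, and the paper does not state how TF values are computed or normalized.</p>
      <preformat>
```python
from typing import Dict, List, Set


def recognize_mentions(words: List[str], segmentation_dict: Set[str]) -> List[str]:
    """Keep every segmented word that occurs in the segmentation dictionary."""
    return [w for w in words if w in segmentation_dict]


def mention_features(mention: str, term_freq: Dict[str, float]) -> Dict[str, float]:
    """Raw (un-normalized) F1 and F2 values for one entity mention."""
    return {
        "F1_length": float(len(mention)),      # longer mentions are favored
        "F2_tf": term_freq.get(mention, 0.0),  # frequent words are disfavored
    }


# Toy example with made-up data (not taken from the CCKS data set).
words = ["姚明", "的", "妻子", "是", "谁"]
segmentation_dict = {"姚明", "妻子"}
term_freq = {"姚明": 0.001, "妻子": 0.02}

mentions = recognize_mentions(words, segmentation_dict)
features = {m: mention_features(m, term_freq) for m in mentions}
```
      </preformat>
      <p>Both feature values would still need to be normalized before they are combined with the other features.</p>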
      <sec id="sec-2-1">
        <title>F3: The Distance Between the Entity Mention and Interrogative Word</title>
        <p>Entity mentions close to the interrogative word in the question are more likely to be topic entities.</p>
        <p>Entity Linking: The entity mentions recognized in the previous step are not yet entities in the
knowledge base, so this step determines which knowledge base entity each mention in the
question refers to. The relations and properties of an entity are helpful for entity linking,
so we first extract the two-hop sub-graph of each candidate entity. Based on the selected candidate
entity mentions, we use the three features below to rank and select the matched topic entity.</p>
      </sec>
      <sec id="sec-2-2">
        <title>F4: Word Overlap Between Question and Triple Paths</title>
        <p>The more words the question shares with the candidate entity’s two-hop sub-graph, the more
likely the entity mention is a topic entity.</p>
      </sec>
      <sec id="sec-2-3">
        <title>F5: Word Embedding Similarity Between Question and Triple Paths</title>
        <p>The larger the similarity between the question and the candidate entity’s two-hop sub-graph,
the more likely the entity mention is a topic entity.</p>
      </sec>
      <sec id="sec-2-4">
        <title>F6: Char Overlap Between Question and Triple Paths</title>
        <p>This feature is similar to F4. The only difference is that it is computed at the character level instead of the word level.</p>
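        <p>The overlap and similarity features can be sketched as follows. This is an illustration under stated assumptions, since the paper does not give exact definitions: overlap is taken here as the number of distinct shared units, and embedding similarity as the cosine between averaged word vectors. All function names, words and vectors are made up.</p>
        <preformat>
```python
import math
from typing import Dict, List, Sequence


def overlap(question_units: Sequence[str], path_units: Sequence[str]) -> int:
    """Number of distinct units (words or characters) shared by both sides."""
    return len(set(question_units).intersection(path_units))


def mean_vector(words: List[str], emb: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the embeddings of the words that have one; zeros if none do."""
    vecs = [emb[w] for w in words if w in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


# Toy example with made-up words and 2-dimensional embeddings.
q_words = ["姚明", "妻子", "谁"]
p_words = ["姚明", "配偶", "叶莉"]
emb = {"姚明": [1.0, 0.0], "妻子": [0.0, 1.0], "配偶": [0.0, 1.0]}

f4 = overlap(q_words, p_words)                    # word level (F4)
f6 = overlap("".join(q_words), "".join(p_words))  # character level (F6)
f5 = cosine(mean_vector(q_words, emb, 2), mean_vector(p_words, emb, 2))  # F5
```
        </preformat>
        <p>The same helpers apply unchanged when the right-hand side is a list of predicates rather than a two-hop sub-graph.</p>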
        <p>After calculating and normalizing all features, a linear weighting method is used
to rank candidate entities. The score is defined by the equation below, where w<sub>i</sub>
indicates the weight of feature F<sub>i</sub>.</p>
        <p><disp-formula><tex-math>\mathrm{Score}_{\mathrm{topic\,entity}} = w_1 F_1 + w_2 F_2 + w_3 F_3 + w_4 F_4 + w_5 F_5 + w_6 F_6</tex-math></disp-formula></p>
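        <p>A minimal sketch of this weighted ranking, using the six weights reported for the full model in the topic entity recognition results. The function names and the feature vectors are hypothetical, and the features are assumed to be already normalized.</p>
        <preformat>
```python
from typing import Dict, List

# Weights w1..w6 reported in the experiments for the full feature set.
WEIGHTS = [0.25, 0.37, -0.32, 0.67, 0.71, 0.58]


def topic_entity_score(features: List[float], weights: List[float] = WEIGHTS) -> float:
    """Weighted sum of the six normalized features F1..F6."""
    if len(features) != len(weights):
        raise ValueError("expected one feature value per weight")
    return sum(w * f for w, f in zip(weights, features))


def rank_entities(candidates: Dict[str, List[float]]) -> List[str]:
    """Candidate names sorted from highest to lowest score."""
    return sorted(candidates, key=lambda name: topic_entity_score(candidates[name]),
                  reverse=True)


# Made-up normalized feature vectors [F1, ..., F6] for two candidates.
candidates = {
    "entity_a": [0.9, 0.1, 0.2, 0.8, 0.7, 0.9],
    "entity_b": [0.4, 0.6, 0.5, 0.2, 0.3, 0.1],
}
ranking = rank_entities(candidates)
```
        </preformat>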
      </sec>
      <sec id="sec-2-5">
        <title>Predicate Recognition</title>
        <p>On average, about 349.6 candidate triple paths are extracted for a topic entity. It is
difficult to select the best match from such a large number of candidates, so narrowing down
the candidate triple paths is an important step for improving the final result. In this module,
we first extract four features of the predicates of each triple path. A logistic regression
model then pre-ranks the candidate triple paths using these four features together with the
topic entity recognition features. Finally, we select the top 10 triple paths as candidates for
the subsequent semantic matching module.</p>
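        <p>The pre-ranking step could look roughly like the sketch below. It is a pure-Python illustration and not the authors' implementation: logistic regression is fitted with plain batch gradient descent, the learning rate and epoch count are arbitrary assumptions, and the toy feature vectors have two dimensions instead of the ten features used in the paper.</p>
        <preformat>
```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs: Sequence[Sequence[float]], ys: Sequence[int],
                   lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    """Fit weights and bias with batch gradient descent on the log loss."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b


def pre_rank(paths: Sequence[str], feats: Sequence[Sequence[float]],
             w: Sequence[float], b: float, k: int = 10) -> List[str]:
    """Return the k paths with the highest predicted match probability."""
    scored = [(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b), p)
              for p, x in zip(paths, feats)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [p for _, p in scored[:k]]


# Made-up training data: label 1 marks a correct triple path.
xs = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
ys = [1, 1, 0, 0]
w, b = train_logistic(xs, ys)

top = pre_rank(["p1", "p2", "p3"], [[0.2, 0.2], [0.9, 0.9], [0.5, 0.5]], w, b, k=2)
```
        </preformat>
        <p>With k set to 10, the returned list corresponds to the candidates passed on to the semantic matching module.</p>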
      </sec>
      <sec id="sec-2-6">
        <title>F7: Word Overlap Between Question and Predicates</title>
        <p>The more words the question shares with the candidate predicates of a triple
path, the more likely the candidate predicates are the true predicates.</p>
      </sec>
      <sec id="sec-2-7">
        <title>F8: Word Embedding Similarity Between Question and Predicate</title>
        <p>The larger the similarity between the question and the candidate predicates, the more likely
the candidate predicates are the true predicates.</p>
      </sec>
      <sec id="sec-2-8">
        <title>F9: Char Overlap Between Question and Predicates</title>
        <p>This feature is almost the same as F7. The only difference is that it is computed at the
character level instead of the word level.</p>
      </sec>
      <sec id="sec-2-9">
        <title>F10: Char Embedding Similarity Between Question and Predicates</title>
        <p>This feature is almost the same as F8. The only difference is that it is computed at the
character level instead of the word level.</p>
      </sec>
      <sec id="sec-2-10">
        <title>Semantic Matching</title>
        <p>Problem Formalization: The goal of this module is to identify the triple path TP<sub>i</sub>
among the n candidate triple paths {TP<sub>1</sub>, TP<sub>2</sub>, ..., TP<sub>n</sub>} that best
matches Q, where Q is the user's question and TP<sub>i</sub> is a candidate triple path of Q. We
use a pairwise scoring function S(TP<sub>i</sub>, Q) to score and sort all candidate triple paths.
In this paper, n is 10.</p>
        <p>BiMPM+Fea: In this section, we present an innovative solution that incorporates word
embeddings and all ten features into BiMPM to select the best-matched triple path.
BiMPM+Fea contains five core layers.</p>
        <p>(1) Word Representation Layer: The goal of this layer is to represent each word in the
question and the triple path with a d-dimensional vector. The word embeddings in this paper
are pre-trained with Gensim [6], and d is 100.</p>
        <p>(2) Context Representation Layer: The purpose of this layer is to incorporate
contextual information into the representation of each time step of the question and the
triple path. We use a BiLSTM to encode contextual embeddings for each time step.</p>
        <p>(3) Matching Layer: This is the core layer. It computes the similarity between the
question and the triple path at each time step. The matching is bi-directional: the question
is matched against the triple path and the triple path against the question, so that matching
information is obtained from both directions.</p>
        <p>(4) Aggregation Layer: This layer aggregates the matching information of the question and
the triple path into a fixed-length vector. It is composed of a BiLSTM, and we use the final
hidden state to represent the aggregated information.</p>
        <p>(5) Feature Aggregation Layer: This layer concatenates the fixed-length vector from
the previous layer with our 10 extracted features.</p>
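        <p>To illustrate layer (5), the sketch below concatenates an aggregation-layer vector with the ten features. Plain Python lists stand in for tensors, and all values are made up. The paper does not describe how the joint vector is turned into the score S(TP<sub>i</sub>, Q), so the logistic output layer shown here (match_score) is purely an assumption added for illustration.</p>
        <preformat>
```python
import math
from typing import List, Sequence


def feature_aggregation(aggregated: Sequence[float], features: Sequence[float]) -> List[float]:
    """Layer (5): concatenate the aggregation-layer vector with the 10 features."""
    if len(features) != 10:
        raise ValueError("expected the ten hand-crafted features F1..F10")
    return list(aggregated) + list(features)


def match_score(joint: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical output layer: a logistic score on the joint vector."""
    z = sum(w * x for w, x in zip(weights, joint)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Made-up values: a 4-dimensional aggregated vector plus ten feature values.
aggregated = [0.2, -0.1, 0.4, 0.3]
features = [0.5] * 10
joint = feature_aggregation(aggregated, features)
score = match_score(joint, [0.1] * len(joint), 0.0)
```
        </preformat>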
      </sec>
      <sec id="sec-2-11">
        <title>Answers Selection</title>
        <p>Matched triple paths are selected in the semantic matching module. We then generate the
answer using heuristic rules. Fig. 2 shows examples of our heuristic rules. In the figure, a
circle or rectangle node simply represents an entity or an attribute value; the node type does
not affect the rules for selecting the answer. The blue node is the answer.</p>
        <p>(1) One-hop Triple Path: In this situation, the answer is the component of the triple path
that does not appear in the question.</p>
        <p>(2) Two-hop Triple Path: If neither the rightmost node nor the leftmost node of the triple
path appears in the question, the middle node is the answer. If either the rightmost node or
the leftmost node appears in the question, the other one is the answer.</p>
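        <p>The two rules can be sketched as a small function. Several details are assumptions, since the paper gives the rules only in prose: a path is represented by its nodes alone (predicates are left out), "appears in the question" is implemented as a substring test, and cases the rules do not cover fall back as noted in the comments. The name select_answer and the example data are made up.</p>
        <preformat>
```python
from typing import Sequence


def select_answer(path_nodes: Sequence[str], question: str) -> str:
    """Pick the answer node of a matched triple path.

    path_nodes holds only the nodes, left to right: two nodes for a
    one-hop path, three nodes for a two-hop path.
    """
    if len(path_nodes) == 2:
        # One-hop: return the node the question does not mention.
        # If the left node is not mentioned, it is returned.
        left, right = path_nodes
        return right if left in question else left
    if len(path_nodes) == 3:
        left, middle, right = path_nodes
        if left in question:
            return right  # also the fallback when both end nodes appear
        if right in question:
            return left
        # Neither end node is mentioned: the middle node is the answer.
        return middle
    raise ValueError("only one-hop and two-hop paths are supported")


# Toy examples with made-up paths.
one_hop = select_answer(["姚明", "叶莉"], "姚明的妻子是谁")
two_hop_end = select_answer(["姚明", "叶莉", "上海"], "姚明的妻子出生在哪里")
two_hop_mid = select_answer(["北京", "节点", "上海"], "这个问题没有提到两端的节点")
```
        </preformat>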
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Discussion</title>
      <p>We evaluate our approach on the CCKS data. The data set was published for the CCKS
2018 evaluation task and includes a knowledge base, a knowledge entity mention file,
and question-answer pairs for training, validation and testing. The knowledge base has
41 million triples. The 2018-Train, 2018-Val and 2018-Test sets contain 1283, 400 and 400
samples respectively. To obtain negative samples for training, we select the top 10 wrong
candidate triple paths for each question. To alleviate the impact of unbalanced training
data, we oversample the positive samples.</p>
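      <p>One way to build such training pairs is sketched below. The paper does not say how the "top 10 wrong" paths are ranked or how strongly positives are oversampled; this sketch assumes the candidates are sorted best-first by the pre-ranking step and simply repeats the positive pair a fixed number of times. The function name and the default of 10 copies are hypothetical.</p>
      <preformat>
```python
from typing import List, Tuple

Example = Tuple[str, str, int]  # (question, triple path, label)


def build_training_pairs(question: str, gold_path: str, ranked_paths: List[str],
                         num_negatives: int = 10, positive_copies: int = 10) -> List[Example]:
    """Pair a question with its gold path and its top-ranked wrong paths.

    ranked_paths is assumed to be sorted best-first by the pre-ranking step.
    The gold pair is repeated positive_copies times to balance the classes.
    """
    negatives = [p for p in ranked_paths if p != gold_path][:num_negatives]
    examples = [(question, gold_path, 1)] * positive_copies
    examples += [(question, p, 0) for p in negatives]
    return examples


# Made-up candidate list p0..p14 in which p3 is the gold path.
ranked_paths = ["p%d" % i for i in range(15)]
pairs = build_training_pairs("q", "p3", ranked_paths)
```
      </preformat>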
      <sec id="sec-3-1">
        <title>Topic Entity Recognition Result</title>
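        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Pre@1 of the topic entity recognition module.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th><th>2018-Val</th><th>2018-Test</th></tr>
            </thead>
            <tbody>
              <tr><td>baseline<sub>TE</sub></td><td>92.58%</td><td>90.79%</td></tr>
              <tr><td>baseline<sub>TE</sub>+Emb</td><td>96.29%</td><td>93.28%</td></tr>
              <tr><td>baseline<sub>TE</sub>+Emb+Char</td><td>98.58%</td><td>95.35%</td></tr>
            </tbody>
          </table>
        </table-wrap>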
        <p>Table 1 shows the performance of the topic entity recognition module. The basic
model, baseline<sub>TE</sub>, uses only F1, F2, F3 and F4. The second model, baseline<sub>TE</sub>+Emb,
additionally uses the embedding feature F5. The last one, baseline<sub>TE</sub>+Emb+Char,
uses both the embedding feature F5 and the character-level feature F6. Table 1 shows
that the embedding feature F5 and the character-level feature F6 each improve the Pre@1 of
topic entity recognition. The hyper-parameters w<sub>i</sub> of baseline<sub>TE</sub>+Emb+Char are [0.25,
0.37, -0.32, 0.67, 0.71, 0.58].</p>
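        <p>The paper does not define Pre@1 explicitly. Assuming the usual meaning (the fraction of questions whose top-ranked candidate is the gold topic entity), it could be computed as in this sketch, where the function name and the data are made up.</p>
        <preformat>
```python
from typing import Sequence


def precision_at_1(ranked: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Fraction of questions whose top-ranked candidate is the gold entity."""
    if len(ranked) != len(gold):
        raise ValueError("need one ranked list per gold entity")
    hits = sum(1 for cands, g in zip(ranked, gold) if cands and cands[0] == g)
    return hits / len(gold)


# Made-up rankings for four questions; the last one has no candidates.
ranked = [["e1", "e2"], ["e3", "e1"], ["e2"], []]
gold = ["e1", "e1", "e2", "e4"]
pre_at_1 = precision_at_1(ranked, gold)
```
        </preformat>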
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we present a joint model of entity linking and predicate recognition for
KBQA. The system achieves an F1-score of 57.67% on the CCKS 2018 COQA task. In
future work, we plan to extend our approach to alleviate the unseen-predicate issue.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamza</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Florian</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Bilateral Multi-Perspective Matching for Natural Language Sentences</article-title>
          .
          <source>arXiv preprint arXiv:1702.03814</source>
          , (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>An Information Retrieval-Based Approach to Table-Based Question Answering</article-title>
          .
          <source>In: 6th National CCF Conference on Natural Language Processing and Chinese Computing</source>
          , pp.
          <fpage>601</fpage>
          -
          <lpage>611</lpage>
          . Springer, Cham (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Open domain question answering system based on knowledge base</article-title>
          .
          <source>In: 5th National CCF Conference on Natural Language Processing and Chinese Computing</source>
          , pp.
          <fpage>722</fpage>
          -
          <lpage>733</lpage>
          . Springer, Cham (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>A Chinese Question Answering System for Single-Relation Factoid Questions</article-title>
          .
          <source>In: 6th National CCF Conference on Natural Language Processing and Chinese Computing</source>
          , pp.
          <fpage>124</fpage>
          -
          <lpage>135</lpage>
          . Springer, Cham (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>J. X.</given-names>
          </string-name>
          :
          <article-title>Answering Natural Language Questions by Subgraph Matching over Knowledge Graphs</article-title>
          .
          <source>IEEE Transactions on Knowledge &amp; Data Engineering</source>
          ,
          <year>2018</year>
          :
          <fpage>824</fpage>
          -
          <lpage>837</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Řehůřek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          <source>In: 7th Proceedings of the LREC Workshop on New Challenges for NLP Frameworks</source>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          . ELRA, Valletta, Malta
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>