<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowledge Based High-Frequency Question Answering in AliMe Chat</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shuangyong Song</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chao Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haiqing Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alibaba Group</institution>
          ,
          <addr-line>Beijing 100102</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In AliMe Chat, our online chatbot service, we design a knowledge graph based approach to high-frequency chitchat question answering. To meet the high questions-per-second (QPS) demand of the online system, we design several solutions that avoid querying a large knowledge graph online. This paper gives the details of those solutions, and the experimental results show their effectiveness and efficiency.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graph</kwd>
        <kwd>E-commerce Chatbot</kwd>
        <kwd>Lucene Index</kwd>
        <kwd>Text Matching</kwd>
        <kwd>Multiple Answers Generation</kwd>
        <kwd>Index of Subgraph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        AliMe Chat, launched by Alibaba in 2015, has served billions of users
and is now accessed by ten million users per day on average [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The AliMe service can be
roughly classified into assistance service, customer service and chatting service, and
the main goal of this paper is to improve the ability of AliMe Chat with a knowledge graph.
      </p>
      <p>
        A seq2seq based re-ranking and generation method was proposed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to
chat with AliMe users on general topics, such as greetings, jokes and other kinds of
chitchat. However, the fact-based and knowledge-based chatting ability of AliMe is still
weak. To improve this ability and, at the same time, increase the
diversity of chat answers, we design a question-answering framework.
      </p>
      <p>
        Since the online service has a very high QPS demand, our framework is
restricted to high-frequency questions and entities in historical user question logs. We design
the following methods: 1) for high-frequency questions, we find which of them can be
answered with the knowledge graph, and the resulting ‘question-answer’ pairs are indexed with
Lucene for online matching and re-ranking; 2) for high-frequency entities, we extract
subgraphs from the complete knowledge graph. Unlike some related work,
which performs this step in real time [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we prepare those subgraphs offline to reduce
online processing. We classify questions containing those entities into three kinds: questions with
an unambiguous entity, questions with an ambiguous entity and questions with
multiple entities. For each kind of question, we design a different answer generation
method.
      </p>
      <p>The remainder of this article describes the proposed
framework in detail and reports the experimental results.</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Framework</title>
      <p>
        Text clustering is used to cluster the user question log, and representative questions in the
top-ranking clusters are extracted as high-frequency questions. In the clustering step,
we use the self-adapting clustering method proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and set a strict threshold
to ensure that the questions in a cluster are very similar to each other. In the
representative question extraction step, we consider three factors: cluster-level keywords, question length and
distance to the cluster center. A question that contains more cluster keywords, has a length closer to the cluster average and
lies nearer to the cluster center is more likely to be chosen as
the representative one.
      </p>
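      <p>The representative question extraction step can be sketched as follows. This is only a minimal illustration and not the deployed implementation: the function name pick_representative, the equal weighting of the three factors, the whitespace tokenization and the use of pre-computed question vectors are all assumptions made for the example.</p>
      <preformat>
```python
from collections import Counter
import math


def pick_representative(questions, vectors, top_keywords=3):
    """Return the question that best balances the three factors in the text.

    questions: list of question strings belonging to one cluster.
    vectors:   list of equal-length numeric vectors, one per question
               (hypothetical embeddings; any representation would do).
    """
    # Cluster-level keywords: the tokens that occur in the most questions.
    token_counts = Counter(tok for q in questions for tok in set(q.lower().split()))
    keywords = {tok for tok, _ in token_counts.most_common(top_keywords)}

    # Cluster center: component-wise mean of the question vectors.
    dim = len(vectors[0])
    center = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    mean_len = sum(len(q.split()) for q in questions) / len(questions)

    def score(question, vector):
        tokens = question.lower().split()
        keyword_cover = len(keywords.intersection(tokens)) / len(keywords)
        length_gap = abs(len(tokens) - mean_len) / mean_len
        center_dist = math.dist(vector, center)
        # Equal weights are an arbitrary, illustrative choice.
        return keyword_cover - length_gap - center_dist

    best = max(zip(questions, vectors), key=lambda pair: score(*pair))
    return best[0]


cluster = [
    "where was joe hisaishi born",
    "where was joe hisaishi born please tell me",
    "joe hisaishi born",
]
embeddings = [[1.0, 0.0], [0.8, 0.2], [1.2, -0.2]]
representative = pick_representative(cluster, embeddings)
```
      </preformat>
      <p>In this toy cluster all three questions contain the same keywords, so the choice is decided by the length and distance terms.</p>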
      <p>
        A classic knowledge graph based question-answering technique [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is used to obtain
an answer for each representative question, and all questions with knowledge graph
based answers are collected into a ‘question-answer’ index built with Lucene. In the
online part, we first use Lucene to coarsely recall the top K candidates and then use a deep
learning based text similarity model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to rank those candidates precisely and obtain the
final answer.
      </p>
      <p>
        Entities with high frequency are extracted from the user question log, and we then
categorize them into unambiguous entities and ambiguous entities. For unambiguous
entities, we can answer questions such as “where was Joe Hisaishi born” easily with the
classic knowledge graph based question-answering technique [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For questions
with ambiguous entities, such as “you know Carlos, right?”, we can reply
with “you mean the Brazilian football player?” or “you mean the Brazilian
football player or Carlos the Jackal?”.
      </p>
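      <p>The difference between the two entity cases can be illustrated with a small lookup over subgraphs prepared offline. The sketch below is hypothetical: the SUBGRAPHS dictionary, its field names and the stored values are placeholders invented for the example, and a real system would use a proper entity linker and knowledge graph store.</p>
      <preformat>
```python
# Hypothetical offline store: surface form -> candidate entities.
# All entries are illustrative placeholders, not data from the paper.
SUBGRAPHS = {
    "joe hisaishi": [
        {
            "label": "Joe Hisaishi",
            "description": "a composer",
            "facts": {"birthplace": "Nagano Prefecture, Japan"},
        },
    ],
    "carlos": [
        {"label": "Carlos", "description": "the Brazilian football player", "facts": {}},
        {"label": "Carlos the Jackal", "description": "Carlos the Jackal", "facts": {}},
    ],
}


def answer_entity_question(mention, relation=None):
    """Answer from a pre-built subgraph, or ask back if the mention is ambiguous."""
    candidates = SUBGRAPHS.get(mention.lower(), [])
    if not candidates:
        return None  # not a high-frequency entity: fall through to other modules
    if len(candidates) == 1:
        entity = candidates[0]
        if relation is None:
            return f"{entity['label']} is {entity['description']}"
        return entity["facts"].get(relation)
    # Ambiguous mention: offer the candidate readings as a clarification question.
    options = " or ".join(c["description"] for c in candidates)
    return f"You mean {options}?"


clarification = answer_entity_question("Carlos")
birthplace = answer_entity_question("Joe Hisaishi", "birthplace")
```
      </preformat>
      <p>Because the subgraphs are built offline, the online step is a dictionary lookup rather than a query against the full knowledge graph.</p>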
      <p>
        For a user question that contains more than one entity, such as
“Who is older, Louis Koo or Andy Lau?”, we draw on the method proposed in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <sec id="sec-3-1">
        <title>Dataset and Parameter Settings</title>
      </sec>
      <sec id="sec-3-2">
        <title>Datasets:</title>
        <p>Question log: we collected the anonymous online user question log from Nov. 1, 2018 to
Dec. 31, 2018. This dataset contains 125.9 million user questions; after merging
duplicates, 44.9 million distinct user questions remain.</p>
        <p>High-frequency questions: 1.26 million questions occur at least 5 times in the log
and are chosen as high-frequency questions (HFQs).</p>
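        <p>The selection of HFQs amounts to counting duplicates and keeping questions that reach the occurrence threshold. A minimal sketch follows; the function name and the normalization (trimming and lower-casing) are assumptions of this example, since the paper does not state how duplicates are merged.</p>
        <preformat>
```python
from collections import Counter


def high_frequency_questions(log, min_count=5):
    """Merge duplicate questions and keep those seen at least min_count times."""
    # Normalization is an assumption; the paper does not specify it.
    counts = Counter(question.strip().lower() for question in log)
    frequent = {q: n for q, n in counts.items() if n >= min_count}
    return frequent, len(counts)


toy_log = ["Who is Andy Lau"] * 5 + ["who is andy lau ", "tell me a joke", "hello"]
hfqs, distinct = high_frequency_questions(toy_log)
```
        </preformat>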
        <p>QA pairs: we input each HFQ into the knowledge-based QA system and, if an answer
is returned, take the ‘HFQ-answer’ pair as a QA pair. In total, we obtained 53,187 QA
pairs.</p>
        <p>High-frequency entities: 25,682 entities occur more than 10 times in the log and are chosen as high-frequency entities (HFEs).</p>
        <p>Subgraphs of entities: we extract all subgraphs of HFEs from Wikipedia.</p>
        <p>Text matching training data: to create enough data for training the text
matching model, we adopt the following strategy. We randomly select 10,000 user
questions from the chatbot log and retrieve the top 15 candidates for each of them
from a Lucene index of the whole question log. Eight service experts then labeled those candidates
as right or wrong, and some examples are shown in Table 1. The labeled data is seriously
unbalanced, since just 14.3% of the candidates are labeled as right
(positive samples). To balance the data, we randomly sample candidates
labeled as wrong, amounting to about 20% of the whole dataset, as negative samples.</p>
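        <p>The balancing step can be sketched as keeping every positive sample and randomly down-sampling the negatives. The sketch assumes that the 20% figure refers to a share of the whole labeled dataset; the function name, the tuple layout and the fixed random seed are choices made for this illustration only.</p>
        <preformat>
```python
import random


def balance(samples, negative_share=0.20, seed=0):
    """Keep all positives and a random subset of negatives.

    samples: list of (question, candidate, is_right) tuples.
    negative_share: number of negatives to keep, as a share of all samples
                    (an interpretation of the paper's "about 20%").
    """
    positives = [s for s in samples if s[2]]
    negatives = [s for s in samples if not s[2]]
    target = min(len(negatives), round(negative_share * len(samples)))
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    return positives + rng.sample(negatives, target)


labeled = [("q", f"c{i}", i >= 86) for i in range(100)]  # 14 right, 86 wrong
training_set = balance(labeled)
```
        </preformat>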
      </sec>
      <sec id="sec-3-3">
        <title>Parameter settings:</title>
        <p>When choosing the top K candidates from the Lucene index, we empirically set K = 20,
which is large enough to recall the correct answer and small enough to be
processed quickly in the text matching step.</p>
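        <p>The online recall-then-rank flow with the parameter K can be sketched as below. Both components are deliberately simple stand-ins: token overlap replaces the Lucene recall step and Jaccard similarity replaces the deep learning based text similarity model, so the code shows only the control flow, not the actual models. All function names are hypothetical.</p>
        <preformat>
```python
def recall_top_k(query, index, k=20):
    """Stand-in for Lucene recall: rank indexed questions by shared tokens."""
    query_tokens = set(query.lower().split())
    scored = [
        (len(query_tokens.intersection(question.lower().split())), question)
        for question in index
    ]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [question for _, question in scored[:k]]


def similarity(a, b):
    """Stand-in for the learned text-similarity model (Jaccard over tokens)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a.union(tokens_b)
    return len(tokens_a.intersection(tokens_b)) / len(union) if union else 0.0


def answer(query, index, k=20, threshold=0.85):
    """index maps an indexed question to its knowledge graph based answer."""
    candidates = recall_top_k(query, index, k)
    if not candidates:
        return None
    best = max(candidates, key=lambda candidate: similarity(query, candidate))
    # Below the threshold, returning no answer is preferred to a poor match.
    return index[best] if similarity(query, best) >= threshold else None


qa_index = {
    "where was joe hisaishi born": "answer-1",
    "who is joe hisaishi": "answer-2",
}
hit = answer("where was joe hisaishi born", qa_index)
miss = answer("where was joe hisaishi educated", qa_index)
```
        </preformat>
        <p>A larger K raises the chance that the correct question is among the candidates but increases the number of similarity computations per request.</p>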
        <p>For the text-matching threshold, we check values in (0,1) at an interval of
0.1 with respect to the F1-value of the final answer, and a threshold of 0.85
obtains the best F1-value.</p>
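        <p>The threshold search is a grid sweep that scores each candidate threshold by F1. The following sketch shows the idea on toy data; the function names, the scores and the labels are invented for illustration, and the grid step is left as a parameter.</p>
        <preformat>
```python
def f1_at(threshold, scores, labels):
    """F1 of the rule 'accept a candidate if its score reaches the threshold'."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if y and not p)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels, step=0.1):
    """Return the first grid point in (0, 1) with the highest F1."""
    points = round(1 / step)
    grid = [round(i * step, 10) for i in range(1, points)]
    return max(grid, key=lambda t: f1_at(t, scores, labels))


toy_scores = [0.95, 0.90, 0.88, 0.60, 0.30, 0.20]
toy_labels = [True, True, True, False, False, False]
coarse = best_threshold(toy_scores, toy_labels)            # step 0.1
fine = best_threshold(toy_scores, toy_labels, step=0.05)   # finer grid
```
        </preformat>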
      </sec>
      <sec id="sec-3-4">
        <title>Experimental Results</title>
        <p>The main purpose of the proposed framework is to increase the coverage of AliMe
Chat and reduce the ‘no-answer’ situations. In real online testing, the coverage
of AliMe Chat within the whole AliMe Assist increased from 4.18% to 4.87%,
a relative increase of 16.5%.</p>
        <p>In Fig. 2, we show several examples of online results of the proposed approach. In
the left sub-figure, the first user question is a frequently asked question that can be
answered with the knowledge graph, so it has been indexed with Lucene. The second question
contains the entity ‘East Hope’, which is unambiguous in the knowledge graph, so we can
answer it with ‘East Hope Group is a company’ or ‘East Hope Group is in electrolytic
aluminum industry’, etc. In the right sub-figure, the first user question contains the
ambiguous name ‘James’, which is also a ‘half’ (partial) person name. We can offer the user some
choices for this ambiguous partial name, and if the user then chooses one of them and
asks a related question, we can continue to answer it.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Future Work</title>
      <p>This paper presents only preliminary work. Knowledge based multi-turn conversation in
e-commerce chatbots will be a key point of our future work, and
knowledge based named entity disambiguation models, especially for
abbreviation disambiguation, are expected to help produce better responses.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Qiu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , AliMe Chat:
          <article-title>A Sequence to Sequence and Rerank based Chatbot Engine</article-title>
          .
          <source>In ACL'17.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qiu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <article-title>Modelling Domain Relationships for Transfer Learning on Retrieval-based Question Answering Systems in Ecommerce</article-title>
          .
          <source>WSDM</source>
          <year>2018</year>
          :
          <fpage>682</fpage>
          -
          <lpage>690</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zafar</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napolitano</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <article-title>Formal query generation for question answering over knowledge bases</article-title>
          .
          <source>In ESWC'18.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>F. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qiu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <article-title>AliMe assist: an intelligent assistant for creating an innovative e-commerce experience</article-title>
          .
          <source>In CIKM'17.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <article-title>Summarizing Microblogging Users with Existing Well-defined Hashtags</article-title>
          .
          <source>International Journal of Asian Language Processing</source>
          <volume>23</volume>
          (
          <issue>2</issue>
          ):
          <fpage>111</fpage>
          -
          <lpage>125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rinaldi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dowdall</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hess</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mollá</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwitter</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaljurand</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Knowledge-based question answering</article-title>
          .
          <source>In KES'03.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dubey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaudhuri</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , EARL:
          <article-title>Joint entity and relation linking for question answering over knowledge graphs</article-title>
          .
          <source>In ISWC 2018</source>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>