<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dual-enhanced Word Representations based on Knowledge Base</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fangyuan He</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Zhou</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haodi Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiyong Feng</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Computer Science and Software Engineering, Shenzhen University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>School of Computer Science and Technology, Tianjin University</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computing, Engineering and Mathematics, Western Sydney University</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
<institution>School of Software, Tianjin University</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we propose an approach for enhancing word representations twice based on large-scale knowledge bases. In the first layer of enhancement, we use the knowledge base as another contextual form corresponding to the corpus and add it to the training of distributional semantics models, both neural-network-based and matrix-based. In the second layer, we utilize local features of the knowledge base to enhance the word representations by mutual reinforcement between a keyword and its strongly associated words. We evaluate our approach not only on well-known datasets but also on a brand-new dataset, IQ-Synonym-323. The results show that our approach compares favorably to other word representations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Word representations, as a fundamental tool of NLP, have become an increasingly
important research topic. Currently, distributional semantics models that follow the
distributional hypothesis are the most popular approach to word
representations. They commonly rely on statistics derived from a large text corpus,
and it has been shown that the larger the corpus, the better the model
performs in most tasks [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ]. However, enlarging the corpus also incurs obvious limitations. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] it is
shown that using a specific-domain corpus has a definite advantage in
addressing a specific task. As a corpus expands, it covers wider domains and
concomitantly introduces more mixed information into the context, so models relying
on corpora as context are hindered in accuracy.
      </p>
      <p>As another line of word representation research that attracts increasing
attention, knowledge-based approaches mainly rely on external structured
databases. The abundant and explicit lexical relationships between lexical items
in such databases can compensate for the blurring of contexts in a large corpus.</p>
      <p>
        In this work, we propose an approach with double enhancement based on a
large lexical database. Firstly, we take the related words in the knowledge base as
additional, more accurate context compared with the large corpus. Afterwards,
inspired by Kiela et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], both contexts are added to the training process
of representative distributional semantics models for the first enhancement. In addition,
we take advantage of the related words again to construct the second-layer
enhancement, a tuning process that highlights the strongly associated
words in the extracted knowledge base. Our approach with double enhancement
performs strongly on benchmarks including SimLex-999 and the
brand-new dataset IQ-Synonym-323 that we build.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <sec id="sec-2-1">
        <title>Knowledge Base as Accurate Context for Training</title>
        <p>
          The knowledge base we use in our approach is composed of a large number of
one-to-many relationship structures: given a keyword, the knowledge base
lists its most closely semantically related words. Therefore, in the first enhancement
of our approach, we take the related words provided by the knowledge base as a
relatively accurate context for each keyword and inject them into existing
representative distributional semantics models. For comparison, we select skip-gram [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
and GloVe [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] as representative neural-network-based and matrix-based
distributional semantics models, respectively.
        </p>
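        <p>Concretely, the knowledge-base structure we rely on can be pictured as a simple mapping from each keyword to its related-word set; the sketch below is only illustrative (the entries are examples, not extracted data).</p>
        <preformat>
# Illustrative sketch of the assumed knowledge-base structure: a one-to-many
# mapping from a keyword to its closest semantically related words.
knowledge_base = {
    "people": ["human", "person", "folk", "citizen"],
    "human": ["people", "person", "mankind"],
}

def related_words(keyword):
    """Return the related-word set A_w used as accurate context for a keyword."""
    return knowledge_base.get(keyword, [])
        </preformat>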
        <p>Neural Network Based The original skip-gram is a neural network
framework with a single hidden layer. Its basic idea is to maximize the
probability of the words appearing near a keyword. After the first step of the original
training on the large text corpus, our added step follows formula (1): the
objective of the second step is to maximize the following average log probability, where
w_1, w_2, ..., w_T is the sequence of training words and, for keyword w_t, A_{w_t} is the set of its
related words. The size of this set is regarded as the context window size
in the additional training step. We name this approach SG-KB-I.
        </p>
        <p>\frac{1}{T}\sum_{t=1}^{T}\sum_{w_a \in A_{w_t}} \log p(w_a \mid w_t) \qquad (1)</p>
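        <p>As a rough illustration of this added step (not the authors' code), the sketch below continues training only on knowledge-base pairs: for each keyword w_t it raises log p(w_a | w_t) for every related word w_a in A_{w_t}, using negative sampling in place of the full softmax. The matrices W_in and W_out, the function name, and all hyperparameter values are assumptions made for the sketch.</p>
        <preformat>
import numpy as np

def sg_kb_step(W_in, W_out, vocab_index, knowledge_base,
               lr=0.025, negatives=5, rng=None):
    """One pass of the hypothetical SG-KB-I extra step (Eq. 1), negative sampling."""
    rng = rng if rng is not None else np.random.default_rng(0)
    vocab_size = W_out.shape[0]
    for keyword, related in knowledge_base.items():
        if keyword not in vocab_index:
            continue
        t = vocab_index[keyword]
        for w_a in related:          # A_{w_t}; its size plays the role of the window
            if w_a not in vocab_index:
                continue
            # one positive target plus a few random negatives
            targets = [vocab_index[w_a]] + list(rng.integers(0, vocab_size, negatives))
            labels = np.array([1.0] + [0.0] * negatives)
            v_t = W_in[t].copy()
            vecs = W_out[targets]                         # (k+1, d)
            scores = 1.0 / (1.0 + np.exp(-vecs @ v_t))    # sigmoid scores
            grad = scores - labels                        # (k+1,)
            W_in[t] -= lr * grad @ vecs                   # update keyword vector
            W_out[targets] -= lr * np.outer(grad, v_t)    # update target vectors
    return W_in
        </preformat>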
        <p>Matrix Based GloVe is an unsupervised learning approach that emphasizes the
superiority of co-occurrence ratios for capturing word relevance and trains a
log-bilinear regression model on a global word-word co-occurrence matrix.</p>
        <p>Each cell of the original matrix is the co-occurrence frequency of two words within a
fixed-length context window over the text corpora. In our approach of
attaching the knowledge base as accurate context, we add the co-occurrence frequency of
each keyword-related-word pair in the knowledge base to the original matrix. In this way,
the modified cell values adjust the degree of association between words.
Then the original algorithm is applied to the new matrix to improve the word
representations. We call this approach GloVe-KB-I.</p>
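        <p>A minimal sketch of how such a matrix could be assembled (function and parameter names, including the kb_weight added per pair, are our assumptions): corpus counts are gathered with the usual distance-weighted window, then each keyword-related-word pair from the knowledge base contributes an extra co-occurrence before the unchanged GloVe algorithm is run.</p>
        <preformat>
from collections import defaultdict

def build_cooccurrence(corpus_sentences, knowledge_base, window=5, kb_weight=1.0):
    """Corpus co-occurrence counts plus injected knowledge-base co-occurrences."""
    X = defaultdict(float)                 # (word, word) -> co-occurrence count
    for sent in corpus_sentences:          # standard corpus pass
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    X[(w, sent[j])] += 1.0 / abs(i - j)   # distance weighting
    for keyword, related in knowledge_base.items():       # inject KB context
        for w_a in related:
            X[(keyword, w_a)] += kb_weight
            X[(w_a, keyword)] += kb_weight
    return X
        </preformat>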
      </sec>
      <sec id="sec-2-2">
        <title>Enhancement Based on Features of Knowledge Base</title>
        <p>Within our extracted knowledge base, some pairs of words are mutually related.
For instance, for the keyword "people", "human" is one of its related words in
the knowledge base; meanwhile, "people" is also in the related word set of the keyword
"human". We consider "people" and "human" a strongly associated word
pair. For these word pairs, we attempt to tune their representations by mutual
reinforcement. In formula (2), W^n_{sr} is the set of strongly associated words of
keyword w, v_{sr} denotes the vectors of the elements in this set, and n is the size of
the set. W^m_{cr} is the set of commonly associated words of keyword w, v_{cr} denotes their
vectors, and their number is m. W^n_{sr} and W^m_{cr} together form the related word set of
keyword w in the knowledge base. We set a weight \lambda on the strongly associated
words, so as to pull the keyword closer to the strongly associated words than to the
common ones.</p>
        <p>v_w = \frac{1}{n+m}\left(\lambda \sum_{v_{sr} \in W^n_{sr}} v_{sr} + \sum_{v_{cr} \in W^m_{cr}} v_{cr}\right) \qquad (2)</p>
        <p>Afterwards, we use SG-KB-I and GloVe-KB-I as the initial vectors v_i. We
tune each keyword's vector v_t from v_w and v_i, where \alpha and \beta are the weight coefficients:</p>
        <p>v_t = \alpha v_w + \beta v_i</p>
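        <p>A sketch of this second-layer tuning under our reading of formula (2) and the v_t update follows; the values of \lambda, \alpha and \beta are placeholders, not taken from the paper.</p>
        <preformat>
import numpy as np

def tune_vector(keyword, vectors, knowledge_base, lam=2.0, alpha=0.5, beta=0.5):
    """Second-layer enhancement: up-weight strongly associated words, then mix."""
    related = knowledge_base.get(keyword, [])
    strong = [w for w in related if keyword in knowledge_base.get(w, [])]   # W_sr
    common = [w for w in related if w not in strong]                        # W_cr
    n, m = len(strong), len(common)
    if n + m == 0:
        return vectors[keyword]
    v_w = (lam * sum(vectors[w] for w in strong)
           + sum(vectors[w] for w in common)) / (n + m)                     # Eq. (2)
    return alpha * v_w + beta * vectors[keyword]                            # v_t
        </preformat>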
      </sec>
      <sec id="sec-2-3">
        <title>Knowledge Base</title>
        <p>Compared with raw corpus data, the knowledge base demonstrates clearer
relations between words. We choose two large lexical databases as our sources,
namely WordNet and ConceptNet, which contain adequate concepts and a very
broad range of word relationships. We extract more than 155 thousand keywords
from WordNet and 766 thousand from ConceptNet. After combining the two
parts through their shared lexical items, we finally obtain 777 thousand keywords with
related words, which constitute our lexical relation knowledge base.</p>
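        <p>The WordNet side of this extraction can be sketched as follows; the relations actually used are not specified here, so collecting synset lemma names is an assumption, and the ConceptNet part and the merge are only indicated.</p>
        <preformat>
from nltk.corpus import wordnet as wn

def wordnet_related(keyword):
    """Collect lemma names of a keyword's synsets as its related words (assumed)."""
    related = set()
    for synset in wn.synsets(keyword):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name != keyword:
                related.add(name)
    return sorted(related)

# Hypothetical merge with a ConceptNet-derived dict `conceptnet_kb` and a keyword
# list `wordnet_keywords` (both assumed precomputed from the respective dumps):
# knowledge_base = {w: sorted(set(wordnet_related(w)) | set(conceptnet_kb.get(w, [])))
#                   for w in set(conceptnet_kb) | set(wordnet_keywords)}
        </preformat>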
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Evaluation</title>
      <sec id="sec-3-1">
        <title>Dataset</title>
        <p>We evaluate our representations not only on well-known datasets but also
on a brand-new dataset that we build. We construct the new dataset by collecting
323 synonym questions from real IQ test books and websites for testing
human intelligence, and name it IQ-Synonym-323. The questions we collected come in
several types, such as "Choose the word most similar
in meaning to X" or "Which word is closest to X?", but all of them can be
reorganized into a keyword and a set of candidate words, which is how we store
them. Our dataset will be made openly available. Table 1 shows a sample.</p>
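        <p>Once a question is reorganized into a keyword and candidate words, it can be scored as below: the candidate whose vector is most cosine-similar to the keyword's vector is chosen, and accuracy is the fraction of questions answered correctly. This is a sketch with illustrative names, not the authors' evaluation script.</p>
        <preformat>
import numpy as np

def answer_question(keyword, candidates, vectors):
    """Pick the candidate whose vector is most cosine-similar to the keyword's."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(candidates, key=lambda c: cos(vectors[keyword], vectors[c]))

def accuracy(questions, vectors):
    """questions: list of (keyword, candidates, gold_answer) tuples."""
    correct = sum(answer_question(k, cands, vectors) == gold
                  for k, cands, gold in questions)
    return correct / len(questions)
        </preformat>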
      </sec>
      <sec id="sec-3-2">
        <title>Experimental Result</title>
        <p>
          We choose an 11 GB dump of English Wikipedia as the text corpus. Table 2 shows
the performance of all comparison approaches, including skip-gram and GloVe
as the starting points, ConceptNet [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and Counter-fitting [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] as state-of-the-art
models, SG-KB-I and GloVe-KB-I from the first layer of our
approach, and SG-KB-II and GloVe-KB-II, two results trained by the second layer
with different initial values. Compared with the starting points, both layers
of our approach improve performance on the two benchmarks. SG-KB-II
performs best on our dataset. Counter-fitting, which takes embeddings tuned
with SimLex-999 as its starting point, has a particular advantage on that benchmark;
however, it does not perform as well on our IQ-Synonym-323.
        </p>
        <p>[Table 2. Results on SimLex-999 (correlation) and IQ-Synonym-323 (accuracy) for skip-gram, GloVe, GloVe-KB-I, SG-KB-I, GloVe-KB-II, SG-KB-II, ConceptNet, and Counter-fitting.]</p>
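        <p>For SimLex-999 we assume the usual protocol behind such tables: Spearman correlation between human similarity ratings and cosine similarities of the learned vectors, as in the following sketch (names and the pair format are ours).</p>
        <preformat>
import numpy as np
from scipy.stats import spearmanr

def simlex_spearman(simlex_pairs, vectors):
    """simlex_pairs: list of (word1, word2, human_score) tuples."""
    human, model = [], []
    for w1, w2, score in simlex_pairs:
        if w1 in vectors and w2 in vectors:
            v1, v2 = vectors[w1], vectors[w2]
            human.append(score)
            model.append(float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))))
    rho, _ = spearmanr(human, model)   # rank correlation between ratings and model
    return rho
        </preformat>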
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we propose a double-enhancement approach relying on a
knowledge base. Since the knowledge base can specify more accurate related words of
keywords as context information, we use it to compensate for the noise
introduced by the multiple domains covered by a large corpus. Utilizing the features of
the knowledge base twice brings two significant improvements, as shown in Table 2.
We evaluate our approach on the well-known SimLex-999 and the brand-new
dataset IQ-Synonym-323. The strong performance demonstrates the advantage
of our approach in capturing more accurate semantic similarity between similar
vocabulary items under large-scale corpora.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kiela</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Specializing Word Embeddings for Similarity or Relatedness</article-title>
          .
          <source>In: EMNLP</source>
          . pp.
          <fpage>2044</fpage>
          –
          <lpage>2048</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>CoRR abs/1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Mrkšić</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ó Séaghdha</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gašić</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rojas-Barahona</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandyke</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          :
          <article-title>Counter-fitting Word Vectors to Linguistic Constraints</article-title>
          .
          <source>CoRR abs/1603.00892</source>
          (
          <year>2016</year>
          ), http://arxiv.org/abs/1603.00892
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          .
          <source>In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          –
          <lpage>1543</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Speer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Havasi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>ConceptNet 5.5: An Open Multilingual Graph of General Knowledge</article-title>
          . In: AAAI. pp.
          <fpage>4444</fpage>
          –
          <lpage>4451</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Stenetorp</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soyer</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pyysalo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chikayama</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Size (and Domain) Matters: Evaluating Semantic Word Space Representations for Biomedical Text</article-title>
          .
          <source>Proceedings of SMBM 12</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>