Dual-enhanced Word Representations based on Knowledge Base

Fangyuan He1, Yi Zhou2,∗, Haodi Zhang3, Zhiyong Feng4

1 School of Computer Science and Technology, Tianjin University
2 School of Computing, Engineering and Mathematics, Western Sydney University
3 College of Computer Science and Software Engineering, Shenzhen University
4 School of Software, Tianjin University

Abstract. In this paper, we propose an approach that enhances word representations twice based on large-scale knowledge bases. In the first layer of enhancement, we use the knowledge base as an additional form of context alongside the corpus and add it to the training of distributional semantics models, both neural-network-based and matrix-based. In the second layer, we exploit local features of the knowledge base to enhance the word representations through mutual reinforcement between a keyword and its strongly associated words. We evaluate our approach not only on well-known datasets but also on a brand-new dataset, IQ-Synonym-323. The results show that our approach compares favorably with other word representations.

1 Introduction

Word representations, a fundamental tool of NLP, have become an increasingly important research topic. Currently, distributional semantics models, which follow the distributional hypothesis, are the most popular approach to word representation. They commonly rely on statistics derived from a large text corpus, and it has been shown that the larger the corpus, the better the model performs on most tasks [2, 4]. However, large corpora also bring obvious limitations. In [6] it is shown that a domain-specific corpus has a definite advantage when addressing a specific task. As a corpus grows, it covers wider domains and consequently introduces more mixed information into the context. Models that rely on corpora as context therefore suffer in accuracy.

Knowledge-based approaches, another line of word representation that attracts increasing attention, rely mainly on external structured databases. The abundant and explicit relationships between lexical items in such databases can compensate for the blurring of context in a large corpus.

In this work, we propose a double-enhancement approach based on large lexical databases. First, we take the related words in the knowledge base as additional context that is more accurate than that of the large corpus. Then, inspired by Kiela et al. [1], both contexts are added to the training process of representative distributional semantics models as the first enhancement. In addition, we use the related words again to construct the second-layer enhancement, a tuning process that highlights the strongly associated words in the extracted knowledge base. Our doubly enhanced approach performs strongly on benchmarks including SimLex-999 and IQ-Synonym-323, a brand-new dataset that we build.

2 Approach

2.1 Knowledge Base as Accurate Context for Training

The knowledge base we use is composed of a large number of one-to-many relationship structures: given a keyword, the knowledge base lists its most closely semantically related words. In the first enhancement of our approach, we therefore take the related words provided by the knowledge base as a relatively accurate context for each keyword and inject them into existing representative distributional semantics models.
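To make this concrete, the following sketch (ours, not from the paper; the entries and helper name are hypothetical) shows how such a one-to-many knowledge base can be turned into (keyword, related word) pairs, with a keyword's related-word set playing the role of its context window.

```python
# Illustrative sketch only: a tiny one-to-many lexical knowledge base.
# The entries below are made-up examples, not the actual extracted KB.
kb = {
    "people": ["human", "person", "folk"],
    "august": ["dignified", "majestic", "venerable"],
}

def kb_context_pairs(kb):
    """Yield (keyword, related_word) pairs; the related-word set of a keyword
    acts as its context window in the additional training step."""
    for keyword, related in kb.items():
        for rel in related:
            yield keyword, rel

for pair in kb_context_pairs(kb):
    print(pair)  # e.g. ('people', 'human'), ('people', 'person'), ...
```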
For comparison, we select skip-gram [2] and GloVe [4], which represent the neural-network-based and matrix-based families of distributional semantics models, respectively.

Neural Network Based. The original skip-gram model is a neural network with a single hidden layer; its basic idea is to predict the words that appear near a keyword. After the first training step on the large text corpus, we add a second step whose objective is to maximize the average log probability in formula (1), where w_1, w_2, ..., w_T is the sequence of training words and, for keyword w_t, A_{w_t} is the set of its related words. The size of this set is used as the context window size in the additional training step. We name this approach SG-KB-I.

\frac{1}{T} \sum_{t=1}^{T} \sum_{w_a \in A_{w_t}} \log p(w_a \mid w_t)    (1)

Matrix Based. GloVe is an unsupervised learning approach that emphasizes ratios of word co-occurrence probabilities and trains a log-bilinear regression model on a global word-word co-occurrence matrix. Each cell of the original matrix holds the co-occurrence frequency of two words within a fixed-length context window over the text corpus. To attach the knowledge base as accurate context, we add the keyword-related-word co-occurrence counts from the knowledge base to the original matrix, so that the modified cell values adjust the degree of association between words. The original algorithm is then applied to the new matrix to improve the word representations. We call this approach GloVe-KB-I.

2.2 Enhancement Based on Features of Knowledge Base

Within our extracted knowledge base, some pairs of words are mutually related. For instance, for the keyword "people", "human" is one of its related words, and "people" is in turn in the related-word set of the keyword "human". We consider "people" and "human" a strongly associated word pair, and for such pairs we tune the representations by mutual reinforcement. In formula (2), W^n_{sr} is the set of strongly associated words of keyword w, v_{sr} denotes the vector of an element of this set, and n is the size of the set; W^m_{cr} is the set of commonly associated words of w, v_{cr} denotes their vectors, and m is its size. Together, W^n_{sr} and W^m_{cr} form the related-word set of keyword w in the knowledge base. We assign a weight α to the strongly associated words so as to pull the keyword closer to them than to the commonly associated ones.

v_w = \frac{1}{n \alpha + m} \left( \alpha \sum_{v_{sr} \in W^n_{sr}} v_{sr} + \sum_{v_{cr} \in W^m_{cr}} v_{cr} \right)    (2)

Afterwards, we use the SG-KB-I or GloVe-KB-I vectors as the initial vectors v_i and tune each keyword's vector v_t by combining v_w and v_i, where γ and β are weight coefficients:

v_t = \gamma v_w + \beta v_i

2.3 Knowledge Base

Compared with raw corpus data, the knowledge base provides clearer relations between words. We choose two large lexical databases as our sources, WordNet and ConceptNet, which contain abundant concepts and a very broad range of word relationships. We extract more than 155 thousand keywords from WordNet and 766 thousand from ConceptNet. After merging the two parts on shared lexical items, we obtain 777 thousand keywords with their related words, which constitute our lexical relation knowledge base.
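As an illustration of the second-layer tuning in Section 2.2, the sketch below (ours; the weight values for α, γ, and β are placeholders, not the paper's settings) computes v_w from the vectors of the strongly and commonly associated words according to formula (2) and then combines it with an initial vector v_i taken from SG-KB-I or GloVe-KB-I.

```python
import numpy as np

def tune_vector(v_i, strong_vecs, common_vecs, alpha=2.0, gamma=0.5, beta=0.5):
    """Second-layer tuning sketch: formula (2) followed by v_t = gamma*v_w + beta*v_i.
    v_i          -- initial keyword vector from SG-KB-I or GloVe-KB-I
    strong_vecs  -- vectors of the strongly associated words (W^n_sr)
    common_vecs  -- vectors of the commonly associated words (W^m_cr)
    alpha/gamma/beta -- weight hyperparameters (placeholder values here)."""
    n, m = len(strong_vecs), len(common_vecs)
    v_w = (alpha * np.sum(strong_vecs, axis=0) + np.sum(common_vecs, axis=0)) / (n * alpha + m)
    return gamma * v_w + beta * v_i

# Toy usage with random 5-dimensional vectors.
rng = np.random.default_rng(0)
v_i = rng.normal(size=5)
strong = rng.normal(size=(3, 5))  # three strongly associated words
common = rng.normal(size=(4, 5))  # four commonly associated words
print(tune_vector(v_i, strong, common))
```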
3 Experimental Evaluation

3.1 Dataset

We evaluate our representations not only on a well-known dataset but also on a brand-new dataset that we build. We construct the new dataset by collecting 323 synonym questions from real IQ test books and websites, and name it IQ-Synonym-323. The collected questions come in several forms, such as "Choose the word most similar in meaning to X" or "Which word is closest to X?", but all of them can be reduced to a keyword plus candidate words, into which we reorganize them. The dataset will be released as open source. Table 1 shows a sample.

Table 1: A sample from IQ-Synonym-323.

IQ question format     "Which word is closest to the AUGUST?"
                       A. Common  B. Ridiculous  C. Dignified  D. Petty   Answer: C
Reorganized format     AUGUST::Common,Ridiculous,Dignified,Petty::Dignified

3.2 Experimental Result

We use an 11 GB dump of English Wikipedia as the text corpus. Table 2 shows the performance of all compared approaches: skip-gram and GloVe as the starting points; ConceptNet [5] and Counter-fitting [3] as state-of-the-art models; SG-KB-I and GloVe-KB-I from the first layer of our approach; and SG-KB-II and GloVe-KB-II, the results of the second layer trained with the corresponding initial vectors. Compared with the starting points, both layers of our approach improve performance on the two benchmarks, and SG-KB-II performs best on our dataset. Counter-fitting, which takes embeddings tuned on SimLex-999 as its starting point, has a particular advantage on that benchmark; however, it does not perform as well on IQ-Synonym-323.

Table 2: Performance of the starting points, our approaches, and state-of-the-art models on SimLex-999 and IQ-Synonym-323.

Approach          SimLex-999   IQ-Synonym-323
skip-gram         0.39         60.14%
GloVe             0.35         59.61%
GloVe-KB-I        0.44         72.34%
SG-KB-I           0.60         75.25%
GloVe-KB-II       0.55         80.85%
SG-KB-II          0.64         84.08%
ConceptNet        0.61         81.19%
Counter-fitting   0.74         64.75%

4 Conclusion

In this paper, we propose a double-enhancement approach that relies on a knowledge base. Since the knowledge base specifies more accurate related words for a keyword as context information, we use it to compensate for the noise introduced by the many domains covered by a large corpus. Exploiting the features of the knowledge base twice brings two significant improvements, as shown in Table 2. We evaluate our approach on the well-known SimLex-999 and on the brand-new dataset IQ-Synonym-323. The strong performance demonstrates the advantage of our approach in capturing more accurate semantic similarity between related words when training on large-scale corpora.

References

1. Kiela, D., Hill, F., Clark, S.: Specializing Word Embeddings for Similarity or Relatedness. In: EMNLP. pp. 2044–2048 (2015)
2. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013)
3. Mrkšić, N., Ó Séaghdha, D., Thomson, B., Gašić, M., Rojas-Barahona, L.M., Su, P., Vandyke, D., Wen, T., Young, S.J.: Counter-fitting Word Vectors to Linguistic Constraints. CoRR abs/1603.00892 (2016), http://arxiv.org/abs/1603.00892
4. Pennington, J., Socher, R., Manning, C.: GloVe: Global Vectors for Word Representation. In: EMNLP. pp. 1532–1543 (2014)
5. Speer, R., Chin, J., Havasi, C.: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In: AAAI. pp. 4444–4451 (2017)
6. Stenetorp, P., Soyer, H., Pyysalo, S., Ananiadou, S., Chikayama, T.: Size (and Domain) Matters: Evaluating Semantic Word Space Representations for Biomedical Text. In: Proceedings of SMBM (2012)