<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Building Memory with Concept Learning Capabilities from Large-scale Knowledge Base</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiaxin Shi, Jun Zhu</string-name>
          <email>ishijiaxin@126.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Tsinghua University, Beijing</institution>
          ,
          <addr-line>100084</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a new perspective on neural knowledge base (KB) embeddings, from which we build a framework that can model symbolic knowledge in the KB together with its learning process. We show that this framework effectively regularizes previous neural KB embedding models, yielding superior performance on reasoning tasks, while also being able to deal with unseen entities, that is, to learn their embeddings from natural language descriptions, which closely mirrors how humans learn semantic concepts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recent years have seen great advances in neural networks and their applications in modeling images
and natural languages. With deep neural networks, people are able to achieve superior performance
in various machine learning tasks [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. One of these is relational learning, which aims at
modeling relational data such as user-item relations in recommendation systems, social networks,
and knowledge bases. In this paper we mainly focus on knowledge bases.
      </p>
      <p>
        Generally, a knowledge base (KB) consists of triplets (or facts) of the form (e1, r, e2), where e1 and e2
denote the left and right entities, and r denotes the relation between them. Previous works on
neural KB embeddings model entities and relations with distributed representations, i.e., vectors [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
or matrices [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and learn them from the KB. These have proven to be scalable approaches for relational
learning. Experiments also show that neural embedding models obtain state-of-the-art performance on
reasoning tasks like link prediction. Section 2 covers more related work.
      </p>
      <p>
        Although such methods for neural modeling of KBs have shown promising results on reasoning tasks,
they have the limitation of only addressing known entities that appear in the training set, and they do
not generalize well to settings with unseen entities. Because they do not know the embedding
representations of new entities, they cannot establish relations with them. On the other hand, the
capability of a KB to learn new concepts as entities, or more specifically, to learn what a certain name
used by humans means, is obviously highly useful, particularly in a KB-based dialog system. We
observe that during conversations humans do this by first asking for an explanation and then
establishing knowledge about the concept from other people's natural language descriptions. This
inspired our framework for modeling the human cognitive process of learning concepts during
conversations, i.e., the process from a natural language description to a concept in memory. We use
a neural embedding model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to model the memory of concepts. When given the description text of
a new concept, our framework directly transforms it into an entity embedding, which captures
semantic information about this concept. The entity embedding can be stored and later used for other
semantic tasks. (Concept learning in cognitive science usually refers to the cognitive process by which people grow abstract
generalizations from several example objects [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; we use concept learning here to denote a different behavior.) Details of our framework are described in Section 3. We will show the efficiency of this
framework in modeling entity relationships, which involves both natural language understanding and
reasoning.
      </p>
      <p>Our perspective of modeling symbolic knowledge together with its learning process has two main
advantages. First, it enables us to incorporate natural language descriptions to augment the modeling of
relational data, which fits well with the human behavior of learning concepts during conversations. Second,
we also utilize the large number of symbolic facts in the knowledge base as labeled information to guide
the semantic modeling of natural language. The novel perspective, together with the framework, forms the
key contribution of this work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Statistical relational learning has long been an important topic in machine learning. Traditional
methods such as Markov logic networks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] often suffer from scalability issues due to intractable
inference. Following the success of low-rank models [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] in collaborative filtering, tensor
factorization [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] was proposed as a more general form to deal with multi-relational learning (i.e.,
multiple kinds of relations exist between two entities). Another perspective is to regard elements in
factorized tensors as probabilistic latent features of entities. This leads to methods that apply
nonparametric Bayesian inference to learn latent features [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ] for link prediction. Also, attempts
have been made to address the interpretability of latent feature based models under the framework
of Bayesian clustering [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. More recently, with the noticeable achievements of neural embedding
models like word vectors [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] in natural language processing area, various neural embedding models
[
        <xref ref-type="bibr" rid="ref17 ref18 ref4 ref5 ref6">6, 17, 5, 4, 18</xref>
        ] for relational data have been proposed as strong competitors in both scalability and
predictive performance for reasoning tasks.
      </p>
      <p>
        All the methods above model relational data under the latent-feature assumption, a
common perspective in machine learning for achieving high performance in prediction tasks. However, these
models leave all latent features to be learnt from data, which leads to substantial increases in
model complexity when applied to large-scale knowledge bases. For example, [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] can be seen
as having a feature vector for each entity in the factorized tensors, while [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] also represents entities as
separate vectors, or embeddings, so the number of parameters scales linearly with the number of
entities. The large number of parameters in these models often increases the risk of overfitting, yet few
of these works have proposed effective regularization techniques to address it. Moreover,
when applied to real-world tasks (e.g., knowledge base completion), most of these models
share the limitation that entities unseen in the training set cannot be dealt with; that is, they can
only complete relations between known entities, which is far from the human ability to learn
new concepts. From this perspective, we develop a general framework that is capable
of modeling symbolic knowledge together with its learning process, as detailed in Section 3.
      </p>
    </sec>
    <sec id="sec-3">
      <title>The framework</title>
      <p>
        Our framework consists of two parts. The first part is a memory storage of embedding
representations. We use it to model the large-scale symbolic knowledge in the KB, which can be thought of as
memory of concepts. The other part is a concept learning module, which accepts natural language
descriptions of concepts as the input, and then transforms them into entity embeddings in the same
space of the memory storage. In this paper we use translating embedding model from [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as our
memory storage and use neural networks for the concept learning module.
      </p>
      <sec id="sec-3-1">
        <title>Translating embedding model as memory storage</title>
        <p>
          We first describe the translating embedding (TransE) model [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which we use as the memory storage of
concepts. In TransE, relationships are represented as translations in the embedding space. Suppose
we have a training set D of N true facts (e1, r, e2). If a fact (e1, r, e2) is true, then
TransE requires e1 + r to be close to e2. Formally, we define the set of entity vectors as E and the set
of relation vectors as R, where R, E ⊂ R^n, e1, e2 ∈ E, r ∈ R. Let d be some distance measure,
either the L1 or the L2 norm. TransE minimizes a margin loss between the scores of true facts
in the training set and those of randomly generated facts, which serve as negative samples:
        </p>
        <p>
          L(D) = Σ_{(e1, r, e2) ∈ D} Σ_{(e1′, r, e2′) ∈ D′_{(e1, r, e2)}} max(0, γ + d(e1 + r, e2) − d(e1′ + r, e2′)),   (1)
where D′_{(e1, r, e2)} = {(e1′, r, e2) : e1′ ∈ E} ∪ {(e1, r, e2′) : e2′ ∈ E} and γ is the margin. Note
that this loss favors lower distances between translated left entities and right entities for training facts
than for randomly generated facts in D′. The model is optimized by stochastic gradient descent (SGD) with
mini-batches. In addition, TransE constrains the L2 norms of entity embeddings to be 1, which is essential
for SGD to perform well according to [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], because it prevents the training process from trivially
minimizing the loss by increasing the entity embedding norms.
        </p>
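As an illustration, the margin loss in equation (1) can be sketched in a few lines of numpy. This is a hypothetical minimal version for a single true/corrupted pair; mini-batching, negative sampling, and the unit-norm constraint on entities are omitted, and it is not the authors' implementation.

```python
import numpy as np

def transe_loss(e1, r, e2, e1_neg, e2_neg, margin=1.0, norm=2):
    """Margin loss for one true fact (e1, r, e2) and one corrupted fact.

    d(e1 + r, e2) should be smaller than d(e1_neg + r, e2_neg)
    by at least `margin`, otherwise a positive loss is incurred.
    """
    d_pos = np.linalg.norm(e1 + r - e2, ord=norm)
    d_neg = np.linalg.norm(e1_neg + r - e2_neg, ord=norm)
    return max(0.0, margin + d_pos - d_neg)
```

For a perfectly translated true fact (e1 + r = e2) and a distant corrupted fact, the loss is zero; if the two are swapped, the loss grows with the margin violation.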
        <p>
          There are advantages to using embeddings instead of symbolic representations for cognitive tasks.
For example, it is easier for us to figure out that a person who is a violinist can play the
violin than to tell his father's name. However, in symbolic representations like a knowledge base,
the former fact &lt;A, play, violin&gt; can only be deduced by a reasoning process through the facts
&lt;A, has profession, violinist&gt; and &lt;violinist, play, violin&gt;, which is a
two-step procedure, while the latter can be acquired in one step through the fact &lt;A, has
father, B&gt;. If we look at how TransE embeddings do this task, we see that we can figure out that A plays the
violin by finding the nearest neighbors of A's embedding + play's embedding, which costs at most
the same amount of time as finding out who A's father is. This claim is supported by findings in
cognitive science that the general properties of concepts (e.g., &lt;A, play, violin&gt;) are more
strongly bound to an object than its more specific properties (e.g., &lt;A, has father, B&gt;) [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
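The nearest-neighbor lookup described above (finding entities close to A's embedding + play's embedding) can be sketched as follows; `entity_matrix` is a hypothetical array holding one entity embedding per row, and the names are illustrative rather than from the paper.

```python
import numpy as np

def nearest_entities(query, entity_matrix, k=1):
    """Indices of the k entities closest (in L2 distance) to a query vector.

    For TransE-style reasoning, query = embedding(A) + embedding(play),
    and the nearest rows of entity_matrix are candidate right entities.
    """
    dists = np.linalg.norm(entity_matrix - query, axis=1)
    return np.argsort(dists)[:k]
```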
      </sec>
      <sec id="sec-3-2">
        <title>Concept learning module</title>
        <p>
          As mentioned earlier, the concept learning module accepts natural language descriptions of
concepts as the input, and outputs corresponding entity embeddings. As this requires natural language
understanding with knowledge in the KB transferred into the module, neural networks can be good
candidates for this task. We explore two kinds of neural network architectures for the concept
learning module, including multi-layer perceptrons (MLP) and convolutional neural networks (CNN).
For the MLP, we use one hidden layer with 500 neurons and ReLU activations. Because the MLP is
fully connected, we cannot afford the computational cost when the input length is too long. For large-scale
datasets, the vocabulary size is often in the millions, which means that bag-of-words features
cannot be used. Here, we use bag-of-n-grams features as inputs (there are at most 26^3 = 17576
kinds of 3-grams in pure English text). Given a word, for example "word", we first add starting and
ending marks to it, as in #word#, and then break it into 3-grams (#wo, wor, ord, rd#). Suppose we
have V kinds of 3-grams in our training set. For an input description, we count the numbers of all
kinds of 3-grams in this text, which form a V-dimensional feature vector x. To control the scale of the
input per dimension, we use log(1 + x) instead of x as the input features. Then we feed this vector into
the MLP, with the output to be the corresponding entity embedding under this description.
Since the MLP with bag-of-n-grams features loses word-order information, it has very little sense
of the semantics. Even at the word level, it fails to identify words with similar meanings. From this
point of view, we further explore a convolutional architecture, i.e., a CNN together with word vector
features. Let s = w1 w2 ... wk be the paragraph of a concept description and let v(wi) ∈ R^d be the
vector representation of word wi. In the experiments in this paper, we set d = 50 and initialize
v(wi) with wi's word vector pretrained on a large-scale corpus, using the methods in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Let A^s be
the input matrix for s, defined by
        </p>
        <p>A^s_{:,i} = v(wi),   (2)
where A^s_{:,i} denotes the ith column of the matrix A^s. For the feature maps at the lth layer, F^(l) ∈
R^{c×n×m}, where c is the number of channels, the convolutional layer computes
        </p>
        <p>F^(l+1)_{i,:,:} = Σ_{j=1}^{c} F^(l)_{j,:,:} ∗ K^(l)_{i,j,:,:},   (3)
where K^(l) denotes the convolution kernels at the lth layer, which form an order-4 tensor (output
channels, input channels, y axis, x axis). When modeling natural language, which comes in sequence
form, we choose K^(l) to have the same size along the y axis as the feature maps F^(l). So for the first layer,
whose input has size 1 × D × L, we use kernels of size D × 1 in the last two axes, where D is the
dimension of the word vectors. After the first layer, the last two axes of the feature maps in each layer
remain vectors. We list all layers we use in Table 1, where kernels are described by output
channels × y axis × x axis.</p>
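The bag-of-3-grams featurization described above for the MLP input can be sketched as follows. This is a hypothetical minimal version; `vocab` stands for the V kinds of 3-grams collected from the training set.

```python
import numpy as np

def char_trigrams(word):
    """Break a word into character 3-grams with '#' boundary marks,
    e.g. 'word' -> ['#wo', 'wor', 'ord', 'rd#']."""
    marked = '#' + word + '#'
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def bag_of_trigrams(text, vocab):
    """V-dimensional log(1 + count) feature vector over a fixed trigram vocab."""
    counts = np.zeros(len(vocab))
    index = {g: i for i, g in enumerate(vocab)}
    for word in text.lower().split():
        for g in char_trigrams(word):
            if g in index:
                counts[index[g]] += 1.0
    return np.log1p(counts)  # log(1 + x) scaling, as in the text
```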
        <p>Note that we use neural networks (either MLP or CNN) to output the entity embeddings, while
according to Section 3.1, the embedding model requires the L2 norms of entity embeddings to be
1. This leads to a special normalization layer (the 12th layer in Table 1) designed for our purpose.
Given the output of the second-to-last layer x ∈ R^n, we define the last layer as</p>
        <p>e_k = (w_{k,:}^T x + b_k) / [Σ_{k′=1}^{n} (w_{k′,:}^T x + b_{k′})^2]^{1/2},   (4)
where e is the output embedding. It is easy to show that ‖e‖_2 = 1. Throughout our experiments, we found
that this trick plays an essential role in making joint training of the whole framework work. We
describe the training process in Section 3.3.</p>
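A minimal sketch of the normalization layer in equation (4), assuming a final affine map with weights W and biases b (hypothetical names): the affine output is simply rescaled to unit L2 norm, so the produced embedding always satisfies the constraint of Section 3.1.

```python
import numpy as np

def normalized_output_layer(x, W, b):
    """Affine map followed by L2 normalization (sketch of equation 4).

    Whatever x is, the returned embedding e satisfies ||e||_2 = 1.
    """
    z = W @ x + b
    return z / np.linalg.norm(z)
```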
      </sec>
      <sec id="sec-3-3">
        <title>Training</title>
        <p>
          We jointly train our embedding model and concept learning module by stochastic gradient
descent with mini-batches and Nesterov momentum [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], using the loss defined by equation (1), where
the entity embeddings are given by the outputs of the concept learning module. When doing SGD with
mini-batches, we back-propagate the error gradients into the neural network and, for the CNN, finally
into the word vectors. The relation embeddings are also updated with SGD, and we re-normalize them
in each iteration so that their L2 norms stay 1.
        </p>
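The per-iteration re-normalization of relation embeddings can be sketched as follows; `R` is a hypothetical array holding one relation embedding per row.

```python
import numpy as np

def renormalize_relations(R):
    """Rescale each relation embedding (row of R) back to unit L2 norm,
    as done after every SGD iteration."""
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    return R / norms
```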
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <sec id="sec-4-1">
        <title>Datasets</title>
        <p>
          Since no public datasets satisfy our needs, we have built two new datasets to test our method and
make them public for research use. The first dataset is based on FB15k, released by [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We dump
natural language descriptions of all entities in FB15k from Freebase [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], which are stored under
relation /common/topic/description. We refer to this dataset as FB15k-desc. The other
dataset is also from Freebase, but we make it much larger. In fact, we include all entities that have
descriptions in Freebase and remove triplets whose relations are in a filter set. Most relations in the filter
set are schema relations like /type/object/key. This dataset has more than 4M entities, which is
why we call it FB4M-desc. Statistics of the two datasets are presented in Table 2.
FB15k-desc is available at http://ml.cs.tsinghua.edu.cn/~jiaxin/fb15k desc.tar.gz,
and FB4M-desc at http://ml.cs.tsinghua.edu.cn/~jiaxin/fb4m desc.tar.gz.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Dataset statistics</title>
        <p>Table 2 summarizes the two datasets. FB15k-desc: 14951 entities, 1345 relations, description vocabulary 58954, description length 6435. FB4M-desc: 4629345 entities, 2651 relations, description vocabulary 1925116, description length 6617.</p>
        <p>Note that scale is not the only difference between these two datasets. They also differ in splitting
criteria. FB15k-desc follows FB15k's original partition into training, validation and test sets, in which
all entities in the validation and test sets are already seen in the training set. FB4M-desc goes the
contrary way, as it is designed to test the concept learning ability of our framework. All facts in its
validation and test sets include, on one side, an entity that is not seen in the training set. So when
evaluating on FB4M-desc, a good embedding for a new concept can only rely on information from
the natural language description and the knowledge transferred into the concept learning module.</p>
      </sec>
      <sec id="sec-4-6">
        <title>Link prediction</title>
        <p>
          We first describe the task of link prediction. Given a relation and an entity on one side, the task is
to predict the entity on the other side. This is a natural reasoning procedure that happens in our
thoughts all the time. Following previous work [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], we use the following evaluation protocol for this task.
For each test triplet (e1, r, e2), e1 is removed and replaced by each of the other entities in the training
set in turn. The neural embedding model gives scores for these corrupted triplets. The rank
of the correct entity is stored. We then report the mean of the predicted ranks on the test set as the left
mean rank. This procedure is repeated by corrupting e2, which yields the right mean rank. The
proportion of correct entities ranked in the top 10 is another metric, which we refer to as hits@10.
We test our link prediction performance on FB15k-desc and report it in Table 3. The type of concept
learning module we use here is CNN. Note that all the triplets in training, validation and test sets
of FB15k-desc are the same as FB15k, so we list TransE’s results on FB15k in the same table.
Compared to TransE, which cannot make use of the information in descriptions, our model performs
much better, in terms of both mean rank and hits@10. As stated in Section 4.1, all entities in the
test set of FB15k are contained in the training set, which, together with the results, shows that
our framework effectively regularizes the embedding model by forcing embeddings to reflect information
from natural language descriptions. We demonstrate the concept learning capability in the next
subsection.
        </p>
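The ranking protocol above can be sketched as follows; `score_fn` is a hypothetical dissimilarity function such as d(e1 + r, e2), and ranks here are 1-indexed as a convention choice.

```python
import numpy as np

def left_rank(score_fn, e1, r, e2, all_entities):
    """Rank of the correct left entity e1 among all candidate replacements.

    score_fn(h, r, t) returns a dissimilarity score (lower = more plausible),
    e.g. d(h + r, t) for translation-style embeddings.
    """
    true_score = score_fn(e1, r, e2)
    # Count corrupted triplets that score strictly better than the true one.
    better = sum(1 for e in all_entities if score_fn(e, r, e2) < true_score)
    return 1 + better

def mean_rank_and_hits10(ranks):
    """Mean rank and the proportion of ranks in the top 10 (hits@10)."""
    ranks = np.asarray(ranks)
    return ranks.mean(), float((ranks <= 10).mean())
```

Corrupting e2 instead of e1 gives the right mean rank in the same way.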
      </sec>
      <sec id="sec-4-7">
        <title>Concept learning capabilities</title>
        <p>It has been shown in Section 4.2 that our framework effectively regularizes the neural embedding model used for
memory storage. Next we use FB4M-desc to evaluate the capability of our framework to learn
new concepts and perform reasoning based on the learnt embeddings. We report the link prediction
performance on FB4M-desc in Table 4. Note that the test set contains millions of triplets, which
makes ranking-based evaluation very time-consuming. So we randomly sample 1k, 10k and 80k
triplets from the test set to report the evaluation statistics. We can see that CNN consistently
outperforms MLP in terms of both mean rank and hits@10. All the triplets in the test set of FB4M-desc
include, on one side, an entity unseen in the training set, requiring the model to understand natural
language descriptions and to do reasoning based on them. As far as we know, no traditional knowledge
base embedding model can compete with ours on this task, which again demonstrates the novelty of our
framework.</p>
      </sec>
      <sec id="sec-4-8">
        <title>Examples (Table 5)</title>
        <p>Left entity: Lily Burana. Description: "Lily Burana is an American
writer whose publications include the memoir I Love a Man in Uniform: A Memoir
of Love, War, and Other Battles, the novel Try and Strip ..." Hit@10 facts (partial):
&lt;/people/person/profession, writer&gt;; &lt;/people/person/profession, author&gt;;
&lt;/people/person/gender, female&gt;; &lt;/people/person/nationality, the United States&gt;.</p>
        <p>Left entity: Ajeyo. Description: "Ajeyo is a 2014 Assamese language
drama film directed by Jahnu Barua ... Ajeyo depicts the struggles of an honest,
ideal revolutionary youth Gajen Keot who fought against the social evils in rural
Assam during the freedom movement in India. The film won the Best Feature Film in
Assamese award in the 61st National Film Awards ..." Hit@10 facts (partial):
&lt;/film/film/country, India&gt;; &lt;/film/film/film festivals, Mumbai Film Festival&gt;;
&lt;/film/film/genre, Drama&gt;; &lt;/film/film/language, Assamese&gt;.</p>
        <p>Left entity: 4272 Entsuji. Description: "4272 Entsuji is a main-belt
asteroid discovered March 12, 1977 by Hiroki Kosai and Kiichiro Hurukawa at Kiso
Observatory." Hit@10 facts (partial): &lt;/astronomy/astronomical discovery/discoverer,
Kiichir Furukawa&gt;; &lt;/astronomy/celestial object/category, Asteroids&gt;;
&lt;/astronomy/star system body/star system, Solar System&gt;;
&lt;/astronomy/asteroid/member of asteroid group, Asteroid belt&gt;;
&lt;/astronomy/orbital relationship/orbits, Sun&gt;.</p>
        <p>Finally, we show some examples in Table 5 to illustrate our framework's capability of learning
concepts from natural language descriptions. From the first example, we can see that our framework
is able to infer &lt;Lily Burana, has profession, author&gt; from the sentence "Lily
Burana is an American writer." Doing this kind of reasoning requires a correct understanding of the
original sentence and the knowledge that writer and author are synonyms. In the third example, with
limited information in the description, the framework hits correct facts almost purely based on its
knowledge of astronomy, demonstrating the robustness of our approach.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and future work</title>
      <p>We present a novel perspective on knowledge base embeddings, which enables us to build a
framework with concept learning capabilities from a large-scale KB based on previous neural embedding
models. We evaluate our framework on two newly constructed datasets from Freebase, and the
results show that our framework effectively regularizes the neural embedding model to give superior
performance, while having the ability to learn new concepts and use the newly learnt embeddings for
semantic tasks (e.g., reasoning).</p>
      <p>Future work may include further improving the performance of learnt concept embeddings on
large-scale datasets like FB4M-desc. As for applications, we think this framework is very promising
for solving the problem of unknown entities in KB-powered dialog systems. The dialog system can ask
users for a description when meeting an unknown entity, which is a natural behavior even for humans
during conversations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Geoffrey E</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Li</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dong</given-names>
            <surname>Yu</surname>
          </string-name>
          , George E Dahl, Abdel-rahman
          <string-name>
            <surname>Mohamed</surname>
          </string-name>
          , Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen,
          <string-name>
            <surname>Tara N Sainath</surname>
          </string-name>
          , et al.
          <article-title>Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups</article-title>
          .
          <source>Signal Processing Magazine</source>
          , IEEE,
          <volume>29</volume>
          (
          <issue>6</issue>
          ):
          <fpage>82</fpage>
          -
          <lpage>97</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          Oriol Vinyals, and Quoc V Le
          .
          <article-title>Sequence to sequence learning with neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Richard</given-names>
            <surname>Socher</surname>
          </string-name>
          , Danqi Chen,
          <string-name>
            <surname>Christopher D Manning</surname>
            , and
            <given-names>Andrew</given-names>
          </string-name>
          <string-name>
            <surname>Ng</surname>
          </string-name>
          .
          <article-title>Reasoning with neural tensor networks for knowledge base completion</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>926</fpage>
          -
          <lpage>934</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Bordes</surname>
          </string-name>
          , Nicolas Usunier, Alberto Garcia-Duran,
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Oksana</given-names>
            <surname>Yakhnenko</surname>
          </string-name>
          .
          <article-title>Translating embeddings for modeling multi-relational data</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>2787</fpage>
          -
          <lpage>2795</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Bordes</surname>
          </string-name>
          , Jason Weston, Ronan Collobert, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Learning structured embeddings of knowledge bases</article-title>
          .
          <source>In Conference on Artificial Intelligence, number EPFL-CONF192344</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Joshua B</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          , Charles Kemp, Thomas L Griffiths, and
          <string-name>
            <given-names>Noah D</given-names>
            <surname>Goodman</surname>
          </string-name>
          .
          <article-title>How to grow a mind: Statistics, structure, and abstraction</article-title>
          . Science,
          <volume>331</volume>
          (
          <issue>6022</issue>
          ):
          <fpage>1279</fpage>
          -
          <lpage>1285</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Richardson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pedro</given-names>
            <surname>Domingos</surname>
          </string-name>
          .
          <article-title>Markov logic networks</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>62</volume>
          (
          <issue>1- 2</issue>
          ):
          <fpage>107</fpage>
          -
          <lpage>136</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Yehuda</given-names>
            <surname>Koren</surname>
          </string-name>
          , Robert Bell, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Volinsky</surname>
          </string-name>
          .
          <article-title>Matrix factorization techniques for recommender systems</article-title>
          .
          <source>Computer</source>
          , (
          <volume>8</volume>
          ):
          <fpage>30</fpage>
          -
          <lpage>37</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Maximilian</given-names>
            <surname>Nickel</surname>
          </string-name>
          , Volker Tresp, and
          <string-name>
            <given-names>Hans-Peter</given-names>
            <surname>Kriegel</surname>
          </string-name>
          .
          <article-title>A three-way model for collective learning on multi-relational data</article-title>
          .
          <source>In Proceedings of the 28th international conference on machine learning (ICML-11)</source>
          , pages
          <fpage>809</fpage>
          -
          <lpage>816</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Maximilian</given-names>
            <surname>Nickel</surname>
          </string-name>
          , Volker Tresp, and
          <string-name>
            <given-names>Hans-Peter</given-names>
            <surname>Kriegel</surname>
          </string-name>
          .
          <article-title>Factorizing YAGO: scalable machine learning for linked data</article-title>
          .
          <source>In Proceedings of the 21st international conference on World Wide Web</source>
          , pages
          <fpage>271</fpage>
          -
          <lpage>280</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Charles</given-names>
            <surname>Kemp</surname>
          </string-name>
          , Joshua B Tenenbaum, Thomas L Griffiths,
          <string-name>
            <given-names>Takeshi</given-names>
            <surname>Yamada</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Naonori</given-names>
            <surname>Ueda</surname>
          </string-name>
          .
          <article-title>Learning systems of concepts with an infinite relational model</article-title>
          .
          <source>In AAAI</source>
          , volume
          <volume>3</volume>
          , page
          <fpage>5</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Kurt</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael I</given-names>
            <surname>Jordan</surname>
          </string-name>
          , and Thomas L Griffiths.
          <article-title>Nonparametric latent feature models for link prediction</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>1276</fpage>
          -
          <lpage>1284</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Jun</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Max-margin nonparametric latent feature models for link prediction</article-title>
          .
          <source>In Proceedings of the 29th International Conference on Machine Learning (ICML-12)</source>
          , pages
          <fpage>719</fpage>
          -
          <lpage>726</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , Joshua B Tenenbaum, and
          <string-name>
            <given-names>Ruslan R</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <article-title>Modelling relational data using Bayesian clustered tensor factorization</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>1821</fpage>
          -
          <lpage>1828</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Bordes</surname>
          </string-name>
          , Xavier Glorot, Jason Weston, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>A semantic matching energy function for learning with multi-relational data</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>94</volume>
          (
          <issue>2</issue>
          ):
          <fpage>233</fpage>
          -
          <lpage>259</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Zhen</given-names>
            <surname>Wang</surname>
          </string-name>
          , Jianwen Zhang, Jianlin Feng, and
          <string-name>
            <given-names>Zheng</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Knowledge graph and text jointly embedding</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>1591</fpage>
          -
          <lpage>1601</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>James L</given-names>
            <surname>McClelland</surname>
          </string-name>
          and
          <string-name>
            <given-names>Timothy T</given-names>
            <surname>Rogers</surname>
          </string-name>
          .
          <article-title>The parallel distributed processing approach to semantic cognition</article-title>
          .
          <source>Nature Reviews Neuroscience</source>
          ,
          <volume>4</volume>
          (
          <issue>4</issue>
          ):
          <fpage>310</fpage>
          -
          <lpage>322</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , James Martens, George Dahl, and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>On the importance of initialization and momentum in deep learning</article-title>
          .
          <source>In Proceedings of the 30th international conference on machine learning (ICML-13)</source>
          , pages
          <fpage>1139</fpage>
          -
          <lpage>1147</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Kurt</given-names>
            <surname>Bollacker</surname>
          </string-name>
          , Colin Evans, Praveen Paritosh, Tim Sturge, and
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>Freebase: a collaboratively created graph database for structuring human knowledge</article-title>
          .
          <source>In Proceedings of the 2008 ACM SIGMOD international conference on Management of data</source>
          , pages
          <fpage>1247</fpage>
          -
          <lpage>1250</lpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>