<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analogy-based Reasoning With Memory Networks for Future Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Andrade</string-name>
          <email>s-andrade@cj.jp.nec.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ramkumar Rajendran</string-name>
          <email>ramkumar.rajendran@vanderbilt.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bing Bai</string-name>
          <email>bbai@nec-labs.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yotaro Watanabe</string-name>
          <email>y-watanabe@fe.jp.nec.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Vanderbilt University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Data Science Research Laboratories, NEC Corporation</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Machine Learning</institution>
          ,
          <addr-line>NEC Laboratories America</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Making predictions about what might happen in the future is important for reacting adequately in many situations. For example, observing that “Man kidnaps girl” may have the consequence that “Man kills girl”. While this is part of common sense reasoning for humans, it is not obvious how machines can learn and generalize over such knowledge automatically. The order of events' textual occurrence in documents offers a clue for acquiring such knowledge automatically. Here, we explore another clue, namely, logical and temporal relations of verbs from lexical resources. We argue that it is possible to generalize to unseen events by using the entailment relation between two events expressed as (subject, verb, object) triples. We formulate our hypotheses of analogy-based reasoning for future prediction, and propose a memory network that incorporates our hypotheses. Our evaluation for predicting the next future event shows that the proposed model can be competitive with (deep) neural networks and rankSVM, while giving interpretable answers.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Making predictions about what might happen in the future is important for reacting adequately in
many situations. For example, observing that “Man kidnaps girl” may have the consequence that
“Man kills girl”. While this is part of common sense reasoning for humans, it is not obvious how
machines can learn and generalize over such knowledge automatically.</p>
      <p>
        One might think of learning such knowledge from massive amounts of text data, such as news corpora.
However, detecting temporal relations between events is still a difficult problem. Events are often
presented in text in an order that differs from their temporal order. Although the problem can be partially addressed
by using temporal markers like “afterwards”, particularly with discourse parsers [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], overall, it
remains a challenge.
      </p>
      <p>
In this work, we propose to exploit the distinction between logical relations and temporal relations.
We note that if an entailment relation holds between two events, then the second event is unlikely to be
a new future event.<sup>4</sup> For example, the phrase “man kissed woman” entails that “man met woman”,
where “man met woman” happens before (not after) “man kissed woman”. To find such entailments,
we can leverage the relations between verbs in WordNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Verbs that tend to be in a temporal (happens-before)
relation have been extracted on a large scale and are openly available in VerbOcean [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For example,
we observe that (subject, buy, object) tends to temporally precede (subject, use, object).
      </p>
      <p>
We present a model that can predict future events given a current event triplet (subject, verb, object).
To make the model generalizable to unseen events, we adopt a deep learning structure such that the
semantics of unseen events can be learned through word/event embeddings. We present a novel
Memory Comparison Network (MCN) that can learn to compare and combine the similarity of input
events to the event relations saved in memory. Our evaluation shows that this method is competitive
with other (deep) neural networks and rankSVM [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], while giving interpretable answers.
In the first part of this work, in Section 2, we describe previous work related to future prediction. In
Section 3, we discuss some connections between logical and temporal relations, and explain how we
use lexical resources to create a knowledge base of positive and negative temporal relations. This
knowledge base is then used by our experiments in the second part of our work.
      </p>
      <p>In the second part, in Section 4, we formulate our assumptions of analogy-based reasoning for future
prediction. Based on these assumptions, we propose our new method MCN. In Section 5, we
describe several other methods that were previously proposed for future prediction, and ranking
models that can be easily adapted to this task. In Section 6, we evaluate all methods on a future
prediction task that requires reasoning about unseen events. Finally, in Sections 7 and 8, we discuss
some current limitations of our proposed method, and summarize our conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        One line of research, pioneered by VerbOcean [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], extracts happens-before relations from large
collections of texts using bootstrapping methods. In the context of script learning, corpus statistics,
such as event bi-grams, are used to define a probability distribution over possible next events
[
        <xref ref-type="bibr" rid="ref13 ref3">13, 3</xref>
        ]. However, such models cannot generalize to situations of new events that have not been
observed before. Therefore, the more recent methods proposed in [
        <xref ref-type="bibr" rid="ref11 ref15 ref6">11, 15, 6</xref>
        ] are based on word
embeddings. Script learning is traditionally evaluated on small prototypical sequences that were
manually created, or on event sequences that were automatically extracted from text. Due to the lack
of training data, these models cannot learn that some events later in the text are
actually entailed by events previously mentioned, i.e. already known events and new events are not
distinguished.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Exploiting lexical resources</title>
      <p>
        Our main focus is on distinguishing future events from other events. In texts, like news stories, an
event <inline-formula><tex-math>e_l</tex-math></inline-formula> is more likely to have happened before event <inline-formula><tex-math>e_r</tex-math></inline-formula> (temporal order) if <inline-formula><tex-math>e_l</tex-math></inline-formula> occurs earlier in
the text than <inline-formula><tex-math>e_r</tex-math></inline-formula> (textual order). However, there are also many situations where this is not the case:
re-phrasing, introducing background knowledge, conclusions, etc. One obvious solution is to use discourse
parsers. However, without explicit temporal markers, they suffer from low recall [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], and therefore
in practice most script-learning systems use textual order as a proxy for temporal order. Here we
explore whether common knowledge can help to improve future detection from event sequences in
textual order.
      </p>
      <p>We assume common knowledge is given in the form of simple relations (or rules) like
<disp-formula><tex-math>(\text{company}, \text{buy}, \text{share}) \rightarrow (\text{company}, \text{use}, \text{share}) ,</tex-math></disp-formula>
where “→” denotes the temporal happens-before relation. In contrast, we denote the logical entailment
(implication) relation by “⇒”.</p>
      <p>
        To extract such common knowledge rules we explore the use of the lexical resources WordNet and
VerbOcean. As also partly mentioned in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], logical and temporal relations are not independent, but
an interesting overlap exists, as illustrated in Figure 1, with corresponding examples shown in Table 1.
We emphasize that, for temporal relations, the situation is not always as clear-cut as shown in Figure 1
(e.g. repeated actions). Nevertheless, event relations tend to belong mostly to only
one relation. In particular, in the following, we consider “wrong” happens-before relations as less
likely to be true than “correct” happens-before relations.
      </p>
      <p><sup>4</sup> We consider here entailment and (logical) implication as equivalent. In particular, synonyms are considered
to be in an entailment relation, in contrast to the classification by WordNet.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Examples corresponding to the relations illustrated in Figure 1.</p></caption>
        <table>
          <tbody>
            <tr><td>(1)</td><td>“minister leaves factory”, “minister enters factory”</td></tr>
            <tr><td>(2)</td><td>“company donates money”, “company gives money”</td></tr>
            <tr><td>(3)</td><td>“John starts marathon”, “John finishes marathon”</td></tr>
            <tr><td>(4)</td><td>“governor kisses girlfriend”, “governor meets girlfriend”</td></tr>
            <tr><td>(5)</td><td>“people buy apple”, “people use apple”</td></tr>
            <tr><td>(6)</td><td>“minister likes criticism”, “minister hates criticism”</td></tr>
            <tr><td>(7)</td><td>“X’s share falls 10%”, “X’s share rises 10%”</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>[Figure 1: Overlap between the relations contradiction, happens-after, happens-before, and entailment. The numbers (1) to (7) in the figure mark where the examples of Table 1 fall.]</p>
      <sec id="sec-3-1">
        <title>Data creation</title>
        <p>
          For simplicity, we restrict our investigation here to events of the form (subject, verb, object). All events
are extracted from around 790k news articles in Reuters [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. We preprocessed the English Reuters
articles using the Stanford dependency parser and co-reference resolution [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. We lemmatized all
words, and for subjects and objects we considered only the head words, and ignored words like
WH-pronouns.
        </p>
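        <p>All relations are defined between two events of the form <inline-formula><tex-math>(S, V_l, O)</tex-math></inline-formula> and <inline-formula><tex-math>(S, V_r, O)</tex-math></inline-formula>, where subject <inline-formula><tex-math>S</tex-math></inline-formula>
and object <inline-formula><tex-math>O</tex-math></inline-formula> are the same. As candidates we consider only events in sequence (occurrence in text).</p>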
      </sec>
      <sec id="sec-3-2">
        <title>Positive Samples</title>
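        <p>We extract positive samples of the form <inline-formula><tex-math>(S, V_l, O) \rightarrow (S, V_r^{\mathrm{pos}}, O)</tex-math></inline-formula> if both of the following conditions hold:</p>
        <list list-type="order">
          <list-item><p><inline-formula><tex-math>V_l \rightarrow V_r^{\mathrm{pos}}</tex-math></inline-formula> is listed in VerbOcean as a happens-before relation.</p></list-item>
          <list-item><p><inline-formula><tex-math>\neg [V_l \Rightarrow V_r^{\mathrm{pos}}]</tex-math></inline-formula> according to WordNet. That means, for example, if <inline-formula><tex-math>(S, V_r, O)</tex-math></inline-formula> is paraphrasing
<inline-formula><tex-math>(S, V_l, O)</tex-math></inline-formula>, then this is not considered a temporal relation.</p></list-item>
        </list>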
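        <p>This way, we were able to extract 1699 positive samples. Examples are shown in Table 2.</p>
        <p><bold>Negative Samples</bold> Using VerbOcean, we extracted negative samples of the form <inline-formula><tex-math>(S, V_l, O) \nrightarrow (S, V_r^{\mathrm{neg}}, O)</tex-math></inline-formula>,
i.e. the event on the left-hand side, <inline-formula><tex-math>(S, V_l, O)</tex-math></inline-formula>, is the same as for a positive sample.<sup>5</sup> This way,
we extracted 1177 negative samples.</p>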
        <p><sup>5</sup> If <inline-formula><tex-math>(S, V_l, O) \nrightarrow (S, V_r^{\mathrm{neg}}, O)</tex-math></inline-formula>, then <inline-formula><tex-math>V_l \rightarrow V_r^{\mathrm{neg}}</tex-math></inline-formula> is not listed in VerbOcean.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>Examples of extracted positive samples (happens-before relations).</p></caption>
          <table>
            <tbody>
              <tr><td>(company, buy, share) → (company, use, share)</td></tr>
              <tr><td>(ex-husband, stalk, her) → (ex-husband, kill, her)</td></tr>
              <tr><td>(farmer, plant, acre) → (farmer, harvest, acre)</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>There are several reasons why a pair of events may not be in a temporal relation. Using VerbOcean and
WordNet we analyzed the negative samples, and found that the majority (1030 relations) could not be
classified with either VerbOcean or WordNet. We estimated conservatively that around 27% of these
relations are false negatives: for a subset of 100 relations, we labeled a sample as a false negative if
it can have an interpretation as a happens-before relation.<sup>6</sup></p>
        <p>To simplify the task, we created a balanced data set by pairing all positive and negative samples: each
sample pair contains one positive and one negative sample, and the task is to find that the positive
sample is more likely to be a happens-before relation than the negative sample. The resulting data set
contains in total 1765 pairs.</p>
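        <p>The pairing step can be sketched as follows. This is a minimal illustration in Python, assuming that every positive sample is paired with every negative sample that shares its left event; the exact pairing rule is not spelled out above, and the function name make_pairs and the tuple layout are hypothetical. The two sample relations are taken from Tables 2 and 4.</p>
        <preformat>
```python
from collections import defaultdict

def make_pairs(positives, negatives):
    """Pair positive and negative samples that share the left event.

    Each sample is a tuple (subject, left_verb, object, right_verb).
    Assumption: all combinations with an identical left event are paired.
    """
    neg_by_left = defaultdict(list)
    for s, vl, o, vr in negatives:
        neg_by_left[(s, vl, o)].append(vr)
    pairs = []
    for s, vl, o, vr_pos in positives:
        for vr_neg in neg_by_left.get((s, vl, o), []):
            pairs.append((s, vl, o, vr_pos, vr_neg))
    return pairs

positives = [("farmer", "plant", "acre", "harvest"),
             ("company", "buy", "share", "use")]
negatives = [("farmer", "plant", "acre", "seed")]

print(make_pairs(positives, negatives))
```
        </preformat>
        <p>In this toy input, only the first positive sample has a negative counterpart with the same left event, so a single pair is produced.</p>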
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Analogy-based reasoning for happens-before relation scoring</title>
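      <p>In the following, let <inline-formula><tex-math>r</tex-math></inline-formula> be a happens-before relation of the form:</p>
      <p><disp-formula><tex-math>r : e_l \rightarrow e_r ,</tex-math></disp-formula>
where <inline-formula><tex-math>e_l</tex-math></inline-formula> and <inline-formula><tex-math>e_r</tex-math></inline-formula> are two events of the form <inline-formula><tex-math>(S, V_l, O)</tex-math></inline-formula> and <inline-formula><tex-math>(S, V_r, O)</tex-math></inline-formula>, respectively. Furthermore, let
<inline-formula><tex-math>e'</tex-math></inline-formula> be any event of the form <inline-formula><tex-math>(S', V', O')</tex-math></inline-formula>.</p>
      <p>Our working hypothesis consists of the following two claims:</p>
      <p>(I) If <inline-formula><tex-math>(e' \Rightarrow e_l) \wedge (e_l \rightarrow e_r)</tex-math></inline-formula>, then <inline-formula><tex-math>e' \rightarrow e_r</tex-math></inline-formula>.</p>
      <p>(II) If <inline-formula><tex-math>(e' \Rightarrow e_r) \wedge (e_l \rightarrow e_r)</tex-math></inline-formula>, then <inline-formula><tex-math>e_l \rightarrow e'</tex-math></inline-formula>.</p>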
      <sec id="sec-4-1">
        <title>For example, consider</title>
        <p>“John buys computer” ⇒ “John acquires computer”,
“John acquires computer” → “John uses computer”.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Using (I), we can reason that:</title>
        <p>“John buys computer” → “John uses computer”.</p>
        <p>We note that, in some cases, “⇒” in (I) and (II) cannot be replaced by “⇐”. This is illustrated by
the following example:
“John knows Sara” ⇐ “John marries Sara”,
“John marries Sara” → “John divorces from Sara”.</p>
        <p>However, the next statement is considered wrong (or less likely to be true):
“John knows Sara” → “John divorces from Sara”.</p>
        <p>In practice, using word embeddings, it can be difficult to distinguish between “⇒” and “⇐”.
Therefore, our proposed method uses the following simplified assumptions:</p>
        <p>(I*) If <inline-formula><tex-math>(e' \sim e_l) \wedge (e_l \rightarrow e_r)</tex-math></inline-formula>, then <inline-formula><tex-math>e' \rightarrow e_r</tex-math></inline-formula>.</p>
        <p>(II*) If <inline-formula><tex-math>(e' \sim e_r) \wedge (e_l \rightarrow e_r)</tex-math></inline-formula>, then <inline-formula><tex-math>e_l \rightarrow e'</tex-math></inline-formula>.</p>
        <p>Here, “∼” denotes some similarity that can be measured by means of word embeddings.</p>
        <p><sup>6</sup> Therefore, this over-estimates the number of false negatives. This is because it also counts a happens-before
relation that is less likely than a happens-after relation as a false negative.</p>
        <sec id="sec-4-2-1">
          <title>Memory Comparison Network</title>
          <p>We propose a memory-based network model that uses the assumptions (I*) and (II*). It bases its
decision on one (or more) training samples that are similar to a test sample. In contrast to other
methods like neural networks for script learning, and (non-linear) SVM ranking models, it has
the advantage of giving an explanation of why a relation is considered (or not considered) as a
happens-before relation.</p>
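          <p>In the following, let <inline-formula><tex-math>r_1</tex-math></inline-formula> and <inline-formula><tex-math>r_2</tex-math></inline-formula> be two happens-before relations of the form:
<disp-formula><tex-math>r_1 : (S_1, V_{l_1}, O_1) \rightarrow (S_1, V_{r_1}, O_1) ,</tex-math></disp-formula>
<disp-formula><tex-math>r_2 : (S_2, V_{l_2}, O_2) \rightarrow (S_2, V_{r_2}, O_2) .</tex-math></disp-formula>
Let <inline-formula><tex-math>\mathbf{x}_{s_i}, \mathbf{x}_{v_{l_i}}, \mathbf{x}_{v_{r_i}}</tex-math></inline-formula> and <inline-formula><tex-math>\mathbf{x}_{o_i} \in \mathbb{R}^d</tex-math></inline-formula> denote the word embeddings corresponding to <inline-formula><tex-math>S_i, V_{l_i}, V_{r_i}</tex-math></inline-formula> and <inline-formula><tex-math>O_i</tex-math></inline-formula>.<sup>7</sup>
We define the similarity between two relations <inline-formula><tex-math>r_1</tex-math></inline-formula> and <inline-formula><tex-math>r_2</tex-math></inline-formula> as:</p>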
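          <p><disp-formula id="eq1"><label>(1)</label><tex-math>\mathrm{sim}_\theta(r_1, r_2) = g_\theta(\mathbf{x}_{v_{l_1}}^T \mathbf{x}_{v_{l_2}}) + g_\theta(\mathbf{x}_{v_{r_1}}^T \mathbf{x}_{v_{r_2}}) ,</tex-math></disp-formula>
where <inline-formula><tex-math>g_\theta</tex-math></inline-formula> is an artificial neuron with <inline-formula><tex-math>\theta = \{\sigma, \beta\}</tex-math></inline-formula>, a scale <inline-formula><tex-math>\sigma \in \mathbb{R}</tex-math></inline-formula>, and a bias <inline-formula><tex-math>\beta \in \mathbb{R}</tex-math></inline-formula> parameter, followed
by a non-linearity. As non-linearity we use the sigmoid function. Furthermore, here we assume that
all word embeddings are <inline-formula><tex-math>\ell_2</tex-math></inline-formula>-normalized.</p>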
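          <p>The relation similarity can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in our experiments: it assumes that the neuron applies the scale and bias to the inner product before the sigmoid, and the 2-dimensional vectors below are hypothetical toy values rather than real word embeddings.</p>
          <preformat>
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def l2_normalize(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def neuron(z, scale, bias):
    # Assumed form of the neuron: scale and bias applied to z, then sigmoid.
    return sigmoid(scale * z + bias)

def relation_similarity(r1, r2, scale, bias):
    # Each relation is a pair (left verb embedding, right verb embedding).
    (vl1, vr1), (vl2, vr2) = r1, r2
    return (neuron(dot(vl1, vl2), scale, bias)
            + neuron(dot(vr1, vr2), scale, bias))

# Hypothetical toy embeddings, l2-normalized as assumed in the text.
start = l2_normalize([1.0, 0.2])
finish = l2_normalize([0.1, 1.0])
begin = l2_normalize([0.9, 0.3])
end = l2_normalize([0.2, 0.9])

print(relation_similarity((start, finish), (begin, end), scale=1.0, bias=0.0))
```
          </preformat>
          <p>Because only the verb embeddings enter the similarity, subject and object do not influence the result in this sketch.</p>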
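          <p>Given the input relation <inline-formula><tex-math>r : e_l \rightarrow e_r</tex-math></inline-formula>, we test whether the relation is correct or wrong as follows. Let
<inline-formula><tex-math>n_{\mathrm{pos}}</tex-math></inline-formula> and <inline-formula><tex-math>n_{\mathrm{neg}}</tex-math></inline-formula> denote the number of positive and negative training samples, respectively. First, we
compare <inline-formula><tex-math>r</tex-math></inline-formula> to all positive and negative training relations in the training data set, and denote the resulting
vectors as <inline-formula><tex-math>\mathbf{u}^{\mathrm{pos}} \in \mathbb{R}^{n_{\mathrm{pos}}}</tex-math></inline-formula> and <inline-formula><tex-math>\mathbf{u}^{\mathrm{neg}} \in \mathbb{R}^{n_{\mathrm{neg}}}</tex-math></inline-formula>, respectively. Formally,
<disp-formula><tex-math>u_t^{\mathrm{pos}} = \mathrm{sim}_\theta(r, r_t^{\mathrm{pos}}) \quad \text{and} \quad u_t^{\mathrm{neg}} = \mathrm{sim}_\theta(r, r_t^{\mathrm{neg}}) ,</tex-math></disp-formula>
where <inline-formula><tex-math>r_t^{\mathrm{pos}}</tex-math></inline-formula> and <inline-formula><tex-math>r_t^{\mathrm{neg}}</tex-math></inline-formula> denote the <inline-formula><tex-math>t</tex-math></inline-formula>-th positive and negative training sample, respectively.</p>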
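          <p>Next, we define the score that <inline-formula><tex-math>r</tex-math></inline-formula> is correct/wrong as the weighted average of the relation similarities:
<disp-formula id="eq2"><label>(2)</label><tex-math>o^{\mathrm{pos}} = \mathrm{softmax}_\gamma(\mathbf{u}^{\mathrm{pos}})^T \mathbf{u}^{\mathrm{pos}} \quad \text{and} \quad o^{\mathrm{neg}} = \mathrm{softmax}_\gamma(\mathbf{u}^{\mathrm{neg}})^T \mathbf{u}^{\mathrm{neg}} ,</tex-math></disp-formula>
where <inline-formula><tex-math>\mathrm{softmax}_\gamma(\mathbf{u})</tex-math></inline-formula> returns a column vector with the <inline-formula><tex-math>t</tex-math></inline-formula>-th output defined as
<disp-formula><tex-math>\mathrm{softmax}_\gamma(\mathbf{u})_t = \frac{e^{\gamma u_t}}{\sum_i e^{\gamma u_i}} ,</tex-math></disp-formula>
and <inline-formula><tex-math>\gamma \in \mathbb{R}</tex-math></inline-formula> is a weighting parameter. Note that for <inline-formula><tex-math>\gamma \rightarrow \infty</tex-math></inline-formula>, the weighted average <inline-formula><tex-math>o</tex-math></inline-formula> equals <inline-formula><tex-math>\max(\mathbf{u})</tex-math></inline-formula>, and for
<inline-formula><tex-math>\gamma = 0</tex-math></inline-formula>, <inline-formula><tex-math>o</tex-math></inline-formula> is the average of <inline-formula><tex-math>\mathbf{u}</tex-math></inline-formula>.</p>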
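          <p>The two limiting cases of the weighted average can be checked numerically. The following Python sketch is illustrative only; the function names and the similarity values in u are hypothetical.</p>
          <preformat>
```python
import math

def softmax(u, gamma):
    z = [gamma * x for x in u]
    m = max(z)  # subtracted for numerical stability; the result is unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_average(u, gamma):
    # Inner product of the softmax weights with the similarities themselves.
    return sum(w * x for w, x in zip(softmax(u, gamma), u))

u = [0.2, 0.9, 0.5]  # hypothetical similarities to three training relations
for gamma in (0.0, 5.0, 500.0):
    print(gamma, weighted_average(u, gamma))
```
          </preformat>
          <p>With the weighting parameter set to zero the result is the plain mean of u, and for a very large value it approaches the largest entry of u, so the parameter interpolates between averaging over the memory and nearest-neighbour behaviour.</p>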
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Finally, we define the happens-before score for r as</title>
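        <p><disp-formula id="eq3"><label>(3)</label><tex-math>l(e_l, e_r) = o^{\mathrm{pos}}(e_l, e_r) - o^{\mathrm{neg}}(e_l, e_r) .</tex-math></disp-formula>
The score <inline-formula><tex-math>l(e_l, e_r)</tex-math></inline-formula> can be considered as an unnormalized log probability that relation <inline-formula><tex-math>r</tex-math></inline-formula> is a
happens-before relation. The basic components of the network are illustrated in Figure 2.
For optimizing the parameters of our model we minimize the rank margin loss:</p>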
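        <p><disp-formula id="eq4"><label>(4)</label><tex-math>L(r^{\mathrm{pos}}, r^{\mathrm{neg}}) = \max\{0, 1 - l(e_l, e_r^{\mathrm{pos}}) + l(e_l, e_r^{\mathrm{neg}})\} ,</tex-math></disp-formula>
where <inline-formula><tex-math>r^{\mathrm{pos}} : e_l \rightarrow e_r^{\mathrm{pos}}</tex-math></inline-formula> and <inline-formula><tex-math>r^{\mathrm{neg}} : e_l \rightarrow e_r^{\mathrm{neg}}</tex-math></inline-formula> are positive and negative samples from the held-out
training data. All parameters of the models are trained using stochastic gradient descent (SGD). Word
embeddings (<inline-formula><tex-math>\mathbf{x}_s, \mathbf{x}_v</tex-math></inline-formula>, and <inline-formula><tex-math>\mathbf{x}_o</tex-math></inline-formula>) are kept fixed during training.</p>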
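        <p>As a small worked example of the score and the rank margin loss, consider the following Python sketch. The function names and the numeric values are hypothetical and only illustrate when the loss is zero and when it is positive.</p>
        <preformat>
```python
def happens_before_score(o_pos, o_neg):
    # Difference between the weighted similarity to positive relations
    # and the weighted similarity to negative relations.
    return o_pos - o_neg

def rank_margin_loss(score_pos, score_neg, margin=1.0):
    # Zero once the positive sample outscores the negative one by the margin.
    return max(0.0, margin - score_pos + score_neg)

# Hypothetical weighted similarities for one positive and one negative sample.
score_pos = happens_before_score(o_pos=1.6, o_neg=0.7)
score_neg = happens_before_score(o_pos=0.8, o_neg=1.4)

print(rank_margin_loss(score_pos, score_neg))  # margin satisfied
print(rank_margin_loss(0.2, 0.1))              # margin violated
```
        </preformat>
        <p>In the first call the positive sample outscores the negative sample by more than the margin, so it contributes no loss; in the second call the scores are too close and the loss is positive.</p>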
        <p>
          Our model can be interpreted as an instance of the Memory Networks proposed in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Using
the notation from [
          <xref ref-type="bibr" rid="ref17">17</xref>
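          ], <inline-formula><tex-math>I(\cdot)</tex-math></inline-formula> corresponds to the word embedding lookup, <inline-formula><tex-math>G(\cdot)</tex-math></inline-formula> saves all training
samples into the memory, the <inline-formula><tex-math>O(\cdot)</tex-math></inline-formula> function corresponds to <inline-formula><tex-math>(o^{\mathrm{pos}}, o^{\mathrm{neg}})</tex-math></inline-formula>, and the output of <inline-formula><tex-math>R(\cdot)</tex-math></inline-formula> equals
Equation (3).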
        </p>
        <p>
          Our model also has similarity to the memory-based reasoning system proposed in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], with two
differences. First, we use here a trainable similarity measure, see Equation (1), rather than a fixed
distance measure. Second, we use the trainable <inline-formula><tex-math>\mathrm{softmax}_\gamma</tex-math></inline-formula> rather than max.
        </p>
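        <p><sup>7</sup> Remark about our notation: we use bold fonts, like <inline-formula><tex-math>\mathbf{v}</tex-math></inline-formula>, to denote a column vector; <inline-formula><tex-math>\mathbf{v}^T</tex-math></inline-formula> to denote the transpose,
and <inline-formula><tex-math>v_t</tex-math></inline-formula> to denote the <inline-formula><tex-math>t</tex-math></inline-formula>-th dimension of <inline-formula><tex-math>\mathbf{v}</tex-math></inline-formula>.</p>
        <p>[Figure 2: Basic components of the Memory Comparison Network. The input (“Sara starts Tokyo-Marathon”, “Sara withdraws from Tokyo-Marathon”) is compared, via the relation similarity, to the relations in memory, for example (“John starts marathon”, “John finishes marathon”) and (“John starts marathon”, “John attends marathon”). The similarities to positive relations and the similarities to negative relations are each combined by a weighted average, and the score is the difference of the two.]</p>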
        <p>We also investigate several other models that can be applied for ranking temporal relations. All
models that we consider are based on word embeddings in order to be able to generalize to unseen
events.</p>
        <p>
          Our first model is based on the bilinear model proposed in [
          <xref ref-type="bibr" rid="ref1">1</xref>
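          ] for document retrieval, with scoring
function <inline-formula><tex-math>l(e_l, e_r) = \mathbf{z}_l^T M \mathbf{z}_r</tex-math></inline-formula>, where <inline-formula><tex-math>\mathbf{z}_l</tex-math></inline-formula> and <inline-formula><tex-math>\mathbf{z}_r</tex-math></inline-formula> are the concatenated word embeddings <inline-formula><tex-math>(\mathbf{x}_s, \mathbf{x}_{v_l}, \mathbf{x}_o)</tex-math></inline-formula> and
<inline-formula><tex-math>(\mathbf{x}_s, \mathbf{x}_{v_r}, \mathbf{x}_o)</tex-math></inline-formula>, respectively, and <inline-formula><tex-math>M \in \mathbb{R}^{3d \times 3d}</tex-math></inline-formula> is a parameter matrix. We denote this model as Bai2009.
We also test three neural network architectures that were proposed in different contexts. The model
in [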
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], originally proposed for semantic parsing, is a three-layer network that can learn non-linear
combinations of (subject, verb) and (verb, object) pairs. The non-linearity is achieved by the
Hadamard product of the hidden layers. The original network can handle only events (relations
between verbs and objects), but not relations between events. We recursively extend the model to
handle relations between events. We denote the model as Bordes2012.
        </p>
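        <p>For concreteness, the bilinear scoring function of the Bai2009 baseline can be sketched in Python as follows. This is an illustrative sketch with hypothetical 1-dimensional embeddings and an identity parameter matrix; in the actual model the matrix is learned.</p>
        <preformat>
```python
def concat(*vectors):
    # Concatenate the subject, verb and object embeddings into one vector.
    return [v for vec in vectors for v in vec]

def bilinear_score(z_l, M, z_r):
    # Computes z_l^T M z_r, with M given as a list of rows.
    Mz = [sum(m * z for m, z in zip(row, z_r)) for row in M]
    return sum(a * b for a, b in zip(z_l, Mz))

# Hypothetical embeddings with d = 1, so the parameter matrix is 3 x 3.
x_s, x_vl, x_vr, x_o = [1.0], [2.0], [3.0], [4.0]
z_l = concat(x_s, x_vl, x_o)
z_r = concat(x_s, x_vr, x_o)
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

print(bilinear_score(z_l, identity, z_r))
```
        </preformat>
        <p>With the identity matrix the score reduces to the inner product of the two concatenated vectors; off-diagonal entries of the matrix let the model relate, for example, the left verb to the right verb.</p>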
        <p>
          In the context of script learning, recently two neural networks have been proposed for detecting
happens-before relations. The model proposed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] (here denoted Modi2014) learns event
embeddings parameterized with verb and context (subject or object) dependent embedding matrices.
The event embeddings are then mapped to a score that indicates the position in time. To score a relation
between events, we use the dot-product between the two events’ embeddings.<sup>8</sup> The model in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
suggests a deeper architecture than [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Their model (denoted here Granroth2016) additionally uses
two non-linear layers for combining the left and right events. All neural networks and the Bai2009
model were trained in the same way as the proposed method, i.e. optimized with respect to the rank
margin loss in Equation (4).<sup>9</sup> For all methods, we kept the word embeddings fixed (i.e. no
training), since this improved performance in general.
        </p>
        <p>
          Our final two models use the rankSVM Algorithm proposed in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] with the implementation from
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. We tested both a linear and an RBF kernel, with the hyper-parameters optimized via grid search. To
represent a sample, we concatenate the embeddings of all words in the relation.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>We split the data set into training (around 50%), validation (around 25%), and test (around 25%)
sets. Due to the relatively small size of the data, we repeated each experiment 10 times for different
random splits (training/validation/test).</p>
      <p><sup>8</sup> We also tried two variations: left and right events with different and same parameterization. However, the
results did not change significantly.</p>
      <p>
        <sup>9</sup> Originally, the model in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] was optimized with respect to negative log-likelihood; however, in our setting
we found that rank-margin loss performed better.
      </p>
      <p>
For the bilinear model and all neural networks, we performed up to 2000 epochs, and used early
stopping with respect to the validation set. Some models were quite sensitive to the choice of the
learning rate, so we tested 0.00001, 0.0001, and 0.001, and report the best results on the validation
set.
      </p>
      <p>For our proposed method, we set the learning rate constant to 0.001. Furthermore, we note that our
proposed method requires two types of training data: one type that is kept in memory, and
another type that is used for learning the parameters. For the former and latter we used the training and
validation fold, respectively. As initial parameters for this non-convex optimization problem we set
<inline-formula><tex-math>\sigma = 1.0, \beta = 0.5, \gamma = 5.0</tex-math></inline-formula>, which were selected via the validation set.</p>
      <p>For testing, we consider the challenging scenario, where the left event of the sample contains a verb
that is not contained in the training set (and also not in the validation set).</p>
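      <p>One way to obtain such a split is to partition the data by the verb of the left event. The Python sketch below is only an illustration under stated assumptions: the function name split_by_left_verb, the tuple layout, and the choice to apply the 50/25/25 proportions to verbs rather than to samples are hypothetical, and the sketch also keeps validation verbs disjoint from training verbs, which is stricter than what is required above.</p>
      <preformat>
```python
import random

def split_by_left_verb(pairs, seed, train_frac=0.5, valid_frac=0.25):
    """Split sample pairs so that no left verb is shared across the folds.

    Each pair is a tuple (subject, left_verb, object, pos_verb, neg_verb).
    """
    verbs = sorted({p[1] for p in pairs})
    random.Random(seed).shuffle(verbs)
    n_train = int(len(verbs) * train_frac)
    n_valid = int(len(verbs) * valid_frac)
    train_verbs = set(verbs[:n_train])
    valid_verbs = set(verbs[n_train:n_train + n_valid])
    train = [p for p in pairs if p[1] in train_verbs]
    valid = [p for p in pairs if p[1] in valid_verbs]
    test = [p for p in pairs
            if p[1] not in train_verbs and p[1] not in valid_verbs]
    return train, valid, test

# Hypothetical toy data: eight left verbs with two sample pairs each.
toy_pairs = [("s", "verb%d" % i, "o", "pos%d" % j, "neg%d" % j)
             for i in range(8) for j in range(2)]

# Repeating the experiment with different random splits, as described above.
for seed in range(10):
    train, valid, test = split_by_left_verb(toy_pairs, seed)
    print(seed, len(train), len(valid), len(test))
```
      </preformat>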
      <p>
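        We report accuracy when asking the question: given observation <inline-formula><tex-math>(S, V_l, O)</tex-math></inline-formula>, is <inline-formula><tex-math>(S, V_r^{\mathrm{pos}}, O)</tex-math></inline-formula> more
likely to be a future event than <inline-formula><tex-math>(S, V_r^{\mathrm{neg}}, O)</tex-math></inline-formula>?
We used the 50-dimensional word embeddings from the GloVe tool [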
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] trained on Wikipedia + Gigaword 5, as provided by the authors.<sup>11</sup>
      </p>
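      <p>The accuracy measure amounts to counting how often the positive right event receives a higher score than the negative one. The following Python sketch illustrates this; the function name pairwise_accuracy, the toy scoring table, and its values are hypothetical, and any of the scoring functions discussed above could be plugged in instead.</p>
      <preformat>
```python
def pairwise_accuracy(score, pairs):
    """Fraction of pairs in which the positive event outscores the negative.

    Each pair is a tuple (left_event, positive_right_event,
    negative_right_event); score(left, right) returns a real number.
    """
    correct = sum(1 for left, pos, neg in pairs
                  if score(left, pos) > score(left, neg))
    return correct / len(pairs)

# Hypothetical scores standing in for a trained model.
toy_scores = {
    (("farmer", "plant", "acre"), ("farmer", "harvest", "acre")): 0.7,
    (("farmer", "plant", "acre"), ("farmer", "seed", "acre")): 0.1,
    (("company", "buy", "share"), ("company", "use", "share")): -0.2,
    (("company", "buy", "share"), ("company", "hold", "share")): 0.3,
}

def toy_score(left, right):
    return toy_scores[(left, right)]

toy_pairs = [
    (("farmer", "plant", "acre"), ("farmer", "harvest", "acre"),
     ("farmer", "seed", "acre")),
    (("company", "buy", "share"), ("company", "use", "share"),
     ("company", "hold", "share")),
]

print(pairwise_accuracy(toy_score, toy_pairs))
```
      </preformat>
      <p>Since every pair contains exactly one positive and one negative sample, a random scorer is expected to reach an accuracy of about 50% on this measure.</p>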
      <p>
        The results of our method and previously proposed methods are shown in Table 3, upper half. By
using the false-negative estimate from Section 3.1, we also calculated an estimate of the human
performance on this task.<sup>12</sup>
The results suggest that our proposed model provides good generalization performance that is on par
with the neural network recently proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] (Granroth2016), and SVM ranking with RBF-kernel.
The results support our claim that the happens-before relation can be detected by analogy-based
reasoning.
      </p>
      <sec id="sec-5-1">
        <title>Analysis</title>
        <p>We also compared to four variations of our proposed method. The results are shown in Table 3, lower
half.</p>
        <p>The first two variations use as similarity measure the sum of the word embeddings’ inner products,
i.e. <inline-formula><tex-math>g_\theta</tex-math></inline-formula> in Equation (1) is the identity function, and have no trainable parameters. The variation
denoted by “Memory Comparison Network (max, no parameters)” is a kind of nearest-neighbour
ranking that uses the max function instead of <inline-formula><tex-math>\mathrm{softmax}_\gamma</tex-math></inline-formula>. The second variation, denoted by “Memory
Comparison Network (average, no parameters)”, uses for <inline-formula><tex-math>o^{\mathrm{pos}}</tex-math></inline-formula> and <inline-formula><tex-math>o^{\mathrm{neg}}</tex-math></inline-formula>, in Equation (2), the average
of <inline-formula><tex-math>\mathbf{u}^{\mathrm{pos}}</tex-math></inline-formula> and <inline-formula><tex-math>\mathbf{u}^{\mathrm{neg}}</tex-math></inline-formula>, respectively. The performance of both variations is below that of our proposed method.
Furthermore, we compared to an alternative model where <inline-formula><tex-math>\mathrm{softmax}_\gamma</tex-math></inline-formula> is replaced by the max
function, marked by “(max, trained)” in Table 3, lower half. We also compared to our proposed
model without learning parameters, i.e. the parameters are set to the initial parameters, marked
by “(<inline-formula><tex-math>\mathrm{softmax}_\gamma</tex-math></inline-formula>, initial parameters)” in Table 3, lower half. We can see that the choice of <inline-formula><tex-math>\mathrm{softmax}_\gamma</tex-math></inline-formula>
over max improves performance, and that the training of all parameters with SGD is effective (in
particular, see the improvement on validation data).</p>
        <p><sup>11</sup> http://nlp.stanford.edu/projects/glove/</p>
        <p><sup>12</sup> We assume that distinguishing a false-negative from a true-positive is not possible (i.e. a human needs to
guess), and that all guesses are wrong.</p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption><p>Examples of input relations together with their supporting evidence.</p></caption>
          <table>
            <thead>
              <tr><th>Input relation</th><th>Supporting evidence (→)</th><th>Supporting evidence (↛)</th></tr>
            </thead>
            <tbody>
              <tr><td>(index, climb, percent) → (index, slide, percent)</td><td>(rate, rise, percent) → (rate, tumble, percent)</td><td>(index, finish, point) ↛ (index, slide, point)</td></tr>
              <tr><td>(parliament, discuss, budget) → (parliament, adopt, budget)</td><td>(refiner, introduce, system) → (refiner, adopt, system)</td><td>(union, call, strike) ↛ (union, propose, strike)</td></tr>
              <tr><td>(price, gain, cent) ↛ (price, strengthen, cent)</td><td>(investment, build, plant) → (investment, expand, plant)</td><td>(dollar, rise, yen) ↛ (dollar, strengthen, yen)</td></tr>
              <tr><td>(farmer, plant, acre) ↛ (farmer, seed, acre)</td><td>(refinery, produce, tonne) → (refinery, process, tonne)</td><td>(refinery, produce, tonne) ↛ (refinery, receive, tonne)</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Since our model uses analogy-based reasoning, we can easily identify “supporting evidence” for the
output of our system. Four examples are shown in Table 4. Here, “supporting evidence” denotes the
training sample with the highest similarity <inline-formula><tex-math>\mathrm{sim}_\theta</tex-math></inline-formula> to the input. In the first and second examples, the
input is a happens-before relation; in the third and fourth examples, the input is not a happens-before
relation.<sup>13</sup></p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Discussion</title>
      <p>
        Our current method does not model the interaction between subject, object and verb. However,
temporal relations can also crucially depend on subject and object. As an example, in our data set
(see Table 2), we have the happens-before relation (company, buy, share) → (company, use, share).
Clearly, if we replace the subject by “kid” and the object by “ice-cream”, the happens-before relation
becomes wrong, or much less likely. In particular, (kid, buy, ice-cream) → (kid, use, ice-cream) is
much less likely than, for example, (kid, buy, ice-cream) → (kid, eat, ice-cream).<sup>14</sup>
      </p>
      <p>
Here, we compared two temporal rules <inline-formula><tex-math>r_1</tex-math></inline-formula> and <inline-formula><tex-math>r_2</tex-math></inline-formula> and asked which one is more likely, by ranking
them. However, reasoning in terms of probabilities of future events would allow us to integrate our
predictions into a probabilistic reasoning framework like MLN [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>We investigated how common knowledge, provided by lexical resources, can be generalized and used
to predict future events. In particular, we proposed a memory network that can learn how to compare
and combine the similarity of the input events to event relations saved in memory. This way our
proposed method can generalize to unseen events and also provide evidence for its reasoning. Our
experiments suggest that our method is competitive with other (deep) neural networks and rankSVM.</p>
      <p><sup>13</sup> Since we considered only the head, a unit like “percent” means “x percent”, where x is some number.</p>
      <p><sup>14</sup> Partly, this could be addressed by considering also the selectional preference of verbs like “eat” and “use”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Bing</given-names>
            <surname>Bai</surname>
          </string-name>
          , Jason Weston, David Grangier,
          <string-name>
            <given-names>Ronan</given-names>
            <surname>Collobert</surname>
          </string-name>
          , Kunihiko Sadamasa, Yanjun Qi, Olivier Chapelle, and
          <string-name>
            <given-names>Kilian</given-names>
            <surname>Weinberger</surname>
          </string-name>
          .
          <article-title>Supervised semantic indexing</article-title>
          .
          <source>In Proceedings of the 18th ACM conference on Information and knowledge management</source>
          , pages
          <fpage>187</fpage>
          -
          <lpage>196</lpage>
          . ACM,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Bordes</surname>
          </string-name>
          , Xavier Glorot, Jason Weston, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Joint learning of words and meaning representations for open-text semantic parsing</article-title>
          .
          <source>In International Conference on Artificial Intelligence and Statistics</source>
          , pages
          <fpage>127</fpage>
          -
          <lpage>135</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Nathanael</given-names>
            <surname>Chambers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          .
          <article-title>Unsupervised learning of narrative event chains</article-title>
          .
          <source>In ACL</source>
          , volume
          <volume>94305</volume>
          , pages
          <fpage>789</fpage>
          -
          <lpage>797</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Chklovski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Pantel</surname>
          </string-name>
          .
          <article-title>Verbocean: Mining the web for fine-grained semantic verb relations</article-title>
          .
          <source>In EMNLP</source>
          , volume
          <volume>4</volume>
          , pages
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Christiane</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          and
          <string-name>
            <given-names>George</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>WordNet: An electronic lexical database</article-title>
          . MIT Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Granroth-Wilding</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Clark</surname>
          </string-name>
          .
          <article-title>What happens next? Event prediction using a compositional neural network model</article-title>
          .
          <source>AAAI</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>Optimizing search engines using clickthrough data</article-title>
          .
          <source>In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          . ACM,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>Training linear SVMs in linear time</article-title>
          .
          <source>In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>217</fpage>
          -
          <lpage>226</lpage>
          . ACM,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>David D</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yiming</given-names>
            <surname>Yang</surname>
          </string-name>
          , Tony G Rose, and
          <string-name>
            <given-names>Fan</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>RCV1: A new benchmark collection for text categorization research</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>5</volume>
          :
          <fpage>361</fpage>
          -
          <lpage>397</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Mihai Surdeanu, John Bauer, Jenny Finkel,
          <string-name>
            <given-names>Steven J.</given-names>
            <surname>Bethard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>David</given-names>
            <surname>McClosky</surname>
          </string-name>
          .
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          .
          <source>In ACL System Demonstrations</source>
          , pages
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Ashutosh</given-names>
            <surname>Modi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Titov</surname>
          </string-name>
          .
          <article-title>Inducing neural models of script knowledge</article-title>
          .
          <source>In CoNLL</source>
          , volume
          <volume>14</volume>
          , pages
          <fpage>49</fpage>
          -
          <lpage>57</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Karl</given-names>
            <surname>Pichotta</surname>
          </string-name>
          and
          <string-name>
            <given-names>Raymond J</given-names>
            <surname>Mooney</surname>
          </string-name>
          .
          <article-title>Statistical script learning with multi-argument events</article-title>
          .
          <source>In EACL</source>
          , volume
          <volume>14</volume>
          , pages
          <fpage>220</fpage>
          -
          <lpage>229</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Richardson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pedro</given-names>
            <surname>Domingos</surname>
          </string-name>
          .
          <article-title>Markov logic networks</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>62</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>107</fpage>
          -
          <lpage>136</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Rachel</given-names>
            <surname>Rudinger</surname>
          </string-name>
          , Pushpendre Rastogi, Francis Ferraro, and Benjamin Van Durme.
          <article-title>Script induction as language modeling</article-title>
          .
          <source>In EMNLP</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Craig</given-names>
            <surname>Stanfill</surname>
          </string-name>
          and
          <string-name>
            <given-names>David</given-names>
            <surname>Waltz</surname>
          </string-name>
          .
          <article-title>Toward memory-based reasoning</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>29</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1213</fpage>
          -
          <lpage>1228</lpage>
          ,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          , Sumit Chopra, and
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Bordes</surname>
          </string-name>
          .
          <article-title>Memory networks</article-title>
          .
          <source>ICLR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Nianwen</given-names>
            <surname>Xue</surname>
          </string-name>
          , Hwee Tou Ng, Sameer Pradhan, Rashmi Prasad, Christopher Bryant, and
          <string-name>
            <given-names>Attapol T.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          .
          <article-title>The CoNLL-2015 shared task on shallow discourse parsing</article-title>
          .
          <source>In CoNLL</source>
          , page
          <fpage>2</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>