Analogy-based Reasoning With Memory Networks for Future Prediction

Daniel Andrade

s-andrade@cj.jp.nec.com 1

Ramkumar Rajendran

ramkumar.rajendran@vanderbilt.edu 0

Bing Bai

bbai@nec-labs.com 2

Yotaro Watanabe

y-watanabe@fe.jp.nec.com 1

0 Computer Science Department, Vanderbilt University
1 Data Science Research Laboratories, NEC Corporation, Japan
2 Department of Machine Learning, NEC Laboratories America

Making predictions about what might happen in the future is important for reacting adequately in many situations. For example, observing that “Man kidnaps girl” may have the consequence that “Man kills girl”. While this is part of common-sense reasoning for humans, it is not obvious how machines can learn and generalize over such knowledge automatically. The order in which events are mentioned in documents offers one clue for acquiring such knowledge automatically. Here, we explore another clue, namely the logical and temporal relations of verbs from lexical resources. We argue that it is possible to generalize to unseen events by using the entailment relation between two events expressed as (subject, verb, object) triples. We formulate our hypotheses of analogy-based reasoning for future prediction, and propose a memory network that incorporates these hypotheses. Our evaluation on predicting the next future event shows that the proposed model can be competitive with (deep) neural networks and rankSVM, while giving interpretable answers.

1 Introduction

One might think of learning such knowledge from massive amounts of text data, such as news corpora. However, detecting temporal relations between events is still a difficult problem. Events are often presented in the text in an order that differs from their temporal order. Although the problem can be partially addressed by using temporal markers like “afterwards”, particularly with discourse parsers [ 18 ], overall it remains a challenge.

In this work, we propose to exploit the distinction between logical relations and temporal relations. We note that if an entailment relation holds between two events, then the second event is unlikely to be a new future event (see footnote 4). For example, the phrase “man kissed woman” entails that “man met woman”, where “man met woman” happens before (not after) “man kissed woman”. To find such entailments, we can leverage the relations between verbs in WordNet [ 5 ]. Verbs that tend to be in a temporal (happens-before) relation have been extracted on a large scale and are openly available in VerbOcean [ 4 ]. For example, we observe that (subject, buy, object) tends to temporally precede (subject, use, object).

We present a model that can predict future events given a current event triplet (subject, verb, object). To make the model generalizable to unseen events, we adopt a deep learning structure such that the semantics of unseen events can be learned through word/event embeddings. We present a novel Memory Comparison Network (MCN) that can learn to compare and combine the similarity of input events to the event relations saved in memory. Our evaluation shows that this method is competitive with other (deep) neural networks and rankSVM [ 7 ], while giving interpretable answers.

In the first part of this work, in Section 2, we describe previous work related to future prediction. In Section 3, we discuss some connections between logical and temporal relations, and explain how we use lexical resources to create a knowledge base of positive and negative temporal relations. This knowledge base is then used by our experiments in the second part of our work.

In the second part, in Section 4, we formulate our assumptions of analogy-based reasoning for future prediction. Building on these assumptions, we propose our new method, MCN. In Section 5, we describe several other methods that were previously proposed for future prediction, as well as ranking models that can be easily adapted to this task. In Section 6, we evaluate all methods on a future prediction task that requires reasoning about unseen events. Finally, in Sections 7 and 8, we discuss some current limitations of our proposed method and summarize our conclusions.

2 Related work

One line of research, pioneered by VerbOcean [ 4 ], extracts happens-before relations from large collections of texts using bootstrapping methods. In the context of script learning, corpus statistics, such as event bi-grams, are used to define a probability distribution over possible next events [ 13, 3 ]. However, such models cannot generalize to new events that have not been observed before. Therefore, the more recent methods proposed in [ 11, 15, 6 ] are based on word embeddings. Script learning is traditionally evaluated on small prototypical sequences that were manually created, or on event sequences that were automatically extracted from text. Due to the lack of training data, these models cannot learn that some events mentioned later in the text are actually entailed by previously mentioned events, i.e. already known events and new events are not distinguished.

3 Exploiting lexical resources

Our main focus is on distinguishing future events from other events. In texts such as news stories, an event $e_l$ is more likely to have happened before an event $e_r$ (temporal order) if $e_l$ occurs earlier in the text than $e_r$ (textual order). However, there are also many situations where this is not the case: re-phrasing, introducing background knowledge, conclusions, etc. One obvious solution is to use discourse parsers. However, without explicit temporal markers, they suffer from low recall [ 18 ], and therefore in practice most script-learning systems use textual order as a proxy for temporal order. Here we explore whether common knowledge can help to improve the detection of future events in event sequences given in textual order.

We assume common knowledge is given in the form of simple relations (or rules) like (company, buy, share) $\rightarrow$ (company, use, share), where “$\rightarrow$” denotes the temporal happens-before relation. In contrast, we denote the logical entailment (implication) relation by “$\Rightarrow$”.

To extract such common knowledge rules, we explore the use of the lexical resources WordNet and VerbOcean. As also partly mentioned in [ 5 ], logical and temporal relations are not independent; rather, an interesting overlap exists, as illustrated in Figure 1, with corresponding examples shown in Table 1. We emphasize that, for temporal relations, the situation is not always as clear-cut as shown in Figure 1 (e.g. repeated actions). Nevertheless, event relations tend to belong predominantly to a single relation type. In particular, in the following, we consider “wrong” happens-before relations as less likely to be true than “correct” happens-before relations.

[Figure 1: Overlap between the relations contradiction, entailment, happens-before and happens-after. The numbers (1) to (7) placed in the figure refer to the examples in Table 1.]

Table 1: Example event pairs, numbered as in Figure 1.
(1) “minister leaves factory”, “minister enters factory”
(2) “company donates money”, “company gives money”
(3) “John starts marathon”, “John finishes marathon”
(4) “governor kisses girlfriend”, “governor meets girlfriend”
(5) “people buy apple”, “people use apple”
(6) “minister likes criticism”, “minister hates criticism”
(7) “X’s share falls 10%”, “X’s share rises 10%”

4 We consider here entailment and (logical) implication as equivalent. In particular, synonyms are considered to be in an entailment relation, in contrast to the classification by WordNet.

3.1 Data creation

For simplicity, we restrict our investigation here to events of the form (subject, verb, object). All events are extracted from around 790k news articles in Reuters [ 9 ]. We preprocessed the English Reuters articles using the Stanford dependency parser and co-reference resolution [ 10 ]. We lemmatized all words, and for subjects and objects we considered only the head words, and ignored words like WH-pronouns.

All relations are defined between two events of the form $(S, V_l, O)$ and $(S, V_r, O)$, where the subject $S$ and the object $O$ are the same. As candidates, we consider only pairs of events that occur in sequence in the text.

Positive Samples

We extract positive samples of the form $(S, V_l, O) \rightarrow (S, V_r^{pos}, O)$ if

1. $V_l \rightarrow V_r^{pos}$ is listed in VerbOcean as a happens-before relation.
2. $\neg [V_l \Rightarrow V_r^{pos}]$ according to WordNet. That means, for example, if $(S, V_r, O)$ is paraphrasing $(S, V_l, O)$, then this is not considered a temporal relation.
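The two extraction conditions above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `extract_positive_samples` and the toy lookup sets are hypothetical stand-ins for the VerbOcean happens-before list and the WordNet entailment check, and the pair (donate, give) is placed in the toy happens-before set purely to show condition 2 at work.

```python
from typing import Iterable, List, Set, Tuple

Event = Tuple[str, str, str]   # (subject, verb, object), lemmatized
VerbPair = Tuple[str, str]     # (left verb, right verb)


def extract_positive_samples(
    candidates: Iterable[Tuple[Event, Event]],
    happens_before: Set[VerbPair],
    entails: Set[VerbPair],
) -> List[Tuple[Event, Event]]:
    """Keep candidate event pairs that satisfy conditions 1 and 2."""
    positives = []
    for left, right in candidates:
        # Both events must share subject and object.
        same_args = left[0] == right[0] and left[2] == right[2]
        verbs = (left[1], right[1])
        # Condition 1: listed as happens-before; condition 2: not an entailment.
        if same_args and verbs in happens_before and verbs not in entails:
            positives.append((left, right))
    return positives


# Hypothetical toy data standing in for VerbOcean and WordNet lookups.
toy_happens_before = {("buy", "use"), ("plant", "harvest"), ("donate", "give")}
toy_entails = {("donate", "give")}
toy_candidates = [
    (("company", "buy", "share"), ("company", "use", "share")),
    (("company", "donate", "money"), ("company", "give", "money")),
    (("farmer", "plant", "acre"), ("trader", "harvest", "acre")),
]

positives = extract_positive_samples(toy_candidates, toy_happens_before, toy_entails)
print(positives)
```

In this toy run only the (buy, use) pair survives: the second candidate is rejected as an entailment and the third because the subjects differ.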

This way, we were able to extract 1699 positive samples. Examples are shown in Table 2.

Table 2: Examples of positive samples.
(company, buy, share) $\rightarrow$ (company, use, share)
(ex-husband, stalk, her) $\rightarrow$ (ex-husband, kill, her)
(farmer, plant, acre) $\rightarrow$ (farmer, harvest, acre)

Negative Samples

Using VerbOcean, we extracted negative samples of the form $(S, V_l, O) \nrightarrow (S, V_r^{neg}, O)$, i.e. the event on the left-hand side, $(S, V_l, O)$, is the same as for a positive sample (see footnote 5). This way, we extracted 1177 negative samples.

5 If $(S, V_l, O) \nrightarrow (S, V_r^{neg}, O)$, then $V_l \rightarrow V_r^{neg}$ is not listed in VerbOcean.

There are several reasons why a pair of events may not be in a temporal relation. Using VerbOcean and WordNet we analyzed the negative samples, and found that the majority (1030 relations) could not be classified with either VerbOcean or WordNet. We estimated conservatively that around 27% of these relations are false negatives: for a subset of 100 relations, we labeled a sample as a false negative if it can have an interpretation as a happens-before relation (see footnote 6).

To simplify the task, we created a balanced data set by pairing all positive and negative samples: each sample pair contains one positive and one negative sample, and the task is to determine that the positive sample is more likely to be a happens-before relation than the negative sample. The resulting data set contains 1765 pairs in total.
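One way such sample pairs could be formed is sketched below. The sketch assumes that a positive and a negative sample are paired whenever they share the same left event, which matches the form of the negative samples but is not spelled out in the text; the helper name `build_pairs` and the toy samples are hypothetical.

```python
from collections import defaultdict


def build_pairs(positives, negatives):
    """Pair each positive sample with every negative sample that has the same left event.

    Each sample is a (left_event, right_event) tuple; the result is a list of
    (left_event, right_pos, right_neg) triples.
    """
    neg_by_left = defaultdict(list)
    for left, right_neg in negatives:
        neg_by_left[left].append(right_neg)

    pairs = []
    for left, right_pos in positives:
        for right_neg in neg_by_left.get(left, []):
            pairs.append((left, right_pos, right_neg))
    return pairs


# Hypothetical toy samples.
toy_positives = [
    (("farmer", "plant", "acre"), ("farmer", "harvest", "acre")),
    (("company", "buy", "share"), ("company", "use", "share")),
]
toy_negatives = [
    (("farmer", "plant", "acre"), ("farmer", "seed", "acre")),
]

toy_pairs = build_pairs(toy_positives, toy_negatives)
print(toy_pairs)
```

Under this assumption, a positive sample without any matching negative sample contributes no pair, as happens for the (buy, use) sample in the toy data.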

4 Analogy-based reasoning for happens-before relation scoring

In the following, let $r$ be a happens-before relation of the form

$$r : e_l \rightarrow e_r ,$$

where $e_l$ and $e_r$ are two events of the form $(S, V_l, O)$ and $(S, V_r, O)$, respectively. Furthermore, let $e'$ be any event of the form $(S', V', O')$.

Our working hypothesis consists of the following two claims:

(I) If $(e' \Rightarrow e_l) \wedge (e_l \rightarrow e_r)$, then $e' \rightarrow e_r$.
(II) If $(e' \Rightarrow e_r) \wedge (e_l \rightarrow e_r)$, then $e_l \rightarrow e'$.

For example, consider

“John buys computer” $\Rightarrow$ “John acquires computer”,
“John acquires computer” $\rightarrow$ “John uses computer”.

Using (I), we can reason that:

“John buys computer” $\rightarrow$ “John uses computer”.
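The inference step licensed by (I) can be written down as a small symbolic sketch. The function name `apply_hypothesis_one` and the representation of events as plain strings are illustrative assumptions, not part of the proposed model, which works with word embeddings rather than exact matches.

```python
def apply_hypothesis_one(entailments, happens_before):
    """Derive e' -> e_r from (e' => e_l) and (e_l -> e_r), following claim (I).

    entailments:    set of (e_prime, e_l) pairs, read as e_prime => e_l
    happens_before: set of (e_l, e_r) pairs, read as e_l -> e_r
    """
    derived = set()
    for e_prime, entailed in entailments:
        for e_l, e_r in happens_before:
            if entailed == e_l:
                derived.add((e_prime, e_r))
    return derived


entailments = {("John buys computer", "John acquires computer")}
happens_before = {("John acquires computer", "John uses computer")}

derived = apply_hypothesis_one(entailments, happens_before)
print(derived)
```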

We note that, in some cases, “$\Rightarrow$” in (I) and (II) cannot be replaced by “$\Leftarrow$”. This is illustrated by the following example:

“John knows Sara” $\Leftarrow$ “John marries Sara”,
“John marries Sara” $\rightarrow$ “John divorces from Sara”.

However, the next statement is considered wrong (or less likely to be true):

“John knows Sara” $\rightarrow$ “John divorces from Sara”.

In practice, using word embeddings, it can be difficult to distinguish between “$\Rightarrow$” and “$\Leftarrow$”. Therefore, our proposed method uses the following simplified assumptions:

(I*) If $(e' \sim e_l) \wedge (e_l \rightarrow e_r)$, then $e' \rightarrow e_r$.
(II*) If $(e' \sim e_r) \wedge (e_l \rightarrow e_r)$, then $e_l \rightarrow e'$.

Here, “$\sim$” denotes some similarity that can be measured by means of word embeddings.

6 Therefore, this over-estimates the number of false negatives. This is because it also counts a happens-before relation that is less likely than a happens-after relation as a false negative.

4.1 Memory Comparison Network

We propose a memory-based network model that uses the assumptions (I*) and (II*). It bases its decision on one (or more) training samples that are similar to a test sample. In contrast to other methods, such as neural networks for script learning and (non-linear) SVM ranking models, it has the advantage of giving an explanation of why a relation is considered (or not considered) a happens-before relation.

In the following, let $r_1$ and $r_2$ be two happens-before relations of the form:

$$r_1 : (S_1, V_{l_1}, O_1) \rightarrow (S_1, V_{r_1}, O_1) ,$$
$$r_2 : (S_2, V_{l_2}, O_2) \rightarrow (S_2, V_{r_2}, O_2) .$$

Let $\mathbf{x}_{s_i}$, $\mathbf{x}_{v_{l_i}}$, $\mathbf{x}_{v_{r_i}}$ and $\mathbf{x}_{o_i} \in \mathbb{R}^d$ denote the word embeddings corresponding to $S_i$, $V_{l_i}$, $V_{r_i}$ and $O_i$ (see footnote 7). We define the similarity between two relations $r_1$ and $r_2$ as:

$$\mathrm{sim}_\theta(r_1, r_2) = g_\theta(\mathbf{x}_{v_{l_1}}^T \mathbf{x}_{v_{l_2}}) + g_\theta(\mathbf{x}_{v_{r_1}}^T \mathbf{x}_{v_{r_2}}) , \qquad (1)$$

where $g_\theta$ is an artificial neuron with $\theta = \{\sigma, \beta\}$, a scale $\sigma \in \mathbb{R}$ and a bias $\beta \in \mathbb{R}$ parameter, followed by a non-linearity. We use the sigmoid function as the non-linearity. Furthermore, here we assume that all word embeddings are $l_2$-normalized.

Given the input relation $r : e_l \rightarrow e_r$, we test whether the relation is correct or wrong as follows. Let $n_{pos}$ and $n_{neg}$ denote the number of positive and negative training samples, respectively. First, we compare $r$ to all positive and negative training relations in the training data set, and denote the resulting vectors as $\mathbf{u}^{pos} \in \mathbb{R}^{n_{pos}}$ and $\mathbf{u}^{neg} \in \mathbb{R}^{n_{neg}}$, respectively. Formally,

$$u_t^{pos} = \mathrm{sim}_\theta(r, r_t^{pos}) \quad \text{and} \quad u_t^{neg} = \mathrm{sim}_\theta(r, r_t^{neg}) ,$$

where $r_t^{pos}$ and $r_t^{neg}$ denote the $t$-th positive and negative training sample, respectively.

Next, we define the score that $r$ is correct/wrong as the weighted average of the relation similarities:

$$o^{pos} = \mathrm{softmax}_\gamma(\mathbf{u}^{pos})^T \mathbf{u}^{pos} \quad \text{and} \quad o^{neg} = \mathrm{softmax}_\gamma(\mathbf{u}^{neg})^T \mathbf{u}^{neg} , \qquad (2)$$

where $\mathrm{softmax}_\gamma(\mathbf{u})$ returns a column vector with the $t$-th output defined as

$$\mathrm{softmax}_\gamma(\mathbf{u})_t = \frac{e^{\gamma u_t}}{\sum_i e^{\gamma u_i}} ,$$

and $\gamma \in \mathbb{R}$ is a weighting parameter. Note that for $\gamma \rightarrow \infty$, the weighted average $\mathrm{softmax}_\gamma(\mathbf{u})^T \mathbf{u}$ equals $\max(\mathbf{u})$, and for $\gamma = 0$, $o$ is the average of $\mathbf{u}$.

Finally, we define the happens-before score for $r$ as

$$l(e_l, e_r) = o^{pos}(e_l, e_r) - o^{neg}(e_l, e_r) . \qquad (3)$$

The score $l(e_l, e_r)$ can be considered as an unnormalized log probability that relation $r$ is a happens-before relation. The basic components of the network are illustrated in Figure 2.

For optimizing the parameters of our model we minimize the rank margin loss:

$$L(r^{pos}, r^{neg}) = \max\{0, \; 1 - l(e_l, e_r^{pos}) + l(e_l, e_r^{neg})\} , \qquad (4)$$

where $r^{pos} : e_l \rightarrow e_r^{pos}$ and $r^{neg} : e_l \rightarrow e_r^{neg}$ are positive and negative samples from the held-out training data. All parameters of the model are trained using stochastic gradient descent (SGD). Word embeddings ($\mathbf{x}_s$, $\mathbf{x}_v$, and $\mathbf{x}_o$) are kept fixed during training.
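The forward computation described above, from relation similarity to the rank margin loss, can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the neuron is assumed to have the form sigmoid(scale * x + bias), the parameter names `scale`, `bias` and `gamma` are chosen for readability, and the two-dimensional "embeddings" are hypothetical toy vectors. No training step is shown.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neuron(x, scale, bias):
    # Assumed form of the artificial neuron: sigmoid(scale * x + bias).
    return 1.0 / (1.0 + math.exp(-(scale * x + bias)))


def sim(r1, r2, scale, bias):
    # Relation similarity; a relation is (left verb embedding, right verb embedding).
    return (neuron(dot(r1[0], r2[0]), scale, bias)
            + neuron(dot(r1[1], r2[1]), scale, bias))


def soft_average(u, gamma):
    # softmax_gamma(u)^T u, computed with a shift by max(u) for numerical stability.
    top = max(u)
    weights = [math.exp(gamma * (u_t - top)) for u_t in u]
    return sum(w * u_t for w, u_t in zip(weights, u)) / sum(weights)


def score(r, memory_pos, memory_neg, scale, bias, gamma):
    # Happens-before score: weighted similarity to positive minus negative memory.
    u_pos = [sim(r, m, scale, bias) for m in memory_pos]
    u_neg = [sim(r, m, scale, bias) for m in memory_neg]
    return soft_average(u_pos, gamma) - soft_average(u_neg, gamma)


def rank_margin_loss(score_pos, score_neg):
    return max(0.0, 1.0 - score_pos + score_neg)


# Hypothetical l2-normalized toy embeddings for three verbs.
start, finish, attend = (1.0, 0.0), (0.0, 1.0), (0.8, 0.6)
memory_pos = [(start, finish)]   # stored happens-before relation
memory_neg = [(start, attend)]   # stored negative relation

s_pos = score((start, finish), memory_pos, memory_neg, scale=1.0, bias=0.0, gamma=5.0)
s_neg = score((start, attend), memory_pos, memory_neg, scale=1.0, bias=0.0, gamma=5.0)
loss = rank_margin_loss(s_pos, s_neg)
print(s_pos, s_neg, loss)
```

With these toy vectors the relation that matches the positive memory entry receives a positive score and the one matching the negative entry a negative score; the loss is still non-zero because the margin of 1 is not yet reached.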

Our model can be interpreted as an instance of the Memory Networks proposed in [ 17 ]. Using the notation from [ 17 ], $I(\cdot)$ corresponds to the word embedding lookup, $G(\cdot)$ saves all training samples into the memory, the $O(\cdot)$ function corresponds to $(o^{pos}, o^{neg})$, and the output of $R(\cdot)$ equals Equation (3).

Our model is also similar to the memory-based reasoning system proposed in [ 16 ], with two differences. First, we use here a trainable similarity measure, see Equation (1), rather than a fixed distance measure. Second, we use the trainable $\mathrm{softmax}_\gamma$ rather than max.

7 Remark about our notation: we use bold fonts, like $\mathbf{v}$, to denote a column vector; $\mathbf{v}^T$ to denote the transpose; and $v_t$ to denote the $t$-th dimension of $\mathbf{v}$.

[Figure 2: Basic components of the network. The input relation (“Sara starts Tokyo-Marathon”, “Sara withdraws from Tokyo-Marathon”) is compared by relation similarity with relations in memory such as (“John starts marathon”, “John finishes marathon”) and (“John starts marathon”, “John attends marathon”); the similarities to positive relations and to negative relations are each combined by a weighted average, and their difference gives the score.]

We also investigate several other models that can be applied for ranking temporal relations. All models that we consider are based on word embeddings in order to be able to generalize to unseen events.

Our first model is based on the bilinear model proposed in [ 1 ] for document retrieval, with scoring function $l(e_l, e_r) = \mathbf{z}_l^T M \mathbf{z}_r$, where $\mathbf{z}_l$ and $\mathbf{z}_r$ are the concatenated word embeddings $(\mathbf{x}_s, \mathbf{x}_{v_l}, \mathbf{x}_o)$ and $(\mathbf{x}_s, \mathbf{x}_{v_r}, \mathbf{x}_o)$, respectively, and $M \in \mathbb{R}^{3d \times 3d}$ is the parameter matrix. We denote this model as Bai2009.

We also test three neural network architectures that were proposed in different contexts. The model in [ 2 ], originally proposed for semantic parsing, is a three-layer network that can learn non-linear combinations of (subject, verb) and (verb, object) pairs. The non-linearity is achieved by the Hadamard product of the hidden layers. The original network can handle only events (relations between verbs and objects, but not relations between events). We recursively extend the model to handle relations between events. We denote the model as Bordes2012.
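As a concrete illustration of the bilinear scoring function used by Bai2009, the following sketch evaluates the product of the left event vector, the parameter matrix and the right event vector on toy inputs. The helper names, the embedding dimension d = 1 and the identity parameter matrix are hypothetical choices made only to keep the arithmetic easy to follow; in the actual model the matrix is learned.

```python
def concat(x_s, x_v, x_o):
    """Concatenate subject, verb and object embeddings into one event vector."""
    return list(x_s) + list(x_v) + list(x_o)


def bilinear_score(z_l, M, z_r):
    """Compute z_l^T M z_r for list-based vectors and a row-major matrix M."""
    M_z_r = [sum(m_ij * z_j for m_ij, z_j in zip(row, z_r)) for row in M]
    return sum(a * b for a, b in zip(z_l, M_z_r))


# Toy embeddings with d = 1 (hypothetical values).
x_s, x_vl, x_vr, x_o = [1.0], [2.0], [3.0], [4.0]
z_l = concat(x_s, x_vl, x_o)   # [1.0, 2.0, 4.0]
z_r = concat(x_s, x_vr, x_o)   # [1.0, 3.0, 4.0]

# 3d x 3d parameter matrix; the identity reduces the score to a dot product.
M = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

bilinear = bilinear_score(z_l, M, z_r)
print(bilinear)  # 1*1 + 2*3 + 4*4 = 23.0
```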

In the context of script learning, two neural networks have recently been proposed for detecting happens-before relations. The model proposed in [ 11 ] (here denoted Modi2014) learns event embeddings parameterized with verb- and context-dependent (subject or object) embedding matrices. The event embeddings are then mapped to a score that indicates temporal order. To score a relation between events, we use the dot product between the two events’ embeddings (see footnote 8). The model in [ 6 ] uses a deeper architecture than [ 11 ]. Their model (denoted here Granroth2016) additionally uses two non-linear layers for combining the left and right events.

All neural networks and the Bai2009 model were trained in the same way as the proposed method, i.e. optimized with respect to the rank margin loss in Equation (4) (see footnote 9). For all methods, we kept the word embeddings fixed (i.e. no training), since this improved performance in general.

Our final two models use the rankSVM algorithm proposed in [ 7 ] with the implementation from [ 8 ]. We tested both a linear and an RBF kernel, with the hyper-parameters optimized via grid search. To represent a sample, we concatenate the embeddings of all words in the relation.

6 Experiments

We split the data set into training (around 50%), validation (around 25%), and test (around 25%) sets. Due to the relatively small size of the data, we repeated each experiment 10 times with different random splits (training/validation/test).

8 We also tried two variations: left and right events with different and with the same parameterization. However, the results did not change significantly.

9 Originally, the model in [ 6 ] was optimized with respect to negative log-likelihood; however, in our setting we found that rank margin loss performed better.

For the bilinear model and all neural networks, we performed up to 2000 epochs and used early stopping with respect to the validation set. Some models were quite sensitive to the choice of the learning rate, so we tested 0.00001, 0.0001, and 0.001, and report the best results on the validation set.

For our proposed method, we kept the learning rate fixed at 0.001. Furthermore, we note that our proposed method requires two types of training data: one that is held in memory, and another that is used for learning the parameters. For the former and the latter we used the training and validation fold, respectively. As initial parameters for this non-convex optimization problem we set the scale $\sigma = 1.0$, the bias $\beta = 0.5$, and the softmax weight $\gamma = 5.0$, which were selected via the validation set.

For testing, we consider the challenging scenario, where the left event of the sample contains a verb that is not contained in the training set (and also not in the validation set).
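One way to obtain such a split is to partition the data by the verb of the left event, so that no left verb is shared between folds. The sketch below illustrates this idea; how the splits were actually produced is not described here, so the function `split_by_left_verb`, its arguments and the toy pairs are assumptions. Note that the fractions apply to the number of distinct verbs, so the resulting fold sizes only approximate 50/25/25.

```python
import random


def split_by_left_verb(pairs, seed, train_frac=0.5, valid_frac=0.25):
    """Split (left_event, right_pos, right_neg) pairs so that folds share no left verb."""
    verbs = sorted({left[1] for left, _, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(verbs)

    n_train = int(len(verbs) * train_frac)
    n_valid = int(len(verbs) * valid_frac)
    train_verbs = set(verbs[:n_train])
    valid_verbs = set(verbs[n_train:n_train + n_valid])

    train, valid, test = [], [], []
    for pair in pairs:
        verb = pair[0][1]
        if verb in train_verbs:
            train.append(pair)
        elif verb in valid_verbs:
            valid.append(pair)
        else:
            test.append(pair)
    return train, valid, test


# Hypothetical toy pairs with eight distinct left verbs.
toy_data = [
    (("s", "verb%d" % i, "o"), ("s", "pos", "o"), ("s", "neg", "o"))
    for i in range(8)
]

# Ten different random splits, one per seed.
splits = [split_by_left_verb(toy_data, seed) for seed in range(10)]
train, valid, test = splits[0]
print(len(train), len(valid), len(test))
```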

We report accuracy when asking the question: given the observation $(S, V_l, O)$, is $(S, V_r^{pos}, O)$ more likely to be a future event than $(S, V_r^{neg}, O)$? We used the 50-dimensional word embeddings from the GloVe tool [ 12 ], trained on Wikipedia + Gigaword 5 and provided by the authors (see footnote 11).
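The accuracy measure described above amounts to counting the pairs in which the positive right-hand event receives the higher score. A minimal sketch follows; the function `pairwise_accuracy`, the toy scoring table and the toy pairs are hypothetical and only serve to make the computation concrete.

```python
def pairwise_accuracy(pairs, score):
    """Fraction of pairs in which the positive event outscores the negative event.

    pairs: list of (left_event, right_pos, right_neg)
    score: function (left_event, right_event) -> float
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(
        1 for left, right_pos, right_neg in pairs
        if score(left, right_pos) > score(left, right_neg)
    )
    return correct / len(pairs)


# Hypothetical scores of a toy model, indexed by (left verb, right verb).
toy_scores = {
    ("buy", "use"): 0.9, ("buy", "sell"): 0.2,
    ("plant", "harvest"): 0.1, ("plant", "seed"): 0.4,
}


def toy_score(left, right):
    return toy_scores[(left[1], right[1])]


toy_eval_pairs = [
    (("company", "buy", "share"), ("company", "use", "share"), ("company", "sell", "share")),
    (("farmer", "plant", "acre"), ("farmer", "harvest", "acre"), ("farmer", "seed", "acre")),
]

accuracy = pairwise_accuracy(toy_eval_pairs, toy_score)
print(accuracy)  # the toy model ranks one of the two pairs correctly
```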

The results of our method and previously proposed methods are shown in Table 3, upper half. By using the false-negative estimate from Section 3.1, we also calculated an estimate of the human performance on this task (see footnote 12). The results suggest that our proposed model provides good generalization performance that is on par with the neural network recently proposed in [ 6 ] (Granroth2016) and with SVM ranking with an RBF kernel. The results support our claim that the happens-before relation can be detected by analogy-based reasoning.

6.1 Analysis

We also compared to four variations of our proposed method. The results are shown in Table 3, lower half.

The first two variations use as similarity measure the sum of the word embeddings’ inner products, i.e. $g_\theta$ in Equation (1) is the identity function, and have no trainable parameters. The variation denoted by “Memory Comparison Network (max, no parameters)” is a kind of nearest-neighbour ranking that uses the max function instead of $\mathrm{softmax}_\gamma$. The second variation, denoted by “Memory Comparison Network (average, no parameters)”, uses for $o^{pos}$ and $o^{neg}$, in Equation (2), the average of $\mathbf{u}^{pos}$ and $\mathbf{u}^{neg}$, respectively. The performance of both variations is below that of our proposed method.

Furthermore, we compared to an alternative model, where the $\mathrm{softmax}_\gamma$ is replaced by the max function, marked by “(max, trained)” in Table 3, lower half. We also compared to our proposed model without learning parameters, i.e. with the parameters set to the initial parameters, marked by “(softmax, initial parameters)” in Table 3, lower half. We can see that the choice of $\mathrm{softmax}_\gamma$ over max improves performance, and that the training of all parameters with SGD is effective (in particular, see the improvement on validation data).

Table 4: Examples of input relations together with their supporting evidence.

input relation: (index, climb, percent) $\rightarrow$ (index, slide, percent)
supporting evidence: (rate, rise, percent) $\rightarrow$ (rate, tumble, percent)
supporting evidence: (index, finish, point) $\nrightarrow$ (index, slide, point)

input relation: (parliament, discuss, budget) $\rightarrow$ (parliament, adopt, budget)
supporting evidence: (refiner, introduce, system) $\rightarrow$ (refiner, adopt, system)
supporting evidence: (union, call, strike) $\nrightarrow$ (union, propose, strike)

input relation: (price, gain, cent) $\nrightarrow$ (price, strengthen, cent)
supporting evidence: (investment, build, plant) $\rightarrow$ (investment, expand, plant)
supporting evidence: (dollar, rise, yen) $\nrightarrow$ (dollar, strengthen, yen)

input relation: (farmer, plant, acre) $\nrightarrow$ (farmer, seed, acre)
supporting evidence: (refinery, produce, tonne) $\rightarrow$ (refinery, process, tonne)
supporting evidence: (refinery, produce, tonne) $\nrightarrow$ (refinery, receive, tonne)

11 http://nlp.stanford.edu/projects/glove/

12 We assume that distinguishing a false negative from a true positive is not possible (i.e. a human needs to guess), and that all guesses are wrong.

Since our model uses analogy-based reasoning, we can easily identify “supporting evidence” for the output of our system. Four examples are shown in Table 4. Here, “supporting evidence” denotes the training sample with the highest similarity $\mathrm{sim}_\theta$ to the input. In the first and second example, the input is a happens-before relation; in the third and fourth example, the input is not a happens-before relation (see footnote 13).
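Retrieving supporting evidence amounts to a nearest-neighbour lookup over the relations held in memory. The sketch below illustrates this with the parameter-free similarity (the sum of the verb embeddings' inner products) rather than the trained similarity; the function names and the two-dimensional toy embeddings are hypothetical, while the example relations are taken from Table 4.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relation_similarity(r1, r2, emb):
    """Parameter-free similarity: sum of inner products of left and right verb embeddings."""
    (_, vl1, _), (_, vr1, _) = r1
    (_, vl2, _), (_, vr2, _) = r2
    return dot(emb[vl1], emb[vl2]) + dot(emb[vr1], emb[vr2])


def supporting_evidence(query, memory, emb):
    """Return the relation in memory that is most similar to the query relation."""
    return max(memory, key=lambda rel: relation_similarity(query, rel, emb))


# Hypothetical l2-normalized toy embeddings (not GloVe vectors).
toy_emb = {
    "climb": (1.0, 0.0), "rise": (0.96, 0.28),
    "slide": (0.0, 1.0), "tumble": (0.28, 0.96),
    "introduce": (0.6, -0.8), "adopt": (-0.8, 0.6),
}

memory = [
    (("rate", "rise", "percent"), ("rate", "tumble", "percent")),
    (("refiner", "introduce", "system"), ("refiner", "adopt", "system")),
]
query = (("index", "climb", "percent"), ("index", "slide", "percent"))

evidence = supporting_evidence(query, memory, toy_emb)
print(evidence)
```

With these toy vectors the (rise, tumble) relation is returned, since its verbs are closest to (climb, slide).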

7 Discussion

Our current method does not model the interaction between subject, object and verb. However, temporal relations can also crucially depend on subject and object. As an example, in our data set (see Table 2), we have the happens-before relation (company, buy, share) $\rightarrow$ (company, use, share). Clearly, if we replace the subject by “kid” and the object by “ice-cream”, the happens-before relation becomes wrong, or much less likely. In particular, (kid, buy, ice-cream) $\rightarrow$ (kid, use, ice-cream) is much less likely than, for example, (kid, buy, ice-cream) $\rightarrow$ (kid, eat, ice-cream) (see footnote 14).

Here, we compared two temporal rules $r_1$ and $r_2$ and asked which one is more likely, by ranking them. However, reasoning in terms of probabilities of future events would allow us to integrate our predictions into a probabilistic reasoning framework like MLN [ 14 ].

8 Conclusions

We investigated how common knowledge, provided by lexical resources, can be generalized and used to predict future events. In particular, we proposed a memory network that can learn how to compare and combine the similarity of the input events to event relations saved in memory. This way, our proposed method can generalize to unseen events and also provide evidence for its reasoning. Our experiments suggest that our method is competitive with other (deep) neural networks and rankSVM.

13 Since we considered only the head, a unit like “percent” means “x percent”, where x is some number.

14 Partly, this could be addressed by also considering the selectional preference of verbs like “eat” and “use”.

References

[1] Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Olivier Chapelle, and Kilian Weinberger. Supervised semantic indexing. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 187-196. ACM, 2009.

[2] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. Joint learning of words and meaning representations for open-text semantic parsing. In International Conference on Artificial Intelligence and Statistics, pages 127-135, 2012.

[3] Nathanael Chambers and Daniel Jurafsky. Unsupervised learning of narrative event chains. In ACL, volume 94305, pages 789-797, 2008.

[4] Timothy Chklovski and Patrick Pantel. VerbOcean: Mining the web for fine-grained semantic verb relations. In EMNLP, volume 4, pages 33-40, 2004.

[5] Christiane Fellbaum and George Miller. WordNet: An electronic lexical database. MIT Press, 1998.

[6] Mark Granroth-Wilding and Stephen Clark. What happens next? Event prediction using a compositional neural network model. AAAI, 2016.

[7] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133-142. ACM, 2002.

[8] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217-226. ACM, 2006.

[9] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361-397, 2004.

[10] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations, pages 55-60, 2014.

[11] Ashutosh Modi and Ivan Titov. Inducing neural models of script knowledge. In CoNLL, volume 14, pages 49-57, 2014.

[12] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing, pages 1532-1543, 2014.

[13] Karl Pichotta and Raymond J. Mooney. Statistical script learning with multi-argument events. In EACL, volume 14, pages 220-229, 2014.

[14] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62(1-2):107-136, 2006.

[15] Rachel Rudinger, Pushpendre Rastogi, Francis Ferraro, and Benjamin Van Durme. Script induction as language modeling. In EMNLP, 2015.

[16] Craig Stanfill and David Waltz. Toward memory-based reasoning. Communications of the ACM, 29(12):1213-1228, 1986.

[17] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. ICLR 2015, 2015.

[18] Nianwen Xue, Hwee Tou Ng, Sameer Pradhan, Rashmi Prasad, Christopher Bryant, and Attapol Rutherford. The CoNLL-2015 shared task on shallow discourse parsing. In CoNLL, page 2, 2015.