MEDEA: Merging Event knowledge and Distributional vEctor Addition

Ludovica Pannitto, CoLing Lab, University of Pisa, ellepannitto@gmail.com
Alessandro Lenci, CoLing Lab, University of Pisa, alessandro.lenci@unipi.it

Abstract

English. The great majority of compositional models in distributional semantics present methods to compose distributional vectors or tensors into a representation of the sentence. Here we propose to enrich the best performing method (vector addition, which we take as a baseline) with distributional knowledge about events, outperforming our baseline.

Italiano. Most of the models proposed in compositional distributional semantics rely only on lexical vectors. We propose to enrich the best model in the literature (vector addition, which we take as a baseline) with distributional information about the events elicited by the sentence, systematically improving on the baseline results.

1 Compositional Distributional Semantics: Beyond vector addition

Composing word representations into larger phrases and sentences notoriously represents a big challenge for distributional semantics (Lenci, 2018). Various approaches have been proposed, ranging from simple arithmetic operations on word vectors (Mitchell and Lapata, 2008), to algebraic compositional functions on higher-order objects (Baroni et al., 2014; Coecke et al., 2010), as well as neural network approaches (Socher et al., 2010; Mikolov et al., 2013).

Among all proposed compositional functions, vector addition still shows the best performance on various tasks (Asher et al., 2016; Blacoe and Lapata, 2012; Rimell et al., 2016), beating more complex methods such as the Lexical Functional Model (Baroni et al., 2014). However, the success of vector addition is quite puzzling from the linguistic and cognitive point of view: the meaning of a complex expression is not simply the sum of the meanings of its parts, and the contribution of a lexical item might differ depending on its syntactic as well as pragmatic context.

The majority of available models in the literature assumes the meaning of complex expressions like sentences to be a vector (i.e., an embedding) projected from the vectors representing the content of their lexical parts. However, as pointed out by Erk and Padó (2008), while vectors serve well the purpose of capturing semantic relatedness among lexemes, they might not be the best choice for more complex linguistic expressions, because of the limited and fixed amount of information that can be encoded. Moreover, events and situations, expressed through sentences, are by definition inherently complex and structured semantic objects. Indeed, the equation "meaning is a vector" is arguably too limited even at the lexical level.

Psycholinguistic evidence shows that lexical items activate a great amount of generalized event knowledge (GEK) (Elman, 2011; Hagoort and van Berkum, 2007; Hare et al., 2009), and that this knowledge is crucially exploited during online language processing, constraining the speakers' expectations about upcoming linguistic input (McRae and Matsuki, 2009). GEK is tied to the idea that the lexicon is not organized as a dictionary, but rather as a network, in which words trigger expectations about the upcoming input, influenced by pragmatic as well as lexical knowledge. Sentence comprehension can therefore be phrased as the identification of the event that best explains the linguistic cues in the input (Kuperberg and Jaeger, 2016).
In this paper, we introduce MEDEA, a compositional distributional model of sentence meaning which integrates vector addition with the GEK activated by lexical items. MEDEA is directly inspired by the model in Chersoni et al. (2017a) and relies on two major assumptions:

• lexical items are represented with embeddings within a network of syntagmatic relations encoding prototypical knowledge about events;

• the semantic representation of a sentence is a structured object that incrementally integrates the semantic information cued by lexical items.

We test MEDEA on two datasets for compositional distributional semantics in which addition has proven to be very hard to beat. At least, before meeting MEDEA.

2 Introducing MEDEA

MEDEA consists of two main components: i.) a Distributional Event Graph (DEG) that models a fragment of semantic memory activated by lexical units (Section 2.1); ii.) a Meaning Composition Function that dynamically integrates information activated from DEG to build a sentence semantic representation (Section 2.2).

2.1 Distributional Event Graph

We assume a broad notion of event, corresponding to any configuration of entities, actions, properties, and relationships. Accordingly, an event can be a complex relationship between entities, such as the one expressed by the sentence The student read a book, but also the association between an individual and a property, as expressed by the noun phrase heavy book.

In order to represent the GEK cued by lexical items during sentence comprehension, we explored a graph-based implementation of a distributional model, for both theoretical and methodological reasons: in graphs, structural-syntactic information and lexical information can naturally coexist and be related; moreover, vectorial distributional models often struggle with the modeling of dynamic phenomena, as it is often difficult to update the recorded information, while graphs are more suitable for situations where relations among items change over time. The data structure would ideally keep track of each event automatically retrieved from corpora, thus indirectly containing information about schematic or underspecified events, obtained by abstracting over one or more participants of each recorded instance. Events are cued by all their potential participants. The nodes of DEG are lexical embeddings, and edges link lexical items participating in the same events (i.e., their syntagmatic neighbors). Edges are weighted with respect to the statistical salience of the event given the item. Weights, expressed in terms of a statistical association measure such as Local Mutual Information, determine the strength with which an event is activated by linguistic cues.

In order to build DEG, we automatically harvested events from corpora, using syntactic relations as an approximation of the semantic roles of event participants. From a dependency-parsed sentence we identified an event by selecting a semantic head (verb or noun) and grouping all its syntactic dependents together (Figure 1). Since we expect each participant to be able to trigger the event, and consequently any of the other participants, a relation can be created and added to the graph from each subset of each group extracted from the sentence.

Figure 1: Dependency analysis for the sentence The student is reading the book about Shakespeare in the university library. Three events are identified (dotted boxes).

The resulting structure is therefore a weighted hypergraph, as it contains relations holding among groups of nodes, and a labeled multigraph, since each edge or hyperedge is labeled in order to represent the syntactic pattern holding in the group.
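As a concrete illustration of the harvesting step, the following sketch (a minimal Python fragment; the parse format and helper names are our own simplification, not part of a released MEDEA implementation) groups the dependents of each semantic head into an event and records a weighted relation for every subset of the participant group, which is what makes the resulting structure a weighted, labeled hypergraph.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy parse of "The student is reading the book in the library",
# given as (head_lemma/POS, dependent_lemma/POS, syntactic_relation) triples.
PARSE = [
    ("read/V", "student/N", "nsbj"),
    ("read/V", "book/N", "dobj"),
    ("read/V", "library/N", "nmod"),
]

def harvest_events(parse):
    """Group all syntactic dependents of each semantic head into one event."""
    events = defaultdict(list)
    for head, dep, rel in parse:
        events[head].append((dep, rel))
    return events

def update_deg(deg, parse):
    """Add a relation for every subset of an event's participant group.

    Each hyperedge is keyed by its participants and by the syntactic pattern
    holding among them; the raw count stored here would later be rescaled
    into an LMI-style association weight (see Section 3.2).
    """
    for head, deps in harvest_events(parse).items():
        participants = [(head, "head")] + deps
        for size in range(2, len(participants) + 1):
            for group in combinations(participants, size):
                words = frozenset(w for w, _ in group)
                pattern = tuple(sorted(r for _, r in group))
                deg[(words, pattern)] += 1.0
    return deg

deg = update_deg(defaultdict(float), PARSE)
for (words, pattern), count in sorted(deg.items(), key=str):
    print(sorted(words), pattern, count)
```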
As graph nodes are embeddings, given a lexical cue w, DEG can be queried in two modes:

• retrieving the most similar nodes to w (i.e., its paradigmatic neighbors), using a standard vector similarity measure like the cosine (Table 1, top row);

• retrieving the closest associates of w (i.e., its syntagmatic neighbors), using the weights on the graph edges (Table 1, bottom row).

para. neighbors: essay/N, anthology/N, novel/N, author/N, publish/N, biography/N, autobiography/N, nonfiction/N, story/N, novella/N
synt. neighbors: publish/V, write/V, read/V, include/V, child/N, series/N, have/V, buy/V, author/N, contain/V

Table 1: The 10 nearest paradigmatic (top) and syntagmatic (bottom) neighbours of book/N, extracted from DEG. By further restricting the query on the graph neighbors, we can obtain for instance the typical subjects of book as a direct object (people/N, child/N, student/N, etc.).
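The two query modes can be emulated in a few lines of code. The sketch below is only meant to contrast ranking by vector similarity (paradigmatic neighbors) with ranking by edge weight (syntagmatic neighbors); the embedding dictionary and the toy DEG entry are hypothetical, and the DEG format is the one used in the previous fragment.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def paradigmatic_neighbors(word, embeddings, k=10):
    """Most similar nodes to `word` in the embedding space (cosine)."""
    target = embeddings[word]
    scored = [(other, cosine(target, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

def syntagmatic_neighbors(word, deg, k=10):
    """Closest associates of `word`, ranked by the weights on the graph edges."""
    scored = {}
    for (words, _pattern), weight in deg.items():
        if word in words:
            for other in words - {word}:
                scored[other] = max(scored.get(other, 0.0), weight)
    return sorted(scored.items(), key=lambda x: x[1], reverse=True)[:k]

# Toy usage with made-up data (real DEG weights would be LMI scores).
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=100) for w in ["book/N", "novel/N", "read/V"]}
deg = {(frozenset({"book/N", "read/V"}), ("dobj", "head")): 8.5}
print(paradigmatic_neighbors("book/N", embeddings, k=2))
print(syntagmatic_neighbors("book/N", deg, k=2))
```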
2.2 Meaning Composition Function

In MEDEA, we model sentence comprehension as the creation of a semantic representation SR, which includes two different yet interacting information tiers that are equally relevant in the overall representation of sentence meaning: i.) the lexical meaning component (LM), a context-independent tier of sentence meaning that accumulates the lexical content of the sentence, as traditional models do; ii.) an active context (AC), which aims at representing the most probable event, in terms of its participants, that can be reconstructed from the DEG portions cued by lexical items. This latter component corresponds to the GEK activated by the single lexemes (or by other contextual elements) and integrated into a semantically coherent structure representing the sentence interpretation. It is incrementally updated during processing, as new input is integrated into the existing information.

2.2.1 Active Context

Each lexical item in the input activates a portion of GEK that is integrated into the current AC through a process of mutual re-weighting that aims at maximizing the overall semantic coherence of the SR. At the outset, no information is contained in the AC of the sentence. When a new lexeme - syntactic role pair ⟨w_i, r_i⟩ (e.g., student - nsbj) is encountered, expectations about the set of upcoming roles in the sentence are generated from DEG (Figure 2). These include: i.) expectations about the role filled by the lexeme itself, which consist of its vector (and possibly its p-neighbours); ii.) expectations about the sentence structure and the other participants, which are collected in weighted lists of vectors of its s-neighbours.

These expectations are then weighted with respect to what is already in the AC, and the AC is similarly adapted to the newly retrieved information: each weighted list is represented by the weighted centroid of its top elements, and each element of a weighted list is re-ranked according to its cosine similarity with the corresponding centroid (e.g., the newly retrieved weighted list of subjects is ranked according to the cosine similarity of each item in the list with the weighted centroid of the subjects already available in AC).

The final semantic representation of a sentence consists of two vectors: the lexical meaning vector (LM) and the event knowledge vector (AC), which is obtained by composing the weighted centroids of each role in AC.

Figure 2: Internal architecture of a piece of EK retrieved from DEG. The interface with DEG is shown on the left side of the picture; each internal list of neighbors is labeled with its expected syntactic role in the sentence. All the items are intended to be embeddings.
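A minimal sketch of the AC update is given below. It assumes that the expectations retrieved from DEG arrive as weighted lists of (vector, association weight) pairs, one per expected syntactic role; the rule used to merge a new centroid with the one already stored for a role (simple averaging) is a simplification of ours, standing in for the mutual re-weighting step, and all helper names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def weighted_centroid(scored):
    """Weighted average of a list of (vector, weight) pairs."""
    vectors = np.array([v for v, _ in scored])
    weights = np.array([w for _, w in scored])
    return (vectors * weights[:, None]).sum(axis=0) / weights.sum()

def update_active_context(ac, expectations, top_k=20):
    """Integrate newly activated expectations into the active context (AC).

    `ac` maps each syntactic role to its current weighted centroid;
    `expectations` maps a role to the weighted list of (vector, weight)
    pairs just retrieved from DEG. Each new list is re-ranked by cosine
    similarity with the centroid already stored for that role, truncated,
    and summarized by its own weighted centroid.
    """
    for role, scored in expectations.items():
        if role in ac:
            scored = sorted(scored, key=lambda vw: cosine(vw[0], ac[role]),
                            reverse=True)
        new_centroid = weighted_centroid(scored[:top_k])
        # Averaging old and new centroids is a placeholder integration rule.
        ac[role] = new_centroid if role not in ac else (ac[role] + new_centroid) / 2.0
    return ac

def ac_vector(ac):
    """Event knowledge vector: composition (here, a sum) of the role centroids."""
    return np.sum(list(ac.values()), axis=0)

def lm_vector(word_vectors):
    """Lexical meaning vector: plain sum of the word embeddings."""
    return np.sum(word_vectors, axis=0)
```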
3 Experiments

3.1 Datasets

We wanted to evaluate the contribution of activated event knowledge in a sentence comprehension task. For this reason, among the many existing datasets concerning entailment or paraphrase detection, we chose RELPRON (Rimell et al., 2016), a dataset of subject and object relative clauses, and the transitive sentence similarity dataset presented in Kartsaklis and Sadrzadeh (2014). These two datasets show an intermediate level of grammatical complexity, as they involve complete sentences (while other datasets include smaller phrases), but have fixed-length structures featuring similar syntactic constructions (i.e., transitive sentences). The two datasets differ with respect to size and construction method.

RELPRON consists of 1,087 pairs, split into a development and a test set, each made up of a target noun labeled with a syntactic role (either subject or direct object) and a property expressed as [head noun] that [verb] [argument]. For instance, here are some example properties for the target noun treaty:

(1) a. OBJ treaty/N: document/N that delegation/N negotiate/V
    b. SBJ treaty/N: document/N that grant/V independence/N

The transitive sentence similarity dataset consists of 108 pairs of transitive sentences, each annotated with human similarity judgments collected through the Amazon Mechanical Turk platform. Each transitive sentence is composed of a subject verb object triplet. Here are two pairs with high (2) and low (3) similarity scores respectively:

(2) a. government use power
    b. authority exercise influence

(3) a. team win match
    b. design reduce amount

3.2 Graph implementation

We tailored the construction of the DEG to this kind of simple syntactic structure, restricting it to relations among pairs of event participants. Relations were automatically extracted from a 2018 dump of Wikipedia, the BNC, and the ukWaC corpora, parsed with the Stanford CoreNLP pipeline (Manning et al., 2014). Each ⟨(word_1, word_2), (r_1, r_2)⟩ pair was then weighted with a smoothed version of Local Mutual Information [1]:

LMI_\alpha(w_1, w_2, r_1, r_2) = f(w_1, w_2, r_1, r_2) \log \frac{\hat{P}(w_1, w_2, r_1, r_2)}{\hat{P}(w_1)\,\hat{P}_\alpha(w_2)\,\hat{P}(r_1, r_2)}    (1)

where:

\hat{P}_\alpha(x) = \frac{f(x)^\alpha}{\sum_x f(x)^\alpha}    (2)

Each lexical node in DEG was then represented with its embedding. We used the same training parameters as in Rimell et al. (2016) [2], since we wanted our model to be directly comparable with their results on the dataset. While Rimell et al. (2016) built the vectors from a 2015 download of Wikipedia, we needed to cover all the lexemes contained in the graph and therefore used the same corpora from which the DEG was extracted.

We represented each property in RELPRON as a triplet ((hn, r), (w_1, r_1), (w_2, r_2)), where hn is the head noun, w_1 and w_2 are the lexemes that compose the relative clause proper, and each element of the triplet is associated with its syntactic role in the property sentence [3]. Likewise, each sentence of the transitive sentences dataset is a triplet ((w_1, nsbj), (w_2, root), (w_3, dobj)).

[1] The smoothed version (with α = 0.75) was chosen in order to alleviate PMI's bias towards rare words (Levy et al., 2015), which arises especially when extending the graph to more complex structures than pairs.
[2] Lemmatized 100-dimensional vectors trained with skip-gram with negative sampling (SGNS; Mikolov et al., 2013), setting the minimum item frequency at 100 and the context window size at 10.
[3] The relation for the head noun is assumed to be the same as the target relation (either subject or direct object of the relative clause).
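Equations (1) and (2) can be computed directly from co-occurrence counts. The sketch below does so for the pair-based relations used here; the input format (a list of ((w1, w2), (r1, r2)) tuples) and the function names are our own simplification.

```python
import math
from collections import Counter

def smoothed_prob(counts, alpha=0.75):
    """Equation (2): P_alpha(x) = f(x)^alpha / sum_x f(x)^alpha."""
    total = sum(c ** alpha for c in counts.values())
    return {x: (c ** alpha) / total for x, c in counts.items()}

def lmi_scores(observations, alpha=0.75):
    """Equation (1): smoothed LMI for every observed ((w1, w2), (r1, r2)) relation."""
    joint = Counter(observations)
    n = sum(joint.values())
    w1_counts = Counter(w1 for (w1, _), _ in observations)
    w2_counts = Counter(w2 for (_, w2), _ in observations)
    rel_counts = Counter(rels for _, rels in observations)
    p_w1 = {w: c / n for w, c in w1_counts.items()}
    p_w2 = smoothed_prob(w2_counts, alpha)  # only the second word is smoothed, as in Eq. (1)
    p_rel = {r: c / n for r, c in rel_counts.items()}
    scores = {}
    for ((w1, w2), rels), freq in joint.items():
        p_joint = freq / n
        scores[((w1, w2), rels)] = freq * math.log(p_joint / (p_w1[w1] * p_w2[w2] * p_rel[rels]))
    return scores

# Toy usage: ((word1, word2), (relation1, relation2)) pairs extracted from parsed sentences.
observations = [(("student", "book"), ("nsbj", "dobj"))] * 3 + [(("dog", "bone"), ("nsbj", "dobj"))]
print(lmi_scores(observations))
```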
3.3 Active Context implementation

In MEDEA, the SR is composed of two vectors:

• LM, computed as the sum of the word embeddings (as this was the best performing model in the literature on the chosen datasets);

• AC, obtained by summing up all the weighted centroids of the triggered participants. Each lexeme - syntactic role pair is used to retrieve its top 50 s-neighbors from the graph, and the top 20 re-ranked elements are used to build each weighted centroid. These thresholds were chosen empirically, after a few trials with different (i.e., higher) thresholds (as in Chersoni et al. (2017b)).

We provide an example of the re-weighting process with the property document that store maintains, whose target is inventory: i.) at first the head noun document is encountered: its vector is activated as event knowledge for the object role of the sentence and constitutes the contextual information in AC against which GEK is re-weighted; ii.) store as a subject triggers some direct object participants, such as product, range, item, technology, etc. If the centroid were built from the top of this list, the cosine similarity with the target would be around 0.62; iii.) the s-neighbours of store are re-weighted according to the fact that AC already contains some information about the target (i.e., the fact that it is a document). The re-weighting process has the effect of placing at the top of the list elements that are more similar to document: we now find collection, copy, book, item, name, trading, location, etc., improving the cosine similarity with the target, which goes up to 0.68; iv.) the same happens for maintain: its s-neighbors are retrieved and weighted against the complete AC, improving their cosine similarity with inventory from 0.55 to 0.61.

3.4 Evaluation

We evaluated our model on the RELPRON development set using Mean Average Precision (MAP), as in Rimell et al. (2016). We produced the compositional representation of each property in terms of SR, and then ranked, for each target, all the 518 properties of the dataset portion according to their similarity to the target. Our main goal was to evaluate the contribution of event knowledge, therefore the similarity between the target vector and the property SR was measured as the sum of the cosine similarity of the target vector with the LM of the property and the cosine similarity of the target vector with the AC cued by the property. As shown in Table 2, the full MEDEA model (last column) achieves top performance, above the simple additive model LM.

                 LM     AC     LM + AC
verb             0.18   0.18   0.20
arg              0.34   0.34   0.36
hn+verb          0.27   0.28   0.29
hn+arg           0.47   0.45   0.49
verb+arg         0.42   0.28   0.39
hn+verb+arg      0.51   0.47   0.55

Table 2: Results in terms of MAP on the development subset of RELPRON. Except for the case of verb+arg, the models involving event knowledge in AC always improve over the baselines (i.e., the LM models).

For the transitive sentences dataset, we evaluated the correlation of our scores with human ratings using Spearman's ρ. The similarity between a pair of sentences s_1, s_2 is defined as the cosine between their LM vectors plus the cosine between their EK vectors. MEDEA is in the last column of Table 3 and again outperforms simple addition.

                 LM      AC      LM + AC
sbj              0.432   0.475   0.482
root             0.525   0.547   0.555
obj              0.628   0.537   0.637
sbj+root         0.656   0.622   0.648
sbj+obj          0.653   0.605   0.656
root+obj         0.732   0.696   0.750
sbj+root+obj     0.732   0.686   0.750

Table 3: Results in terms of Spearman's ρ on the transitive sentences dataset. Except for the case of sbj+root, the models involving event knowledge in AC always improve over the baselines. p-values are not shown because they are all equally significant (p < 0.01).
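Both evaluations ultimately reduce to adding two cosine similarities. A minimal sketch of the two scoring functions (with hypothetical helper names, assuming the LM and AC vectors are already available as numpy arrays) is:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def target_property_score(target_vec, prop_lm, prop_ac):
    """RELPRON: similarity between a target noun and a property's SR."""
    return cosine(target_vec, prop_lm) + cosine(target_vec, prop_ac)

def sentence_pair_score(lm1, ac1, lm2, ac2):
    """Transitive sentences: cosine of the LM vectors plus cosine of the EK vectors."""
    return cosine(lm1, lm2) + cosine(ac1, ac2)
```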
4 Conclusion

We provided a basic implementation of a meaning composition model which aims at being incremental and cognitively plausible. While still relying on vector addition, our results suggest that distributional vectors do not encode sufficient information about event knowledge and that, in line with psycholinguistic results, activated GEK plays an important role in building semantic representations during online sentence processing.

Our ongoing work focuses on refining the way in which this event knowledge takes part in the processing phase and on testing its performance on more complex datasets: while both RELPRON and the transitive sentences dataset provide a straightforward mapping between syntactic labels and semantic roles, more naturalistic datasets show a much wider range of syntactic phenomena that would allow us to test how expectations jointly operate on syntactic structure and semantic roles.

References

Nicholas Asher, Tim Van de Cruys, Antoine Bride, and Márta Abrusán. 2016. Integrating Type Theory and Distributional Semantics: A Case Study on Adjective–Noun Compositions. Computational Linguistics, 42(4):703–725.

Marco Baroni, Raffaella Bernardi, and Roberto Zamparelli. 2014. Frege in Space: A Program of Compositional Distributional Semantics. Linguistic Issues in Language Technology, 9(6):5–110.

William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556. Association for Computational Linguistics.

Emmanuele Chersoni, Alessandro Lenci, and Philippe Blache. 2017a. Logical metonymy in a distributional model of sentence comprehension. In Proceedings of the Sixth Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 168–177.

Emmanuele Chersoni, Enrico Santus, Philippe Blache, and Alessandro Lenci. 2017b. Is structure necessary for modeling argument expectations in distributional semantics? In Proceedings of the 12th International Conference on Computational Semantics (IWCS 2017).

Bob Coecke, Stephen Clark, and Mehrnoosh Sadrzadeh. 2010. Mathematical foundations for a compositional distributional model of meaning. Technical report.

Jeffrey L. Elman. 2011. Lexical knowledge without a lexicon? The Mental Lexicon, 6(1):1–33.

Katrin Erk and Sebastian Padó. 2008. A structured vector space model for word meaning in context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 897–906. Association for Computational Linguistics.

Peter Hagoort and Jos van Berkum. 2007. Beyond the sentence given. Philosophical Transactions of the Royal Society B: Biological Sciences, 362(1481):801–811.

Mary Hare, Michael Jones, Caroline Thomson, Sarah Kelly, and Ken McRae. 2009. Activating event knowledge. Cognition, 111(2):151–167.

Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2014. A study of entanglement in a categorical framework of natural language. In Proceedings of the 11th Workshop on Quantum Physics and Logic (QPL). Kyoto, Japan.

Gina R. Kuperberg and T. Florian Jaeger. 2016. What do we mean by prediction in language comprehension? Language, Cognition and Neuroscience, 31(1):32–59.

Alessandro Lenci. 2018. Distributional Models of Word Meaning. Annual Review of Linguistics, 4:151–171.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Ken McRae and Kazunaga Matsuki. 2009. People use their knowledge of common events to understand language, and do so as quickly as possible. Language and Linguistics Compass, 3(6):1417–1429.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition.

Laura Rimell, Jean Maillard, Tamara Polajnar, and Stephen Clark. 2016. RELPRON: A relative clause evaluation data set for compositional distributional semantics. Computational Linguistics, 42(4):661–701.

Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop, volume 2010, pages 1–9.