MEDEA: Merging Event knowledge and Distributional vEctor Addition

Ludovica Pannitto, CoLing Lab, University of Pisa, ellepannitto@gmail.com
Alessandro Lenci, CoLing Lab, University of Pisa, alessandro.lenci@unipi.it

Abstract

English. The great majority of compositional models in distributional semantics present methods to compose distributional vectors or tensors into a representation of the sentence. Here we propose to enrich the best performing method (vector addition, which we take as a baseline) with distributional knowledge about events, outperforming our baseline.

Italiano. Most of the models proposed in compositional distributional semantics rely only on lexical vectors. We propose to enrich the best model in the literature (vector addition, which we take as a baseline) with distributional information about the events elicited by the sentence, systematically improving on the baseline results.

1 Compositional Distributional Semantics: Beyond vector addition

Composing word representations into larger phrases and sentences notoriously represents a big challenge for distributional semantics (Lenci, 2018). Various approaches have been proposed, ranging from simple arithmetic operations on word vectors (Mitchell and Lapata, 2008), to algebraic compositional functions on higher-order objects (Baroni et al., 2014; Coecke et al., 2010), as well as neural network approaches (Socher et al., 2010; Mikolov et al., 2013).

Among all proposed compositional functions, vector addition still shows the best performance on various tasks (Asher et al., 2016; Blacoe and Lapata, 2012; Rimell et al., 2016), beating more complex methods such as the Lexical Functional Model (Baroni et al., 2014). However, the success of vector addition is quite puzzling from the linguistic and cognitive point of view: the meaning of a complex expression is not simply the sum of the meanings of its parts, and the contribution of a lexical item might differ depending on its syntactic as well as pragmatic context.

The majority of available models in the literature assumes the meaning of complex expressions like sentences to be a vector (i.e., an embedding) projected from the vectors representing the content of their lexical parts. However, as pointed out by Erk and Padó (2008), while vectors serve well the purpose of capturing semantic relatedness among lexemes, they might not be the best choice for more complex linguistic expressions, because of the limited and fixed amount of information that can be encoded. Moreover, events and situations, expressed through sentences, are by definition inherently complex and structured semantic objects. Indeed, the equation "meaning is a vector" is arguably too limited even at the lexical level.

Psycholinguistic evidence shows that lexical items activate a great amount of generalized event knowledge (GEK) (Elman, 2011; Hagoort and van Berkum, 2007; Hare et al., 2009), and that this knowledge is crucially exploited during online language processing, constraining the speakers' expectations about upcoming linguistic input (McRae and Matsuki, 2009). GEK is tied to the idea that the lexicon is not organized as a dictionary, but rather as a network, in which words trigger expectations about the upcoming input, influenced by pragmatic as well as lexical knowledge. Sentence comprehension can therefore be phrased as the identification of the event that best explains the linguistic cues in the input (Kuperberg and Jaeger, 2016).
In this paper, we introduce MEDEA, a compositional distributional model of sentence meaning which integrates vector addition with the GEK activated by lexical items. MEDEA is directly inspired by the model in Chersoni et al. (2017a) and relies on two major assumptions:

• lexical items are represented with embeddings within a network of syntagmatic relations encoding prototypical knowledge about events;

• the semantic representation of a sentence is a structured object that incrementally integrates the semantic information cued by lexical items.

We test MEDEA on two datasets for compositional distributional semantics in which addition has proven to be very hard to beat. At least, before meeting MEDEA.

2 Introducing MEDEA

MEDEA consists of two main components: i.) a Distributional Event Graph (DEG) that models a fragment of semantic memory activated by lexical units (Section 2.1); ii.) a Meaning Composition Function that dynamically integrates information activated from DEG to build a sentence semantic representation (Section 2.2).

2.1 Distributional Event Graph

We assume a broad notion of event, corresponding to any configuration of entities, actions, properties, and relationships. Accordingly, an event can be a complex relationship between entities, such as the one expressed by the sentence The student read a book, but also the association between an individual and a property, as expressed by the noun phrase heavy book.

In order to represent the GEK cued by lexical items during sentence comprehension, we explored a graph-based implementation of a distributional model, for both theoretical and methodological reasons: in graphs, structural-syntactic information and lexical information can naturally coexist and be related; moreover, vectorial distributional models often struggle with the modeling of dynamic phenomena, as it is often difficult to update the recorded information, while graphs are more suitable for situations where relations among items change over time. The data structure would ideally keep track of each event automatically retrieved from corpora, thus indirectly containing information about schematic or underspecified events, obtained by abstracting over one or more participants of each recorded instance. Events are cued by all their potential participants. The nodes of DEG are lexical embeddings, and edges link lexical items participating in the same events (i.e., their syntagmatic neighbors). Edges are weighted with respect to the statistical salience of the event given the item. Weights, expressed in terms of a statistical association measure such as Local Mutual Information, determine the strength with which an event is activated by linguistic cues.

In order to build DEG, we automatically harvested events from corpora, using syntactic relations as an approximation of the semantic roles of event participants. From a dependency-parsed sentence we identified an event by selecting a semantic head (verb or noun) and grouping all its syntactic dependents together (Figure 1). Since we expect each participant to be able to trigger the event, and consequently any of the other participants, a relation can be created and added to the graph from each subset of each group extracted from the sentence.

Figure 1: Dependency analysis for the sentence The student is reading the book about Shakespeare in the university library. Three events are identified (dotted boxes).

The resulting structure is therefore a weighted hypergraph, as it contains relations holding among groups of nodes, and a labeled multigraph, since each edge or hyperedge is labeled in order to represent the syntactic pattern holding in the group.
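As a concrete illustration of the harvesting step, the following sketch (a minimal Python fragment; the parse format and helper names are our own simplification, not part of a released MEDEA implementation) groups the dependents of each semantic head into an event and records a weighted relation for every subset of the participant group, which is what makes the resulting structure a weighted, labeled hypergraph.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy parse of "The student is reading the book in the library",
# given as (head_lemma/POS, dependent_lemma/POS, syntactic_relation) triples.
PARSE = [
    ("read/V", "student/N", "nsbj"),
    ("read/V", "book/N", "dobj"),
    ("read/V", "library/N", "nmod"),
]

def harvest_events(parse):
    """Group all syntactic dependents of each semantic head into one event."""
    events = defaultdict(list)
    for head, dep, rel in parse:
        events[head].append((dep, rel))
    return events

def update_deg(deg, parse):
    """Add a relation for every subset of an event's participant group.

    Each hyperedge is keyed by its participants and by the syntactic pattern
    holding among them; the raw count stored here would later be rescaled
    into an LMI-style association weight (see Section 3.2).
    """
    for head, deps in harvest_events(parse).items():
        participants = [(head, "head")] + deps
        for size in range(2, len(participants) + 1):
            for group in combinations(participants, size):
                words = frozenset(w for w, _ in group)
                pattern = tuple(sorted(r for _, r in group))
                deg[(words, pattern)] += 1.0
    return deg

deg = update_deg(defaultdict(float), PARSE)
for (words, pattern), count in sorted(deg.items(), key=str):
    print(sorted(words), pattern, count)
```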
As graph nodes are embeddings, given a lexical cue w, DEG can be queried in two modes:

• retrieving the most similar nodes to w (i.e., its paradigmatic neighbors), using a standard vector similarity measure like the cosine (Table 1, top row);

• retrieving the closest associates of w (i.e., its syntagmatic neighbors), using the weights on the graph edges (Table 1, bottom row).

para. neighbors: essay/N, anthology/N, novel/N, author/N, publish/N, biography/N, autobiography/N, nonfiction/N, story/N, novella/N
synt. neighbors: publish/V, write/V, read/V, include/V, child/N, series/N, have/V, buy/V, author/N, contain/V

Table 1: The 10 nearest paradigmatic (top) and syntagmatic (bottom) neighbours of book/N, extracted from DEG. By further restricting the query on the graph neighbors, we can obtain for instance the typical subjects of book as a direct object (people/N, child/N, student/N, etc.).
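The two query modes can be emulated in a few lines of code. The sketch below is only meant to contrast ranking by vector similarity (paradigmatic neighbors) with ranking by edge weight (syntagmatic neighbors); the embedding dictionary and the toy DEG entry are hypothetical, and the DEG format is the one used in the previous fragment.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def paradigmatic_neighbors(word, embeddings, k=10):
    """Most similar nodes to `word` in the embedding space (cosine)."""
    target = embeddings[word]
    scored = [(other, cosine(target, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

def syntagmatic_neighbors(word, deg, k=10):
    """Closest associates of `word`, ranked by the weights on the graph edges."""
    scored = {}
    for (words, _pattern), weight in deg.items():
        if word in words:
            for other in words - {word}:
                scored[other] = max(scored.get(other, 0.0), weight)
    return sorted(scored.items(), key=lambda x: x[1], reverse=True)[:k]

# Toy usage with made-up data (real DEG weights would be LMI scores).
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=100) for w in ["book/N", "novel/N", "read/V"]}
deg = {(frozenset({"book/N", "read/V"}), ("dobj", "head")): 8.5}
print(paradigmatic_neighbors("book/N", embeddings, k=2))
print(syntagmatic_neighbors("book/N", deg, k=2))
```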
2.2 Meaning Composition Function

In MEDEA, we model sentence comprehension as the creation of a semantic representation SR, which includes two different yet interacting information tiers that are equally relevant in the overall representation of sentence meaning: i.) the lexical meaning component (LM), a context-independent tier of sentence meaning that accumulates the lexical content of the sentence, as traditional models do; ii.) an active context (AC), which aims at representing the most probable event, in terms of its participants, that can be reconstructed from the DEG portions cued by lexical items. This latter component corresponds to the GEK activated by the single lexemes (or by other contextual elements) and integrated into a semantically coherent structure representing the sentence interpretation. It is incrementally updated during processing, as new input is integrated into the existing information.

2.2.1 Active Context

Each lexical item in the input activates a portion of GEK that is integrated into the current AC through a process of mutual re-weighting that aims at maximizing the overall semantic coherence of the SR. At the outset, no information is contained in the AC of the sentence. When a new lexeme - syntactic role pair ⟨w_i, r_i⟩ (e.g., student - nsbj) is encountered, expectations about the set of upcoming roles in the sentence are generated from DEG (Figure 2). These include: i.) expectations about the role filled by the lexeme itself, which consist of its vector (and possibly its p-neighbours); ii.) expectations about the sentence structure and the other participants, which are collected in weighted lists of vectors of its s-neighbours.

These expectations are then weighted with respect to what is already in the AC, and the AC is similarly adapted to the newly retrieved information: each weighted list is represented by the weighted centroid of its top elements, and each element of a weighted list is re-ranked according to its cosine similarity with the corresponding centroid (e.g., the newly retrieved weighted list of subjects is ranked according to the cosine similarity of each item in the list with the weighted centroid of the subjects already available in AC).

The final semantic representation of a sentence consists of two vectors: the lexical meaning vector (LM) and the event knowledge vector (AC), which is obtained by composing the weighted centroids of each role in AC.

Figure 2: Internal architecture of a piece of EK retrieved from DEG. The interface with DEG is shown on the left side of the picture; each internal list of neighbors is labeled with its expected syntactic role in the sentence. All the items are intended to be embeddings.
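A minimal sketch of the AC update is given below. It assumes that the expectations retrieved from DEG arrive as weighted lists of (vector, association weight) pairs, one per expected syntactic role; the rule used to merge a new centroid with the one already stored for a role (simple averaging) is a simplification of ours, standing in for the mutual re-weighting step, and all helper names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def weighted_centroid(scored):
    """Weighted average of a list of (vector, weight) pairs."""
    vectors = np.array([v for v, _ in scored])
    weights = np.array([w for _, w in scored])
    return (vectors * weights[:, None]).sum(axis=0) / weights.sum()

def update_active_context(ac, expectations, top_k=20):
    """Integrate newly activated expectations into the active context (AC).

    `ac` maps each syntactic role to its current weighted centroid;
    `expectations` maps a role to the weighted list of (vector, weight)
    pairs just retrieved from DEG. Each new list is re-ranked by cosine
    similarity with the centroid already stored for that role, truncated,
    and summarized by its own weighted centroid.
    """
    for role, scored in expectations.items():
        if role in ac:
            scored = sorted(scored, key=lambda vw: cosine(vw[0], ac[role]),
                            reverse=True)
        new_centroid = weighted_centroid(scored[:top_k])
        # Averaging old and new centroids is a placeholder integration rule.
        ac[role] = new_centroid if role not in ac else (ac[role] + new_centroid) / 2.0
    return ac

def ac_vector(ac):
    """Event knowledge vector: composition (here, a sum) of the role centroids."""
    return np.sum(list(ac.values()), axis=0)

def lm_vector(word_vectors):
    """Lexical meaning vector: plain sum of the word embeddings."""
    return np.sum(word_vectors, axis=0)
```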
3 Experiments

3.1 Datasets

We wanted to evaluate the contribution of activated event knowledge in a sentence comprehension task. For this reason, among the many existing datasets concerning entailment or paraphrase detection, we chose RELPRON (Rimell et al., 2016), a dataset of subject and object relative clauses, and the transitive sentence similarity dataset presented in Kartsaklis and Sadrzadeh (2014). These two datasets show an intermediate level of grammatical complexity, as they involve complete sentences (while other datasets include smaller phrases), but have fixed-length structures featuring similar syntactic constructions (i.e., transitive sentences). The two datasets differ with respect to size and construction method.

RELPRON consists of 1,087 pairs, split into a development and a test set, each made up of a target noun labeled with a syntactic role (either subject or direct object) and a property expressed as [head noun] that [verb] [argument]. For instance, here are some example properties for the target noun treaty:

(1) a. OBJ treaty/N: document/N that delegation/N negotiate/V
    b. SBJ treaty/N: document/N that grant/V independence/N

The transitive sentence similarity dataset consists of 108 pairs of transitive sentences, each annotated with human similarity judgments collected through the Amazon Mechanical Turk platform. Each transitive sentence is composed of a subject verb object triplet. Here are two pairs with high (2) and low (3) similarity scores respectively:

(2) a. government use power
    b. authority exercise influence

(3) a. team win match
    b. design reduce amount

3.2 Graph implementation

We tailored the construction of the DEG to this kind of simple syntactic structure, restricting it to relations among pairs of event participants. Relations were automatically extracted from a 2018 dump of Wikipedia, the BNC, and the ukWaC corpora, parsed with the Stanford CoreNLP pipeline (Manning et al., 2014). Each ⟨(word_1, word_2), (r_1, r_2)⟩ pair was then weighted with a smoothed version of Local Mutual Information [1]:

LMI_\alpha(w_1, w_2, r_1, r_2) = f(w_1, w_2, r_1, r_2) \log \frac{\hat{P}(w_1, w_2, r_1, r_2)}{\hat{P}(w_1)\,\hat{P}_\alpha(w_2)\,\hat{P}(r_1, r_2)}    (1)

where:

\hat{P}_\alpha(x) = \frac{f(x)^\alpha}{\sum_x f(x)^\alpha}    (2)

Each lexical node in DEG was then represented with its embedding. We used the same training parameters as in Rimell et al. (2016) [2], since we wanted our model to be directly comparable with their results on the dataset. While Rimell et al. (2016) built the vectors from a 2015 download of Wikipedia, we needed to cover all the lexemes contained in the graph and therefore used the same corpora from which the DEG was extracted.

We represented each property in RELPRON as a triplet ((hn, r), (w_1, r_1), (w_2, r_2)), where hn is the head noun, w_1 and w_2 are the lexemes that compose the relative clause proper, and each element of the triplet is associated with its syntactic role in the property sentence [3]. Likewise, each sentence of the transitive sentences dataset is a triplet ((w_1, nsbj), (w_2, root), (w_3, dobj)).

[1] The smoothed version (with α = 0.75) was chosen in order to alleviate PMI's bias towards rare words (Levy et al., 2015), which arises especially when extending the graph to more complex structures than pairs.
[2] Lemmatized 100-dimensional vectors trained with skip-gram with negative sampling (SGNS; Mikolov et al., 2013), setting the minimum item frequency at 100 and the context window size at 10.
[3] The relation for the head noun is assumed to be the same as the target relation (either subject or direct object of the relative clause).
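Equations (1) and (2) can be computed directly from co-occurrence counts. The sketch below does so for the pair-based relations used here; the input format (a list of ((w1, w2), (r1, r2)) tuples) and the function names are our own simplification.

```python
import math
from collections import Counter

def smoothed_prob(counts, alpha=0.75):
    """Equation (2): P_alpha(x) = f(x)^alpha / sum_x f(x)^alpha."""
    total = sum(c ** alpha for c in counts.values())
    return {x: (c ** alpha) / total for x, c in counts.items()}

def lmi_scores(observations, alpha=0.75):
    """Equation (1): smoothed LMI for every observed ((w1, w2), (r1, r2)) relation."""
    joint = Counter(observations)
    n = sum(joint.values())
    w1_counts = Counter(w1 for (w1, _), _ in observations)
    w2_counts = Counter(w2 for (_, w2), _ in observations)
    rel_counts = Counter(rels for _, rels in observations)
    p_w1 = {w: c / n for w, c in w1_counts.items()}
    p_w2 = smoothed_prob(w2_counts, alpha)  # only the second word is smoothed, as in Eq. (1)
    p_rel = {r: c / n for r, c in rel_counts.items()}
    scores = {}
    for ((w1, w2), rels), freq in joint.items():
        p_joint = freq / n
        scores[((w1, w2), rels)] = freq * math.log(p_joint / (p_w1[w1] * p_w2[w2] * p_rel[rels]))
    return scores

# Toy usage: ((word1, word2), (relation1, relation2)) pairs extracted from parsed sentences.
observations = [(("student", "book"), ("nsbj", "dobj"))] * 3 + [(("dog", "bone"), ("nsbj", "dobj"))]
print(lmi_scores(observations))
```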
3.3 Active Context implementation

In MEDEA, the SR is composed of two vectors:

• LM, computed as the sum of the word embeddings (as this was the best performing model in the literature on the chosen datasets);

• AC, obtained by summing up all the weighted centroids of the triggered participants. Each lexeme - syntactic role pair is used to retrieve its top 50 s-neighbors from the graph, and the top 20 re-ranked elements are used to build each weighted centroid. These thresholds were chosen empirically, after a few trials with different (i.e., higher) thresholds (as in Chersoni et al. (2017b)).

We provide an example of the re-weighting process with the property document that store maintains, whose target is inventory: i.) at first the head noun document is encountered: its vector is activated as event knowledge for the object role of the sentence and constitutes the contextual information in AC against which GEK is re-weighted; ii.) store as a subject triggers some direct object participants, such as product, range, item, technology, etc. If the centroid were built from the top of this list, the cosine similarity with the target would be around 0.62; iii.) the s-neighbours of store are re-weighted according to the fact that AC already contains some information about the target (i.e., the fact that it is a document). The re-weighting process has the effect of placing at the top of the list elements that are more similar to document: we now find collection, copy, book, item, name, trading, location, etc., improving the cosine similarity with the target, which goes up to 0.68; iv.) the same happens for maintain: its s-neighbors are retrieved and weighted against the complete AC, improving their cosine similarity with inventory from 0.55 to 0.61.

3.4 Evaluation

We evaluated our model on the RELPRON development set using Mean Average Precision (MAP), as in Rimell et al. (2016). We produced the compositional representation of each property in terms of SR, and then ranked, for each target, all the 518 properties of the dataset portion according to their similarity to the target. Our main goal was to evaluate the contribution of event knowledge, therefore the similarity between the target vector and the property SR was measured as the sum of the cosine similarity of the target vector with the LM of the property and the cosine similarity of the target vector with the AC cued by the property. As shown in Table 2, the full MEDEA model (last column) achieves top performance, above the simple additive model LM.

                 LM     AC     LM + AC
verb             0.18   0.18   0.20
arg              0.34   0.34   0.36
hn+verb          0.27   0.28   0.29
hn+arg           0.47   0.45   0.49
verb+arg         0.42   0.28   0.39
hn+verb+arg      0.51   0.47   0.55

Table 2: Results in terms of MAP on the development subset of RELPRON. Except for the case of verb+arg, the models involving event knowledge in AC always improve over the baselines (i.e., the LM models).

For the transitive sentences dataset, we evaluated the correlation of our scores with human ratings using Spearman's ρ. The similarity between a pair of sentences s_1, s_2 is defined as the cosine between their LM vectors plus the cosine between their EK vectors. MEDEA is in the last column of Table 3 and again outperforms simple addition.

                 LM      AC      LM + AC
sbj              0.432   0.475   0.482
root             0.525   0.547   0.555
obj              0.628   0.537   0.637
sbj+root         0.656   0.622   0.648
sbj+obj          0.653   0.605   0.656
root+obj         0.732   0.696   0.750
sbj+root+obj     0.732   0.686   0.750

Table 3: Results in terms of Spearman's ρ on the transitive sentences dataset. Except for the case of sbj+root, the models involving event knowledge in AC always improve over the baselines. p-values are not shown because they are all equally significant (p < 0.01).
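Both evaluations ultimately reduce to adding two cosine similarities. A minimal sketch of the two scoring functions (with hypothetical helper names, assuming the LM and AC vectors are already available as numpy arrays) is:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def target_property_score(target_vec, prop_lm, prop_ac):
    """RELPRON: similarity between a target noun and a property's SR."""
    return cosine(target_vec, prop_lm) + cosine(target_vec, prop_ac)

def sentence_pair_score(lm1, ac1, lm2, ac2):
    """Transitive sentences: cosine of the LM vectors plus cosine of the EK vectors."""
    return cosine(lm1, lm2) + cosine(ac1, ac2)
```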
4 Conclusion

We provided a basic implementation of a meaning composition model which aims at being incremental and cognitively plausible. While still relying on vector addition, our results suggest that distributional vectors do not encode sufficient information about event knowledge and that, in line with psycholinguistic results, activated GEK plays an important role in building semantic representations during online sentence processing.

Our ongoing work focuses on refining the way in which this event knowledge takes part in the processing phase and on testing its performance on more complex datasets: while both RELPRON and the transitive sentences dataset provide a straightforward mapping between syntactic labels and semantic roles, more naturalistic datasets show a much wider range of syntactic phenomena that would allow us to test how expectations jointly operate on syntactic structure and semantic roles.

References

Nicholas Asher, Tim Van de Cruys, Antoine Bride, and Márta Abrusán. 2016. Integrating Type Theory and Distributional Semantics: A Case Study on Adjective–Noun Compositions. Computational Linguistics, 42(4):703–725.

Marco Baroni, Raffaella Bernardi, and Roberto Zamparelli. 2014. Frege in Space: A Program of Compositional Distributional Semantics. Linguistic Issues in Language Technology, 9(6):5–110.

William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556. Association for Computational Linguistics.

Emmanuele Chersoni, Alessandro Lenci, and Philippe Blache. 2017a. Logical metonymy in a distributional model of sentence comprehension. In Proceedings of the Sixth Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 168–177.

Emmanuele Chersoni, Enrico Santus, Philippe Blache, and Alessandro Lenci. 2017b. Is structure necessary for modeling argument expectations in distributional semantics? In Proceedings of the 12th International Conference on Computational Semantics (IWCS 2017).

Bob Coecke, Stephen Clark, and Mehrnoosh Sadrzadeh. 2010. Mathematical foundations for a compositional distributional model of meaning. Technical report.

Jeffrey L. Elman. 2011. Lexical knowledge without a lexicon? The Mental Lexicon, 6(1):1–33.

Katrin Erk and Sebastian Padó. 2008. A structured vector space model for word meaning in context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 897–906. Association for Computational Linguistics.

Peter Hagoort and Jos van Berkum. 2007. Beyond the sentence given. Philosophical Transactions of the Royal Society B: Biological Sciences, 362(1481):801–811.

Mary Hare, Michael Jones, Caroline Thomson, Sarah Kelly, and Ken McRae. 2009. Activating event knowledge. Cognition, 111(2):151–167.

Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2014. A study of entanglement in a categorical framework of natural language. In Proceedings of the 11th Workshop on Quantum Physics and Logic (QPL). Kyoto, Japan.

Gina R. Kuperberg and T. Florian Jaeger. 2016. What do we mean by prediction in language comprehension? Language, Cognition and Neuroscience, 31(1):32–59.

Alessandro Lenci. 2018. Distributional Models of Word Meaning. Annual Review of Linguistics, 4:151–171.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Ken McRae and Kazunaga Matsuki. 2009. People use their knowledge of common events to understand language, and do so as quickly as possible. Language and Linguistics Compass, 3(6):1417–1429.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition.

Laura Rimell, Jean Maillard, Tamara Polajnar, and Stephen Clark. 2016. RELPRON: A relative clause evaluation data set for compositional distributional semantics. Computational Linguistics, 42(4):661–701.

Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop, volume 2010, pages 1–9.