Grounding the Lexical Sets of Causative-Inchoative Verbs with Word Embedding

Edoardo Maria Ponti, University of Cambridge, ep490@cam.ac.uk
Elisabetta Jezek, Università degli Studi di Pavia, jezek@unipv.it
Bernardo Magnini, Fondazione Bruno Kessler, magnini@fbk.eu

Abstract

English. Lexical sets contain the words filling the argument positions of a verb in one of its senses. They can be extracted from corpora automatically. The purpose of this paper is to demonstrate that their vector representation based on word embedding provides insights into linguistic phenomena, such as causative-inchoative verbs. A first experiment investigates the internal structure of the sets, which are known to be radial and continuous categories cognitively. A second experiment shows that the distance between the intransitive subject set and the transitive object set is correlated with the spontaneity of the event expressed by the verb, defined according to morphological coding and frequency.

Italiano. I set lessicali contengono le parole che occupano le posizioni argomentali di un verbo in una delle sue accezioni, e possono essere estratti in modo automatico dai corpora. L’obiettivo di questo articolo è dimostrare che la loro rappresentazione vettoriale illumina alcuni fenomeni linguistici, come i verbi ad alternanza causativo-incoativa. Un esperimento investiga la struttura interna degli insiemi, che a livello cognitivo sono ritenuti categorie radiali e continue. Inoltre, un secondo esperimento mostra che la distanza fra l’insieme dei soggetti intransitivi e l’insieme degli oggetti transitivi è correlata alla spontaneità dell’evento espresso dal verbo, definita secondo la marca morfologica e la frequenza.

1 Introduction

Lexicographic attempts to cope with verb sense disambiguation often rely on “lexical sets” (Hanks, 1996), which represent the lists of corpus-derived words that appear as arguments for each distinct verb sense. The arguments are the “slots” that have to be filled to satisfy the valency of a verb (subject, object, etc.). For example, {gun, bullet, shot, projectile, rifle...} is the lexical set of the object for the sense ‘to shoot’ of to fire. In previous works, e.g. Montemagni et al. (1995), lexical sets were collected manually and were compared through set analysis. The measure of similarity between two sets was proportional to the extent of their intersection. We believe that possible improvements may stem from deriving the lexical sets automatically and from fully exploiting the semantic information of the fillers. In this work, we devise an extraction method from a huge corpus and use a distributional semantics approach to perform our analyses. More specifically, we represent fillers as word vectors and compare them with spatial distance measures. In order to test the relevance of this approach for linguistic theory, we focus on a case study, namely the properties of verbs undergoing the causative-inchoative alternation. Section 1.1 outlines a framework for word embeddings and Section 1.2 introduces the case study. Section 2 reviews previous work, Section 3 presents the method and the data, and Section 4 reports the results of two experiments.

1.1 Word Embedding

The full exploitation of the semantic information inherent to argument fillers for verbs can take advantage of some recent developments in distributional semantics. Recently, efficient algorithms have been devised that map each word of a vocabulary into a corresponding vector of n real numbers, which can be thought of as a sequence of coordinates in an n-dimensional space (Mikolov et al., 2013). This mapping is yielded by unsupervised machine learning, based on the assumption that the meaning of a word can be inferred from its context, i.e. its neighbouring words in texts. This model has some relevant properties: the geometric closeness of two vectors corresponds to the similarity in meaning of the corresponding words. Moreover, its dimensions possibly have a semantic interpretation.

1.2 Causative-Inchoative Alternation

A possible testbed for the usefulness of representing the argument fillers as vectors is the class of verbs showing the so-called causative-inchoative alternation. These verbs appear either as transitive or intransitive. In the first case, an agent brings about a change of state; in the second, the change of a patient is presented as spontaneous (e.g. to break, as in “Mary broke the key” vs. “the key broke”).

The two alternative forms of these verbs can be morphologically asymmetrical: in this case, one has a derivative affix and the other does not. The first is labelled here as “marked”, the second as “basic”. Italian verbs with an asymmetrical alternation derive from the phenomenon of anticausativization. The intransitive form is marked since it is sometimes preceded by the clitic si (Cennamo and Jezek, 2011). Haspelmath (1993) maintains that verbs that show a preference for a marked causative form (and a basic inchoative form) cross-linguistically denote a more “spontaneous” situation. Spontaneity is understood by the author as the likelihood of the occurrence of the event without the intervention of an agent. This work is non-committal with respect to whether spontaneity is an actual semantic factor. Rather, it is considered a notion useful for labelling the observed variations in morphology and frequency. In this way, a correlation between the form and the meaning of these verbs was demonstrated.

Moreover, Samardzic and Merlo (2012) and Haspelmath et al. (2014) argue that verbs that appear more frequently (intra- and cross-linguistically) in the inchoative form also tend to morphologically derive the causative form. This time, the correlation holds between form and frequency. Vice versa, situations entailing agentive participation prefer to mark the inchoative form and occur more frequently in the causative form.

2 Previous Work

In the literature, many methods are available for the automatic detection of verb classes, such as causative-inchoative verbs. They exploit features based on argument alternations, such as subcategorization frames (Joanis et al., 2008). The identification of verb classes displaying a diathesis alternation was also performed through the analysis of selectional preferences. Most notably, the lexical items were compared via distributional semantics (McCarthy, 2000).

These features were usually induced from automatic parses of large and heterogeneous corpora (Schulte Im Walde, 2000). In particular, the extraction of subcategorization frames was refined by including, e.g., noise filters based on frequency (Korhonen et al., 2000). Our work is inspired by these attempts to automatically induce lexical information regarding verbs, but its direction of research is reversed. Indeed, rather than identifying verb classes given this information, it analyses this information given a verb class in order to shed light on its properties from the perspective of linguistic theory.

3 Data and Method

The data are sourced from a sample of ItWac, a large Italian corpus gathered through web crawling (Baroni et al., 2009). This sample was further enriched with morpho-syntactic information through the MATE-tools parser (Bohnet, 2010)¹ and filtered by sentence length (< 100). In the end, the sample amounted to 2,029,454 sentences. A target group of 20 causative-inchoative verbs was taken from Haspelmath et al. (2014): they are listed here based on the reported transitive/intransitive frequency ratio, from the highest to the lowest.

close > open > improve > break > fill > gather > connect > split > stop > go out > rise > rock > burn > freeze > turn > dry > wake > melt > boil > sink

The extraction step consisted in identifying their argument fillers inside the sentences in the sample. In particular, the arguments considered were the subjects of intransitives (S) and objects (O) (Dixon, 1994).² These arguments are relevant because they are deemed to share the same fillers (Pustejovsky, 1995).

These operations resulted in a database where each verb lemma had a single entry and was associated with a list of fillers, divided by argument type. With this procedure, lexical sets were extracted automatically, although they were not divided by verb sense. Afterwards, each of the argument fillers was mapped to a vector relying on a space model pre-trained through Word2Vec (Dinu et al., 2015).³

¹ LAS scores for the relevant dependency relations: 0.751 with dobj (direct object), 0.719 with nsubj (subject), 0.691 with nsubjpass (subject of a passive verb).
² Subjects of forms with si were treated as intransitive subjects. Subjects of passive verbs were treated as objects.
³ It was generated by a CBOW algorithm with negative sampling, 300 dimensions, a context window of 10 tokens, pruning of infrequent words and sub-sampling.

4 Experiments

In order to bring to light the linguistic information concealed in the automatically extracted lexical sets, we devised two experiments. The first investigates the internal structure of lexical sets. In fact, previous works based on set theory treated them as categorical sets, of which a filler is either a member or not. Research in psychology, however, has long since demonstrated that the members of a linguistic set are found in a radial continuum where the most central one is the prototype for its category, and those at the periphery are less representative (Rosch, 1973; Lakoff, 1987).⁴ Word vectors make it possible to capture this spatial continuum.

Once the fillers have been mapped to their respective vectors, a lexical set appears as a group of points in a multi-dimensional model. The centre of this group is the Euclidean mean of the vectors, which is a vector itself and is called the centroid. In the first experiment, we calculated the coordinates of the centroid of the lexical sets S and O for each selected verb.⁵ Then we evaluated the cosine distance of every vector in the sets from its centroid. This metric is 0 for vectors pointing in the same direction (overlap) and 1 for orthogonal vectors, and is useful to evaluate how far a filler is from its prototype. We obtained two sets of cosine distance values for each verb: these can be plotted as boxes and whiskers, as in Figure 1. The example represents those of dividere ‘to split’. The rectangles stand for the values in the second and third quartiles, whereas the horizontal line stands for the median.⁶ From all these distance values, we picked the median value for each lexical set. The plot of these medians for the S set and the O set of each verb, ordered according to Haspelmath’s ranking, is shown in Figure 2.

Figure 1: Distance of vectors from their centroid.

Figure 2: Medians of S (left) and O (right) distances for verbs ranked by position in Haspelmath’s scale.

Two main results can be observed from these plots: the S lexical set lies in a more compact range of distances, whereas O is more scattered. On the other hand, the vectors of S tend to lie farther from the centroid. This is demonstrated by the ranges where their distance values fall. Moreover, the averages of medians for the ten verbs on the left part of the scale (frequently transitive) and for the ten verbs on the right (frequently intransitive) were compared. The average median in S was 0.696567 for the former and 0.585263 for the latter. The average median in O was 0.556878 for the former and 0.522418 for the latter. This suggests that the variation in O is random. On the other hand, the median of the distances in S is normally lower for verbs that lie in the bottom half of Haspelmath’s scale.

The second experiment consisted in estimating the cosine distance between the centroid of S and the centroid of O for each verb. This operation was aimed at finding to what extent the lexical sets of S and O overlap. In fact, Montemagni et al. (1995) and McCarthy (2000) found in a corpus some asymmetries between these lexical sets, which in principle should share all their members.

Inspecting our results, the distance between S and O seems to behave as a measure of spontaneity, understood as cross-linguistic frequency and morphological markedness of a verb: the farther apart the centroids are, the more the verb tends to have a morphologically unmarked and more frequent intransitive form. In fact, we compared the ranking of the 20 alternating verbs according to the ratio of their cross-linguistic frequency of transitive and intransitive forms (Haspelmath et al., 2014) and a ranking based on the centroid distances of the same verbs. Both these rankings are plotted in Figure 3: every verb is associated with its position in the two scales.

Figure 3: Ranking based on cross-linguistic form frequencies (green triangles) against ranking based on distance of the centroids of S and O in Italian (blue squares).

Both scales display a common tendency. In particular, Spearman’s rank correlation test was performed over them, yielding a moderate positive correlation of ρ = 0.56391, which is significant at p < 0.01.⁷

⁴ For previous work on lexical sets considering prototypicality in the context of the notion of shimmering, see Jezek and Hanks (2010).
⁵ Every filler was weighted proportionally to its absolute frequency.
⁶ The median is the value separating the higher half of the ordered values from the lower half.
⁷ An alternative measure was considered for the ranking: the cardinality of the S-O intersection weighted by the set union. In this case, Spearman correlation was ρ = 0.42255, but it was not significant because p ≈ 0.06.

5 Discussion

The representation of lexical sets of Italian causative-inchoative verbs as vectors was demonstrated to provide insights into their internal structure and their relation with spontaneity defined according to morphological coding and frequency. The distances of the objects appeared to be distributed more uniformly, whereas those of the intransitive subjects were distributed more densely and farther from the centroid. This difference cannot stem from the frequency of anaphoric fillers (contrary to transitive subjects), since both these argument positions share the discursive function of introducing new referents, and are hence occupied by fully referential fillers (Du Bois, 1985).

Moreover, the medians of the distances of the subject fillers from their centroid were shown to vary. An interpretation is that they are sensitive to the frequency scale: this implies that frequently transitive (hence, non-spontaneous) verbs have semantically less homogeneous sets of referents, since they are farther from the prototype. Possibly this finding can be related to the fact that non-spontaneous verbs impose fewer selectional restrictions on subjects (McKoon and Macfarland, 2000).

The lack of a perfect correlation between these vector distance and frequency measures may be due to errors in the automatic extraction and data sparseness for the former, or to an insufficient sample of languages in the typological survey of Haspelmath et al. (2014) for the latter. A possible interpretation of the correlation is that the entities capable of bringing about a change of state and those that undergo it are indiscernible only for non-spontaneous verbs. Studies on causer entities related them not only with the feature of agentivity, but also in general with the so-called ‘teleological capability’ (Higginbotham, 1997).

Future work should also try different pre-trained vector models, in order to replicate these results. In particular, the new vector models could be optimized for similarity through semantic lexica (Faruqui et al., 2015) or based on syntactic dependencies (Ó Séaghdha, 2010). The experiments in this work may be extended to other languages, either individually or through a multi-lingual word embedding (Faruqui and Dyer, 2014).

6 Conclusion

Our work provided evidence that lexical sets of Italian causative-inchoative verbs are continuous and radial categories, whose distribution around the prototype varies to a great extent. It is sensitive to the grammatical role and sometimes to the position of the verb in the so-called spontaneity scale. Moreover, a correlation was discovered between the distance between transitive object and intransitive subject lexical sets of a given verb and its cross-linguistic tendency to appear more frequently as intransitive or as transitive. Figure 4 is a synopsis of this result in the context of the correlations established in previous works.

[Figure 4 diagram. Node labels: Spontaneous; Frequently Intransitive; Unmarked Intransitive; Distant S and O centres. Edge labels: ?; τ = 0.65; ρ = 0.56.]

Figure 4: Synopsis of correlations among features of causative-inchoative verbs. The measures are based on Kendall’s tau test (τ) and Spearman’s ranking test (ρ).

In Figure 4, solid lines stand for correlations proven based on cross-linguistic evidence (frequency-form) and evidence from the Italian language (frequency-lexical sets). The dotted line, on the other hand, suggests the existence of an underlying motivation for the correlations, which nonetheless remains unproven and undetermined in its nature. Its possible validation is left to future research.

References

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Bernd Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 89–97. Association for Computational Linguistics.

Michela Cennamo and Elisabetta Jezek. 2011. The anticausative alternation in Italian. I luoghi della traduzione, pages 809–823.

Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2015. Improving zero-shot learning by mitigating the hubness problem. Workshop contribution at ICLR 2015.

Robert M. W. Dixon. 1994. Ergativity. Cambridge University Press.

John W. Du Bois. 1985. Competing motivations. Iconicity in syntax, pages 343–365.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Patrick Hanks. 1996. Contextual dependency and lexical sets. International Journal of Corpus Linguistics, 1(1):75–98.

Martin Haspelmath, Andreea Calude, Michael Spagnol, Heiko Narrog, and Elif Bamyaci. 2014. Coding causal–noncausal verb alternations: A form–frequency correspondence explanation. Journal of Linguistics, 50(03):587–625.

Martin Haspelmath. 1993. More on the typology of inchoative/causative verb alternations. Causatives and transitivity, 23:87.

James Higginbotham. 1997. Location and causation. Ms., University of Oxford.

Elisabetta Jezek and Patrick Hanks. 2010. What lexical sets tell us about conceptual categories. Lexis, 4(7):22.

Eric Joanis, Suzanne Stevenson, and David James. 2008. A general feature space for automatic verb classification. Natural Language Engineering, 14(03):337–367.

Anna Korhonen, Genevieve Gorrell, and Diana McCarthy. 2000. Statistical filtering and subcategorization frame acquisition. In Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora, pages 199–206. Association for Computational Linguistics.

George Lakoff. 1987. Women, fire, and dangerous things: What categories reveal about the mind. University of Chicago Press.

Diana McCarthy. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference, pages 256–263.

Gail McKoon and Talke Macfarland. 2000. Externally and internally caused change of state verbs. Language, pages 833–858.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Workshop at ICLR.

Simonetta Montemagni, Nilda Ruimy, and Vito Pirrelli. 1995. Ringing things which nobody can ring. A corpus-based study of the causative-inchoative alternation in Italian. Textus, 8(2):1000–1020.

James Pustejovsky. 1995. The generative lexicon. The MIT Press.

Eleanor H. Rosch. 1973. Natural categories. Cognitive Psychology, 4(3):328–350.

Tanja Samardzic and Paola Merlo. 2012. The meaning of lexical causatives in cross-linguistic variation. Linguistic Issues in Language Technology, 7(12):1–14.

Sabine Schulte Im Walde. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of the 18th conference on Computational linguistics-Volume 2, pages 747–753.

Diarmuid Ó Séaghdha. 2010. Latent variable models of selectional preference. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 435–444.
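
As an illustration of the measures used in the two experiments of Section 4, the following is a minimal sketch in plain Python, not the authors' implementation. It computes a frequency-weighted centroid, the median cosine distance of fillers from that centroid (first experiment), the cosine distance between the S and O centroids (second experiment), and Spearman's ρ between two rankings. All function names and the toy two-dimensional vectors are hypothetical. Applying the frequency weights only to the centroid (and not to the median) is an assumption, and the ρ formula used here assumes rankings without ties.

```python
import math
from statistics import median


def weighted_centroid(vectors, weights):
    """Weighted mean of equal-length vectors (weights = filler frequencies)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


def cosine_distance(u, v):
    """1 - cos(u, v): 0 for same direction, 1 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def median_distance_to_centroid(vectors, weights):
    """First experiment: median cosine distance of fillers from their centroid."""
    centroid = weighted_centroid(vectors, weights)
    return median(cosine_distance(v, centroid) for v in vectors)


def centroid_distance(s_vectors, s_weights, o_vectors, o_weights):
    """Second experiment: cosine distance between the S and O centroids."""
    return cosine_distance(
        weighted_centroid(s_vectors, s_weights),
        weighted_centroid(o_vectors, o_weights),
    )


def spearman_rho(xs, ys):
    """Spearman's rho via the rank-difference formula (assumes no ties)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            result[i] = rank
        return result

    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Toy data, invented for illustration only (real vectors have 300 dimensions).
s_vecs = [[1.0, 0.0], [0.0, 1.0]]   # fillers of the intransitive subject set
s_freq = [1, 1]                     # absolute frequencies of the S fillers
o_vecs = [[1.0, 0.0], [1.0, 0.0]]   # fillers of the transitive object set
o_freq = [3, 1]                     # absolute frequencies of the O fillers

print(median_distance_to_centroid(s_vecs, s_freq))
print(median_distance_to_centroid(o_vecs, o_freq))
print(centroid_distance(s_vecs, s_freq, o_vecs, o_freq))
print(spearman_rho([1, 2, 3, 4], [1, 3, 2, 4]))
```

In a full replication, the per-verb centroid distances would be turned into a ranking and compared with the frequency-based ranking of the same 20 verbs. A statistics library would normally be preferred for the correlation, since it also handles ties and reports a p-value.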