=Paper=
{{Paper
|id=Vol-1347/paper33
|storemode=property
|title=Mapping the constructicon with SYMPAThy. Italian word combinations between fixedness and productivity
|pdfUrl=https://ceur-ws.org/Vol-1347/paper33.pdf
|volume=Vol-1347
|dblpUrl=https://dblp.org/rec/conf/networds/LenciLSCMN15
}}
==Mapping the constructicon with SYMPAThy. Italian word combinations between fixedness and productivity==
Alessandro Lenci, Gianluca E. Lebani, Marco S. G. Senaldi (University of Pisa): alessandro.lenci@ling.unipi.it, gianluca.lebani@for.unipi.it, marco.senaldi@sns.it

Sara Castagnoli, Francesca Masini (University of Bologna): s.castagnoli@unibo.it, francesca.masini@unibo.it

Malvina Nissim (University of Groningen): m.nissim@rug.nl

Copyright © by the paper’s authors. Copying permitted for private and academic purposes. In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final Conference, Pisa, March 30-April 1, 2015, published at http://ceur-ws.org

===Abstract===
This work introduces SYMPAThy, a data representation model in which the combinatorial properties of a lexical item are described by merging surface and deeper linguistic information. The proposed approach is then evaluated by comparing, for a sample list of verbal idioms, a set of SYMPAThy-based fixedness indexes against the relevant speaker-elicited indexes available in the descriptive norms collected by Tabossi et al. (2011).

===1 Word combinatorics and constructions===
By “Word Combinations” (WoCs) we broadly refer to the range of constructions typically associated with a lexical item. In Construction Grammar, constructions (Cxn) are conventionalized form-meaning pairings that can vary in both complexity and schematicity (Fillmore et al., 1988; Goldberg, 2006; Hoffmann and Trousdale, 2013). The Constructicon spans from fully specified structures (kick the bucket) to complex, productive abstract structures such as argument patterns (e.g., the Ditransitive Cxn “Subj V Obj1 Obj2”, she baked him a cake), passing through “intermediate” Cxns with different degrees of schematicity, complexity and productivity (e.g., take Obj for granted), in what is known as the lexicon-syntax continuum. WoCs thus comprise so-called Multiword Expressions (MWEs), i.e. a variety of recurrent expressions acting as a single unit at some level of linguistic analysis, like phrasal lexemes, idioms, collocations (Calzolari et al., 2002; Sag et al., 2002; Gries, 2008), as well as the preferred distributional properties of a word at a more abstract level, i.e. argument structures and selectional preferences (Goldberg, 1995).

Each lexeme can thus be described as having a combinatory potential to be defined and observed at a more constrained, surface POS-pattern level (P-level) and at the more abstract level of syntactic structure (S-level). These two levels are often kept separate, not only theoretically, but also computationally, as their performance varies according to the different types of combinations that we want to track (Sag et al., 2002; Evert and Krenn, 2005).

We advocate a unified and integrated view of a lexeme’s combinatory potential, in order to capture both fixed combinations (MWEs of various types) and more productive aspects of the lexeme’s distributional behaviour. The theoretical premises lie in the constructionist view of the mental lexicon outlined above, whereas a proposal for a computational implementation is illustrated here. Specifically, we i) present SYMPAThy, a model of data representation that takes into account both surface and deeper linguistic information; ii) develop and test an index of productivity for Italian WoCs based on SYMPAThy.

===2 SYMPAThy: a joint approach to WoCs===
We argue that to obtain a comprehensive picture of the combinatory potential of a word and enhance extraction efficacy for WoCs, the P-based approach (which exploits sequences of POS-patterns and association measures) and the S-based approach (which exploits syntactic dependencies and association measures) should be combined. We illustrate this point with an example based on the Target Lexeme (TL) gettare ‘throw’ (V).¹

We want to use S-based methods to capture the fact that V occurs typically within some syntactic Frames and not others, that for each Frame we have typical Fillers (lexical items) instantiating Frame slots, and that each slot is associated with certain semantic (ontological) classes:²

* subj#obj#comp-su
** OBJ Filler: {acqua, ombra, benzina, ...}; {Substance, Natural Phenomenon, ...}
** COMP-su Filler: {fuoco, tavolo, bilancia, lastrico, istituzione, ...}; {Artifact, Substance, ...}
* subj#obj#comp-in
** OBJ Filler: {scompiglio, sasso, corpo, fumo, cadavere, ...}; {Natural Object, Substance, ...}
** COMP-in Filler: {panico, caos, sconforto, mare, stagno, cestino, ...}; {Feeling, State, ...}
* subj#obj
** OBJ Filler: {spugna, base, ombra, acqua, luce, ponte, ...}; {Substance, Artifact, ...}

¹ All data is from a version of the “la Repubblica” corpus (Baroni et al., 2004) POS tagged with the Part-Of-Speech tagger described in Dell’Orletta (2009) and dependency parsed with DeSR (Attardi and Dell’Orletta, 2009).

² Data extracted by LexIt (Lenci, 2014). The list is partial: only the first three Frames are included; Frames with the reflexive form gettarsi ‘throw oneself’ and objectless forms are excluded.

At this point, we observe that all these words are typically associated with our TL, but we don’t know in which way they are all linked to one another. For instance, we have no grounds for thinking that subj#gettare#acqua#su fuoco is any different from subj#gettare#acqua#su tavolo or subj#gettare#ombra#su istituzione. However, while gettare acqua sul fuoco ‘defuse’ is an idiom in Italian, gettare acqua sul tavolo only has a literal meaning (‘throw water on the table’); subj#gettare#fango#su istituzione is yet different, since gettare fango su ‘defame’ is a fixed expression, but the Filler istituzione ‘institution’ is just one of many possibilities, so the expression is partially fixed, resulting in something like [gettare fango su PERSON/INSTITUTION]. The significance of gettare acqua sul fuoco with respect to gettare acqua sul tavolo emerges much more clearly if we use a P-based method. Extracting surface material, the former expression will be ranked higher than the latter (given the pattern “V N PREPART N”) as the association between all words is stronger.

So, fine-grained differences do not emerge with the S-method, while the P-based method fails to capture the higher-level generalizations we get with the S-method. In order to get the best of both worlds, we extracted corpus data into SYMPAThy (SYntactically Marked PATterns), a database where information on both levels is stored and accessible jointly:

* syntactic frames with argument slots and fillers;
* linear order of all elements for each TL;
* POS tag for each element (simple preposition vs. preposition with article, definite vs. indefinite article, modal vs. full verb, etc.);
* morphosyntactic features: gender, number, finiteness, tense, etc.

===3 WoC fixedness with SYMPAThy===
Since constructions span along a continuum between fixedness and productivity, there have been various attempts at measuring how fixed a given WoC is, mostly based on surface features. Nissim and Zaninello (2011) assess the fixedness of a subset of complex nominals by comparing inflected and lemmatized forms, and taking into account the proportion of elements that undergo variation in a given MWE. Inflection is also used by Squillante (2014) on noun-adjective expressions, and is combined with two other measures, interruptibility and substitutability. Zeldes (2013) extends Baayen’s morphological productivity approach to argument structure and estimates the productivity of a syntactic slot from the number of its hapax noun fillers. Wulff (2009) uses a set of morphosyntactic indexes of variation and a collocation-based index of compositionality as variables in a regression study to determine fixedness.

We extend the state of the art of the quantitative approach to construction fixedness by exploiting the potentialities of SYMPAThy to develop a series of corpus-based indexes able to describe the fixedness of some idiomatic expressions. Our approach is then evaluated by comparing, for a sample list of expressions, a composition of our indexes against the behavioral judgments of syntactic flexibility collected by Tabossi et al. (2011).

====3.1 The combinatory behaviour of a TL====
In the SYMPAThy model, the combinatory space of a Target Lexeme is assumed to be formed by a network of Cxns, varying in their degree of fixedness/productivity. For any given TL such a representation is built by means of the following four-step procedure:

# its SYMPAThy patterns are extracted from a reference corpus;
# the set of single and multiple slot Cxns that the TL combines with is semi-automatically identified. An example for the verb gettare is reported and explained in Appendix 1;
# each construction is associated with a variational profile formed by a number of statistics extracted from the SYMPAThy pattern to estimate: i) the variability of the fillers that instantiate the syntactic slots of constructions; ii) the morphological variability of the constructions’ components; iii) the variability with respect to determiners; iv) the variability with respect to adjectival and adverbial modifications; v) the variability in the linear order;
# variational profiles are then used to measure the lexical, morphological and syntactic degrees of freedom of Cxns, providing a multidimensional quantitative characterization of their level of fixedness.

====3.2 Entropy-based Cxn fixedness modeling====
In what follows, we devise a way to encode the variation possibilities shown by Cxns, as well as a meaningful way to combine them. Specifically, we distinguish a series of dimensions of variation and propose to exploit Entropy (Shannon, 1948) to measure how fixed the behavior of a Cxn is in a given dimension.

Entropy is a measure of randomness, calculated as the average uncertainty of a single variable:

:<math>H(X) = -\sum_{x \in X} p(x) \log_2 p(x)</math> (1)

This measure of randomness can be adapted to our needs by taking the variable X as being a Cxn of interest, and the states of the system x as its values on one dimension of variation. Lower entropy values are to be understood as evidence of fixedness, while higher values suggest a more variable distribution of the states of a given variable, i.e. the target construction tends to be freer.

Observed entropy values, however, can span from 0 to the logarithm of the number of values that X can assume. As a consequence, entropy values related to different dimensions of variation are not comparable, and cannot be combined into a single fixedness index. We overcome this limitation by following Wulff (2008) and describing the randomness of each variability dimension in terms of relative entropy, computed as the ratio between the observed entropy from eq. 1 and the maximum entropy <math>H_{max}</math> for the variable X:

:<math>H_{rel}(X) = \frac{H(X)}{H_{max}(X)} = \frac{H(X)}{\log_2(|X|)}</math> (2)

This measure, which ranges from 0 to 1, has been employed to describe the flexibility of a given set of target Cxns along the following dimensions of variation:

LEXICAL VARIABILITY. The entropy of the lexical instantiation of the slot positions of a Frame is calculated by assuming that the states x of the random variable X are all the possible fillers that can instantiate a given slot in a Cxn (e.g. in subj#gettare#obj:luce#su X, X can be filled by vicenda ‘matter’, mistero ‘mystery’, etc.).

MORPHOLOGICAL VARIABILITY. It is calculated as the entropy of the morphological features manifested by the fillers of a Cxn (e.g., gettare#ombra-fs ‘cast shadow-singular’; gettare#ombra-fp ‘cast shadow-plural’).

ARTICLES VARIABILITY. This index encodes the variability in the presence or absence of articles determining the available slots in a Cxn, and, if appropriate, in their type (DEFinite vs. INDefinite): for instance, gettare#∅+acqua#su DEF+fuoco.

PRESENCE OF MODIFIERS. This index encodes the variability in the presence or absence of adjectives, adverbs or prepositional phrases modifying the available slots. In this way, it is possible to account for patterns like gettare#molta+acqua#su ∅+fuoco.

DISTANCE VARIABILITY. This index exploits information on linear order available in SYMPAThy to estimate how variable the distance in tokens is between a TL and the other constituents of a given lexically specified Cxn.

In the experiment reported in the next section, we have combined the single variability measures <math>H_{rel}(X)</math> into an overall flexibility index <math>F(X)</math>, using four alternative combination methods:

* SUM: <math>F(X)</math> is obtained by summing over all the single <math>H_{rel}(X)</math> values;
* AVERAGE: <math>F(X)</math> is the mean of the single <math>H_{rel}(X)</math> values;
* AVERAGEPOS: <math>F(X)</math> is the mean of the positive <math>H_{rel}(X)</math> values;
* MAX: <math>F(X)</math> is the highest <math>H_{rel}(X)</math> value.

We leave to future research the investigation of further ways to combine the variability indexes.

===4 Evaluation===
In order to evaluate our approach, we set out to test whether our indexes can mimic the intuitive judgments of native speakers about the fixedness of fully lexically specified constructions. To do so, we selected a subset of the idioms in the norms collected by Tabossi et al. (2011), and tested to what degree the speaker-elicited flexibility judgments available in this repository can be modeled by a composition of our variability indexes.

====4.1 The descriptive norms by Tabossi et al.====
Tabossi et al. (2011) collected several normative measures for 245 Italian verbal idiomatic expressions. Using a group of 740 Italian speakers, they collected a minimum of 40 elicited judgments for each idiom on several psycholinguistically relevant variables.

Among the different kinds of ratings, those concerning syntactic flexibility have been collected by inserting each idiomatic expression in a sentence in which one of the following five syntactic modifications occurred: adverb insertion, adjective insertion, left dislocation, passive and movement. Participants were asked to evaluate, on a 7-point scale, how much the meaning of the idiomatic expression in the syntactically modified sentence was similar to its unmarked meaning as expressed in a paraphrase prepared by the authors.

====4.2 Data extraction====
Out of the 245 expressions in Tabossi et al., we selected the 23 target idioms reported in Appendix 2. Each such idiom can be represented, in our approach, as a fully lexically specified transitive Cxn headed by a given verbal TL, for which the subject slot is underspecified (e.g. gettare#obj:maschera). We built the variational profiles of our target idioms by adopting an adapted version of the procedure described in Section 3.1:

# for each TL, we extracted the SYMPAThy patterns from the “la Repubblica” corpus;
# the patterns involving one of our target idioms were identified and selected;
# for each idiom, the variability indexes described in Section 3.2 were calculated. Note that, given the nature of our experimental stimuli, the lexical variability index is not relevant;
# we built a fixedness index for each idiom, according to the four combination methods introduced in Section 3.2.

====4.3 Results and discussion====
In order to test the cognitive plausibility of the fixedness indexes extracted from SYMPAThy, we calculated the Pearson product-moment correlation between them and the syntactic flexibility ratings in Tabossi et al. (2011). Correlation values are reported in Table 1. In all cases, there is a significant (p < .05) positive correlation, ranging between .44 and .47, thus supporting the psycholinguistic plausibility of our corpus-based variability indexes.

{| class="wikitable"
|+ Table 1: Pearson’s correlation between different combination methods of the SYMPAThy-based fixedness indexes and the syntactic flexibility judgments in Tabossi et al. (2011). All reported values are associated with p < .05, N = 23.
! Combination !! r
|-
| SUM || .44
|-
| AVERAGE || .44
|-
| AVERAGEPOS || .46
|-
| MAX || .47
|}

These results, albeit preliminary, look promising, especially given the different nature of the behavioral and corpus-based indexes. On the one hand, the speakers’ ratings are semantically driven, since they are thought to model how much the figurative meaning of a given idiom is sensitive to its syntactic form. On the other hand, the information that our indexes automatically derive from the corpus does not take meaning into account. Such indexes describe a lexically specified Cxn that can in principle have an idiomatic as well as a compositional, literal meaning (even if, presumably, the latter case is rare in the corpus).

===5 Conclusion===
In this study we presented a procedure for characterizing the combinatorial potential of a lexical item and the degree of fixedness of the Cxns it occurs in. Such a procedure has been preliminarily tested on a small sample of idiomatic expressions and the resulting representation has been evaluated against the subject-elicited judgments collected by Tabossi et al. (2011). In the future, we plan to extend the inventory of variability dimensions (addressing also the question of the semantic compositionality of Cxns), to study their relative weight and their interactions, and to develop more sophisticated ways to combine them.

===Acknowledgments===
This research was carried out within the CombiNet project (PRIN 2010-2011 Word Combinations in Italian: theoretical and descriptive analysis, computational models, lexicographic layout and creation of a dictionary, n. 20105B3HE8) funded by the Italian Ministry of Education, University and Research (MIUR).

===References===
* Giuseppe Attardi and Felice Dell’Orletta. 2009. Reverse revision and linear tree combination for dependency parsing. In Proceedings of NAACL 2009, pages 261–264.
* Marco Baroni, Silvia Bernardini, Federica Comastri, Lorenzo Piccioni, Alessandra Volpi, Guy Aston, and Marco Mazzoleni. 2004. Introducing the La Repubblica Corpus: A Large, Annotated, TEI(XML)-Compliant Corpus of Newspaper Italian. In Proceedings of LREC 2004, pages 1771–1774.
* Nicoletta Calzolari, Charles J. Fillmore, Ralph Grishman, Nancy Ide, Alessandro Lenci, Catherine MacLeod, and Antonio Zampolli. 2002. Towards best practice for multiword expressions in computational lexicons. In Proceedings of LREC 2002, pages 1934–1940.
* Felice Dell’Orletta. 2009. Ensemble system for Part-of-Speech tagging. In Proceedings of EVALITA 2009.
* Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech & Language, 19(4):450–466. Special issue on Multiword Expression.
* Charles J. Fillmore, Paul Kay, and Mary Catherine O’Connor. 1988. Regularity and idiomaticity in grammatical constructions: the case of let alone. Language, 64(3):501–538.
* Adele Goldberg. 1995. Constructions. A Construction Grammar Approach to Argument Structures. The University of Chicago Press, Chicago.
* Adele Goldberg. 2006. Constructions at work. Oxford University Press, Oxford.
* Stefan Th. Gries. 2008. Phraseology and linguistic theory: a brief survey. In Sylviane Granger and Fanny Meunier, editors, Phraseology: an interdisciplinary perspective, pages 3–25. John Benjamins, Amsterdam & Philadelphia.
* Thomas Hoffmann and Graeme Trousdale, editors. 2013. The Oxford Handbook of Construction Grammar. Oxford University Press, Oxford.
* Alessandro Lenci. 2014. Carving verb classes from corpora. In Raffaele Simone and Francesca Masini, editors, Word Classes. Nature, typology and representations, Current Issues in Linguistic Theory, pages 17–36. John Benjamins.
* Malvina Nissim and Andrea Zaninello. 2011. A quantitative study on the morphology of Italian multiword expressions. Lingue e Linguaggio, X:283–300.
* Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and D. Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of CICLing 2002, pages 1–15.
* Claude E. Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423.
* Luigi Squillante. 2014. Towards an empirical subcategorization of multiword expressions. In Proceedings of the 10th Workshop on Multiword Expressions (MWE), pages 77–81, Gothenburg, Sweden, April. Association for Computational Linguistics.
* Patrizia Tabossi, Lisa Arduino, and Rachele Fanari. 2011. Descriptive norms for 245 Italian idiomatic expressions. Behavior Research Methods, 43:110–123.
* Stefanie Wulff. 2008. Rethinking Idiomaticity: A Usage-based Approach. Continuum.
* Stefanie Wulff. 2009. Converging evidence from corpus and experimental data to capture idiomaticity. Corpus Linguistics and Linguistic Theory, 5(1):131–159.
* Amir Zeldes. 2013. Productive argument selection: Is lexical semantics enough? Corpus Linguistics and Linguistic Theory, 9(2):263–291.

===Appendix 1: A SYMPAThy-based view of the network of Cxns with the verb gettare===
[Figure: network of Cxns for TL = GETTARE ‘THROW’. Nodes are connected by instantiation links; further nodes are marked only as “...” in the figure. The content of the labelled nodes is listed below.]

Frame-level Cxns:

* Frame1 (subj#obj#comp-su) Cxn
** form: [[SUBJ]NP gettare [OBJ]NP su [COMP]NP]
** SUBJ: Person, Event, ...
** OBJ: Substance, Natural_Phenomenon, ...
** COMP: Artifact, Substance, ...
** meaning: [CAUSE (OBJ, [GO (OBJ, [TO ([ON (COMP)])])])]
** examples: ... Gli amici hanno gettato sulla bara garofani rossi ... ‘Friends threw red carnations on his coffin’; ... getta un sasso sull’autostrada ... ‘(s/he) throws a stone onto the highway’
* Frame2 (subj#obj) Cxn
** form: [[SUBJ]NP gettare [OBJ]NP]
** SUBJ: Person, Animal, ...
** OBJ: Substance, Artifact, ...
** meaning: [CAUSE (OBJ, [GO (AWAY)])]
* Frame3 (subj#obj#comp-in) Cxn
** form: [[SUBJ]NP gettare [OBJ]NP in [COMP]NP]
** SUBJ: Event, Act, ...
** OBJ: Natural_Object, Substance, ...
** COMP: Feeling, State, ...
** meaning: [CAUSE (OBJ, [GO (OBJ, [TO ([IN (COMP)])])])]

Lexically specified Cxns:

* gettare#fango##comp-su Cxn
** form: [[SUBJ]NP gettare (ADV) (ADJ) fango su [COMP]NP]
** SUBJ: Person, Event, ...
** OBJ: fango (⇒ SG; bare | partitive)
** COMP: Person, Institution, ...
** meaning: ‘defame, discredit, blacken the name of’
** examples: ... rischia di gettare ulteriore fango sul calcio ... ‘(it) may sully football even more’; ... Hanno sempre gettato fango su di noi ... ‘They have always sullied us’
* gettare#ombra#comp-su Cxn
** form: [[SUBJ]NP gettare (ADV) [ombra]NP su [COMP]NP]
** SUBJ: Person, Event, ...
** OBJ: ombra (⇒ full NP)
** COMP: Person, Institution, ...
** meaning: ‘cast a shadow’
** examples: ... Questo getta una pesantissima ombra sulla legittimità ... ‘This casts a serious shadow on the legitimacy...’; ... Il rivale getta ombra sulla salute del leader ... ‘His opponent casts a shadow on the leader’s health’
* gettare#acqua#sul#fuoco Cxn
** form: [[SUBJ]NP gettare (ADV) (ADJ) acqua sul fuoco]
** SUBJ: Person, Event, ...
** OBJ: acqua
** COMP: fuoco
** SU: sul
** meaning: ‘defuse, minimize a situation’
** examples: ... la società getta acqua sul fuoco ... ‘the company defuses (the situation)’; ... getta abbondante acqua sul fuoco ... ‘(it) minimizes (the situation) greatly’
* gettare#benzina#sul#fuoco Cxn
** form: [[SUBJ]NP gettare (ADV) (ADJ) benzina sul fuoco]
** SUBJ: Person, Event, ...
** OBJ: benzina
** COMP: fuoco
** SU: sul
** meaning: ‘add fuel to the fire’
** examples: ... lei sta gettando benzina sul fuoco ... ‘she is adding fuel to the fire’; ... Evitiamo di gettare altra benzina sul fuoco ... ‘Let’s not add fuel to the fire’

The verb gettare ‘to throw’ combines with the highly schematic subj#obj#comp-su Cxn, whose slots can freely vary with respect to linear order, presence of determiners, modifiers, etc. A semi-productive instance of this construction is the subj#obj:ombra#comp-su Cxn, with a fixed object slot and a partially variable oblique slot, which can appear with a semantically limited range of arguments. A fully lexically specified instance of the same construction is instead the subj#obj:acqua#comp-su:sul-fuoco Cxn, which has both slots instantiated and a limited degree of variability.

===Appendix 2: List of idioms used as experimental stimuli===
* Gettare la maschera (‘to reveal oneself’)
* Gettare la spugna (‘to give up’)
* Gettare acqua sul fuoco (‘to defuse a situation’)
* Gettare olio sul fuoco (‘to inflame a situation’)
* Mettere la mano sul fuoco (‘to stake one’s life on sth’)
* Mettere il carro davanti ai buoi (‘to put the cart before the horse’)
* Mettere le carte in tavola (‘to lay one’s cards on the table’)
* Mettersi il cuore in pace (‘to resign oneself to sth’)
* Mettere nero su bianco (‘to put sth down in black and white’)
* Mettere il dito sulla piaga (‘to hit someone where it hurts’)
* Mettere i puntini sulle i (‘to be nitpicking’)
* Mettere zizzania (‘to sow discord’)
* Perdere la testa (‘to lose one’s head’)
* Perdere il treno (‘to miss an opportunity’)
* Perdere il filo (‘to lose the thread’)
* Perdere la bussola (‘to lose one’s bearings’)
* Prendere il toro per le corna (‘to take the bull by the horns’)
* Prendere una cotta (‘to get a crush on somebody’)
* Prendere un granchio (‘to make a blunder’)
* Tirare i remi in barca (‘to rest on one’s oars’)
* Tirare la cinghia (‘to tighten one’s belt’)
* Tirare le cuoia (‘to die’)
* Tirare la corda (‘to take sth too far’)
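
As an illustration of the entropy-based indexes of Section 3.2, the following minimal Python sketch computes the relative entropy of one dimension of variation and combines several such values with the SUM, AVERAGE, AVERAGEPOS and MAX methods. It is not the authors' implementation. The function names `relative_entropy` and `flexibility` are hypothetical, and three choices are assumptions made only for this sketch: |X| is taken to be the number of distinct states actually observed, a dimension with fewer than two observed states is given a relative entropy of 0, and AVERAGEPOS returns 0 when no value is positive.

```python
import math
from collections import Counter


def relative_entropy(observations):
    """Relative entropy H(X) / log2(|X|) of the observed states on one dimension.

    Assumption: |X| is the number of distinct observed states.
    Assumption: with fewer than two distinct states the result is 0.0 (fully fixed).
    """
    counts = Counter(observations)
    total = sum(counts.values())
    n_states = len(counts)
    if total == 0 or n_states < 2:
        return 0.0
    # Shannon entropy of the observed distribution (eq. 1)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Normalise by the maximum entropy log2(|X|) (eq. 2)
    return h / math.log2(n_states)


def flexibility(h_rel_values, method="SUM"):
    """Combine per-dimension relative entropies into one flexibility index F(X)."""
    values = list(h_rel_values)
    if not values:
        raise ValueError("at least one relative entropy value is required")
    if method == "SUM":
        return sum(values)
    if method == "AVERAGE":
        return sum(values) / len(values)
    if method == "AVERAGEPOS":
        positive = [v for v in values if v > 0]
        # Assumption: 0.0 when no dimension shows any variation
        return sum(positive) / len(positive) if positive else 0.0
    if method == "MAX":
        return max(values)
    raise ValueError(f"unknown combination method: {method}")


if __name__ == "__main__":
    # Toy observations for one Cxn (invented for illustration, not corpus data)
    articles = ["DEF", "DEF", "DEF", "IND"]        # article type on a slot
    modifiers = ["none", "none", "none", "none"]   # no modifier ever observed
    profile = [relative_entropy(articles), relative_entropy(modifiers)]
    for method in ("SUM", "AVERAGE", "AVERAGEPOS", "MAX"):
        print(method, flexibility(profile, method))
```

Because every relative entropy lies between 0 and 1, AVERAGE, AVERAGEPOS and MAX also stay in that range, while SUM grows with the number of dimensions considered.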
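
The evaluation in Section 4.3 compares corpus-based indexes with speaker ratings through Pearson's product-moment correlation. The sketch below shows that statistic in plain Python under stated assumptions: the function name `pearson_r` is hypothetical, the two input lists are invented toy numbers rather than data from the paper or from Tabossi et al. (2011), and no significance test is included.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


if __name__ == "__main__":
    # Toy values only: one flexibility index and one mean rating per idiom
    corpus_index = [0.10, 0.40, 0.35, 0.80]
    speaker_rating = [2.0, 3.5, 3.0, 5.5]
    print(pearson_r(corpus_index, speaker_rating))
```

In the setting described in the paper, each list would hold one value per target idiom (N = 23), with one correlation computed for each of the four combination methods.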