=Paper=
{{Paper
|id=Vol-2006/paper043
|storemode=property
|title=Deep-learning the Ropes: Modeling Idiomaticity with Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-2006/paper043.pdf
|volume=Vol-2006
|authors=Yuri Bizzoni,Marco S.G. Senaldi,Alessandro Lenci
|dblpUrl=https://dblp.org/rec/conf/clic-it/BizzoniSL17
}}
==Deep-learning the Ropes: Modeling Idiomaticity with Neural Networks==
Yuri Bizzoni (University of Gothenburg, Sweden), Marco S. G. Senaldi (Scuola Normale Superiore, Italy), Alessandro Lenci (University of Pisa, Italy)
yuri.bizzoni@gu.se, marco.senaldi@sns.it, alessandro.lenci@unipi.it

Abstract

English. In this work we explore the possibility of training a neural network to classify and rank idiomatic expressions under constraints of data scarcity. We discuss our results, comparing them both to other unsupervised models designed to perform idiom detection and to similar supervised classifiers trained to detect metaphoric bigrams.

Italiano. In questo lavoro esploriamo la possibilità di addestrare una rete neurale per classificare ed ordinare espressioni idiomatiche in condizioni di scarsità di dati. I nostri risultati sono discussi in comparazione sia con altri algoritmi non supervisionati ideati per l'identificazione di espressioni idiomatiche sia con classificatori supervisionati dello stesso tipo addestrati per identificare bigrammi metaforici.

1 Introduction

Figurative expressions like idioms (e.g. to learn the ropes 'to learn how to do a job', to cut the mustard 'to perform up to expectations', etc.) and metaphors (e.g. clean performance, that lawyer is a shark, etc.) are pervasive in language use. Important differences have been stressed between the two types of expressions from a theoretical (Gibbs, 1993; Torre, 2014), neurocognitive (Bohrn et al., 2012) and corpus-linguistic (Liu, 2003) perspective. On the one hand, as stated by Lakoff and Johnson (2008), linguistic metaphors reflect an instantiation of conceptual metaphors, whereby abstract concepts in a target domain (e.g. the ruthlessness of a lawyer) are described by a rather transparent mapping to concrete examples taken from a source domain (e.g. the aggressiveness of a shark). On the other hand, although most idioms originate as metaphors (Cruse, 1986), they have undergone a crystallization process in diachrony, whereby they now appear as fixed and non-compositional word combinations that belong to the wider class of Multiword Expressions (MWEs) (Sag et al., 2002) and always exhibit lexical and morphosyntactic rigidity to some extent (Cacciari and Glucksberg, 1991; Nunberg et al., 1994). It is nevertheless crucial to underline that idiomaticity itself is a multidimensional and gradient phenomenon (Nunberg et al., 1994; Wulff, 2010), with different idioms showing varying degrees of semantic transparency, formal versatility, proverbiality and affective valence.

The aim of this work is to explore the fuzzy boundary between idiomatic and metaphorical expressions by applying a method designed to discriminate figurative vs. literal usages to the task of distinguishing idiomatic from compositional expressions. Our starting point is the work of Bizzoni et al. (2017). The authors managed to classify adjective-noun pairs where the same adjectives were used both in a metaphorical and a literal sense (e.g. clean performance vs. clean floor) using a neural classifier trained on a composition of the words' embeddings (Mikolov et al., 2013). In essence, the neural network was able to detect the abstract/concrete semantic shift of nouns when used with the same adjective in figurative and literal compositions respectively, basically treating the noun as the "context" to discriminate the metaphoricity of the adjective. In our attempt, we use a relatively similar approach to classify idiomatic expressions by training a three-layered neural network on a set of idiomatic and non-idiomatic expressions, and we compare the performance of the network when trained on different syntactic patterns (Adjective-Noun and Verb-Noun expressions, AN and VN henceforth).

Importantly, the abstract/concrete polarity the network was able to learn in Bizzoni et al. (2017) will not be available this time, since none of the idiom constituents will ever appear in its literal sense inside the expressions, whatever their concreteness may be. What we want to find out is whether the sole information captured by the distributional vector of a single expression is sufficient to learn its potential idiomaticity. Differently from Bizzoni et al. (2017), for each idiom we collect a count-based vector (Turney and Pantel, 2010) of the expression as a whole, taken as a single token. We compare this approach with a model trained on the composition of the individual words of an expression, showing that the latter is less effective for idioms than for metaphors. In both cases we operate on scarce training sets (26 AN and 90 VN constructions). Traditional ways to deal with data scarcity in computational linguistics resort to a wide number of different features to annotate the training set (see for example Tanguy et al. (2012)) or rely on artificial bootstrapping of the training set (He and Liu, 2017). In our case we test the performance of our classifier on scarce data without bootstrapping the dataset, relying only on the information provided by the distributional semantic space, and show that the distribution of an expression in large corpora can provide enough information to learn idiomaticity from few examples with a satisfactory degree of accuracy.

2 Related Work

Previous computational research has exploited different methods to perform idiom type detection (i.e., automatically telling apart potential idioms like to get the sack from only-literal combinations like to kill a man). For example, Lin (1999) and Fazly et al. (2009) label a given word combination as idiomatic if the Pointwise Mutual Information (PMI) (Church and Hanks, 1991) between its constituents is higher than the PMIs between the components of a set of lexical variants of this combination, obtained by replacing the component words of the original expression with semantically related words. Other studies have resorted to Distributional Semantics (Lenci, 2008; Turney and Pantel, 2010) by measuring the cosine between the vector of a given phrase and the single vectors of its components (Fazly and Stevenson, 2008), or between the phrase vector and the sum or product vector of its components (Mitchell and Lapata, 2010; Krčmář et al., 2013). Senaldi et al. (2016b) and Senaldi et al. (2016a) have combined insights from both these approaches by observing that the vectors of VN and AN idioms are less similar to the vectors of their lexical variants than the vectors of compositional constructions are to theirs. To the best of our knowledge, neural networks have previously been adopted to perform MWE detection in general (Legrand and Collobert, 2016; Klyueva et al., 2017), but not idiom identification specifically. In Bizzoni et al. (2017), pre-trained noun and adjective vector embeddings are fed to a single-layered neural network to disambiguate metaphorical and literal AN combinations. Several combination algorithms are experimented with to concatenate adjective and noun embeddings. All in all, the method is shown to outperform the state of the art, presumably leveraging the abstractness degree of the noun as a clue to metaphoricity.
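To make the two families of unsupervised measures mentioned above concrete, the following sketch illustrates them in Python. It is not the cited authors' code: the corpus statistics (`counts`, `n_tokens`) and the variant lists are hypothetical inputs, and the functions only show (i) the PMI comparison between an expression and its lexical variants in the spirit of Lin (1999) and Fazly et al. (2009), and (ii) the cosine between a phrase vector and the sum of its component vectors as in Mitchell and Lapata (2010).

```python
import numpy as np

def pmi(count_xy, count_x, count_y, n_tokens):
    """Pointwise Mutual Information of a word pair (Church and Hanks, 1991)."""
    p_xy = count_xy / n_tokens
    p_x, p_y = count_x / n_tokens, count_y / n_tokens
    return np.log2(p_xy / (p_x * p_y))

def pmi_based_idiom_score(target_pair, variant_pairs, counts, n_tokens):
    """Positive score: the pair's PMI exceeds the PMIs of all its lexical variants
    (obtained by replacing component words with semantically related ones),
    which is taken as evidence of idiomaticity.
    `counts` is a hypothetical dict of corpus frequencies for words and pairs."""
    def pair_pmi(pair):
        w1, w2 = pair
        return pmi(counts[(w1, w2)], counts[w1], counts[w2], n_tokens)
    return pair_pmi(target_pair) - max(pair_pmi(v) for v in variant_pairs)

def compositionality_cosine(phrase_vec, word_vecs):
    """Cosine between the phrase vector and the sum of its component word vectors:
    a low similarity suggests a non-compositional (idiomatic) expression."""
    summed = np.sum(word_vecs, axis=0)
    return float(phrase_vec @ summed /
                 (np.linalg.norm(phrase_vec) * np.linalg.norm(summed)))
```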
3 Dataset

3.1 Target expressions extraction

The two idiom datasets we employ in the current study come from Senaldi et al. (2016b) and Senaldi et al. (2016a). The first one is composed of 45 idiomatic and 45 non-idiomatic Italian V-NP and V-PP constructions (e.g. tagliare la corda 'to flee', lit. 'to cut the rope', and leggere un libro 'to read a book') that were selected from an Italian idiom dictionary (Quartu, 1993) and extracted from the itWaC corpus (Baroni et al., 2009), composed of about 1,909M tokens. Their frequency spanned from 364 (ingannare il tempo 'to while away the time') to 8,294 (andare in giro 'to get about'). The second comprises 13 idiomatic and 13 non-idiomatic AN constructions (e.g. punto debole 'weak point' and nuova legge 'new law') that were likewise extracted from itWaC and whose frequency varied from 21 (alte sfere 'high places', lit. 'high spheres') to 194 (punto debole).

3.2 Building target vectors

Count-based Distributional Semantic Models (DSMs) (Turney and Pantel, 2010) allow for representing words and expressions as high-dimensional vectors, where the vector dimensions register the co-occurrence of the target words or expressions with some contextual features, e.g. the content words that linearly precede and follow the target element within a fixed contextual window. We built two DSMs on itWaC, where our target AN and VN idioms and non-idioms were represented as target vectors and the co-occurrence statistics counted how many times each target construction occurred in the same sentence with each of the 30,000 top content words in the corpus. Differently from Bizzoni et al. (2017), we did not opt for prediction-based vector representations (Mikolov et al., 2013). Although some studies have shown that context-predicting models fare better than count-based ones on a variety of semantic tasks (Baroni et al., 2014), including compositionality modeling (Rimell et al., 2016), others (Blacoe and Lapata, 2012; Cordeiro et al., 2016) have found them to perform comparably. Moreover, Levy et al. (2015) highlight that much of the superiority in performance exhibited by word embeddings is actually due to hyperparameter optimizations which, if applied to traditional models as well, can lead to equivalent outcomes. Therefore, we felt confident in resorting to count-based vectors as an equally reliable representation for the task at hand.
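The construction of such target vectors can be sketched roughly as follows. This is an assumption-laden illustration rather than the pipeline actually run on itWaC: tokenization, lemmatization, content-word filtering and the matching of V-NP/V-PP and AN patterns are all simplified, and the `targets` matcher is a hypothetical stand-in.

```python
from collections import Counter
import numpy as np

def build_target_vectors(sentences, targets, vocab_size=30000):
    """Count, for each target expression, its sentence-level co-occurrence with the
    top `vocab_size` words of the corpus. `sentences` is a list of token lists;
    `targets` maps an expression label to a predicate over a sentence that decides
    whether the expression occurs in it (in practice one would match lemmatized
    verb-noun or adjective-noun patterns and restrict the vocabulary to content words)."""
    word_freq = Counter(w for sent in sentences for w in sent)
    vocab = [w for w, _ in word_freq.most_common(vocab_size)]
    idx = {w: i for i, w in enumerate(vocab)}
    vectors = {t: np.zeros(len(vocab)) for t in targets}
    for sent in sentences:
        present = [t for t, match in targets.items() if match(sent)]
        if not present:
            continue
        for w in sent:
            j = idx.get(w)
            if j is not None:
                for t in present:
                    vectors[t][j] += 1    # sentence-level co-occurrence count
    return vocab, vectors
```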
3.3 Gold standard idiomaticity judgments

In Senaldi et al. (2016b) and Senaldi et al. (2016a), we collected gold standard idiomaticity judgments for our target AN and VN constructions. Nine linguistics students were presented with a list of our 26 AN constructions and were asked to rate how idiomatic each expression was on a scale from 1 to 7, with 1 standing for 'totally compositional' and 7 standing for 'totally idiomatic'. Inter-coder agreement, measured with Krippendorff's α (Krippendorff, 2012), was equal to 0.76. The same procedure was repeated for our 90 VN constructions, but in this case the initial list was split into 3 sublists of 30 expressions, each one to be rated by 3 subjects. Krippendorff's α was 0.83 for the first sublist and 0.75 for the other two.

4 Classifier

We built a neural network composed of three "dense" or fully connected layers of dimensionality 12, 8 and 1 respectively, implemented in Keras, a library running on TensorFlow (Abadi et al., 2016). Our network takes in input a single vector at a time, which can be a word embedding, a count-based distributional vector or a composition of several word vectors. For the core part of our experiment we used as input single distributional vectors of two-word expressions. Due to the magnitude of our input, the most important reduction of data dimensionality is carried out by the first layer of our model. The last layer applies a sigmoid activation function to the output in order to produce a binary judgment. While binary scores are necessary to compute the model's classification accuracy and will be evaluated in terms of F1, our model's continuous scores can also be retrieved and will be used to perform an ordering task on the test set, which we will evaluate in terms of Interpolated Average Precision (IAP), computed, following Fazly et al. (2009), at recall levels of 20%, 50% and 80%, and against the human idiomaticity judgments with Spearman's ρ.
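A minimal Keras sketch of the architecture just described. The layer sizes (12, 8, 1) and the sigmoid output come from the paper; the hidden-layer activations, optimizer, loss and training settings are not specified there, so the choices below are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(input_dim=30000):
    """Three fully connected layers of dimensionality 12, 8 and 1; the final sigmoid
    can be read either as a binary label or as a continuous idiomaticity score
    for the ordering task."""
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dense(12, activation="relu"),   # assumption: hidden activation not stated
        layers.Dense(8, activation="relu"),    # assumption: hidden activation not stated
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",            # assumption
                  loss="binary_crossentropy",  # assumption
                  metrics=["accuracy"])
    return model

# Usage sketch: X is an (n_expressions, 30000) co-occurrence matrix, y holds 0/1 labels.
# model = build_classifier()
# model.fit(X_train, y_train, epochs=20)
# scores = model.predict(X_test).ravel()   # continuous scores used for ranking
```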
5 Evaluation

We trained our model on the 30,000-dimensional distributional vectors of VN and AN expressions, as well as on the composition of their individual words' vectors. We also experimented with different semantic spaces: when trained on PPMI- (Church and Hanks, 1991) and SVD-transformed (Deerwester et al., 1990) vectors of 150, 200, 250 and 300 dimensions, our models performed comparably or even worse, so results for these cases are not presented here. Details of both the classification and the ordering task are shown in Table 1.

  Vector        Training   Test            IAP    rho        F1
  VN            15+15      30+30           0.82   0.50***    0.80
  VN            20+20      15+15           0.82   0.76***    0.87
  Concat (VN)   15+15      14+14           0.70   0.47*      0.69
  AN            8+8        6+4             1.00   0.93***    0.90
  VN+AN         23+23      14+14 (VN)      0.90   0.76***    0.82
  VN+AN         23+23      18+20 (joint)   0.80   0.64***    0.76
  VN+AN         23+23      5+5 (AN)        0.57   -0.31      0.58

Table 1: Interpolated Average Precision, Spearman's correlation with the speaker judgments and F-measure for Verb-Noun training (VN), Adjective-Noun training (AN), joint training and training through vector concatenation (** = p < .01, *** = p < .001). Training and test sets are expressed as the sum of positive and negative examples.
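The ordering-task metrics reported in Table 1 can be computed with standard tools. The sketch below reflects our reading of the IAP definition used here (following Fazly et al. (2009), interpolated precision averaged over the 20%, 50% and 80% recall levels); the exact interpolation convention is not spelled out in the paper, so treat this as an approximation.

```python
import numpy as np
from scipy.stats import spearmanr

def interpolated_average_precision(scores, labels, recall_levels=(0.2, 0.5, 0.8)):
    """Rank items by descending score and average the interpolated precision
    (maximum precision at recall >= r) over the given recall levels."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    interp = [precision[recall >= r].max() for r in recall_levels]
    return float(np.mean(interp))

# Spearman's rho between the model's continuous scores and the 1-7 human
# idiomaticity judgments collected for the same test expressions:
# rho, p_value = spearmanr(model_scores, human_judgments)
```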
5.1 Verb-Noun

We ran our model on the VN dataset, composed of 90 elements: 45 idioms and 45 non-idiomatic expressions. This is the larger of the two datasets. We trained our model both on 30 and on 40 elements for 20 epochs and tested on the remaining 60 and 50 elements respectively, reaching a maximum IAP of 0.87 and a Spearman's ρ of 0.76. In general we found the model's performance, both in accuracy and in correlation, comparable to the results reported in Senaldi et al. (2016b), who reached a maximum IAP of 0.91 and a maximum Spearman's ρ of -0.67.

5.2 Adjective-Noun

We ran our model on the AN dataset, composed of 26 elements: 13 idioms and 13 non-idiomatic expressions. We empirically found that our model was able to perform some generalization on the data when the training set contained at least 14 elements, evenly balanced between positive and negative examples. We trained our model on 16 elements for 30 epochs and tested on the remaining 10 elements. While the exact accuracy value can undergo some fluctuations when a model is trained on very small sets, we always registered accuracies higher than 80%, with 4 out of 5 idioms correctly labeled in every trial. We reached an IAP of 1.0 and a ρ of 0.93, although it is important to keep in mind that such scores are computed on a very restricted test set. Senaldi et al. (2016b) reached a maximum IAP of 0.85 and a maximum ρ of -0.68. When the training size was under the critical threshold, accuracy dropped significantly. With training sets of 10 or 12 elements, our model naturally went into overfitting, quickly reaching 100% accuracy on the training set and failing to correctly classify unseen expressions. In these cases a partial learning was still visible in the ordering task, where most idioms, even if labeled incorrectly, received higher scores than non-idioms.

5.3 Joint training

We also tried to train our model on both datasets together, to check to what extent it would be able to recognize the same underlying semantic phenomenon across different syntactic constructions. We used two different approaches for this experiment. Training our model first on one dataset, e.g. the AN pairs, and then on the other required more epochs overall (more than 100) to stabilize and resulted in a poorer performance (66% F-measure on both test sets). Training our model on a mixed dataset containing the elements of both training sets, the model needed only 12 epochs to reach an F-measure of 76% on the mixed training set. However, we also noticed that VN expressions were learned better than AN expressions. In short, our model was able to generalize over the two datasets, but this involved a loss in accuracy.
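For concreteness, a single run of the protocol described in 5.1 and 5.2 might look like the sketch below. It reuses the hypothetical `build_classifier` and `interpolated_average_precision` helpers sketched earlier; the random, class-balanced train/test split and the 0.5 decision threshold are our assumptions about details the paper does not state.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def run_once(X, y, judgments, train_size, epochs, seed=0):
    """X: (n, 30000) expression vectors; y: 0/1 idiom labels (numpy arrays);
    judgments: mean 1-7 human idiomaticity ratings for the same expressions."""
    idx = np.arange(len(y))
    train_idx, test_idx = train_test_split(
        idx, train_size=train_size, stratify=y, random_state=seed)
    model = build_classifier(input_dim=X.shape[1])   # sketched in Section 4 above
    model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
    scores = model.predict(X[test_idx]).ravel()
    rho, _ = spearmanr(scores, judgments[test_idx])
    return {
        "F1": f1_score(y[test_idx], scores >= 0.5),
        "IAP": interpolated_average_precision(scores, y[test_idx]),
        "rho": rho,
    }

# e.g. run_once(X_vn, y_vn, judgments_vn, train_size=40, epochs=20)
```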
5.4 Vector composition

In addition to using the vector of an expression as a whole, we tried to feed our model with the concatenation of the vectors of the single words in an expression, as in Bizzoni et al. (2017). For example, instead of using the 30,000-dimensional vector of the expression cambiare musica, we used the 60,000-dimensional vector resulting from the concatenation of the vectors of cambiare and musica. We ran this experiment only on the VN dataset, as it is the largest and the one that yielded the best results in the previous settings. We used 30 elements in training and 26 in testing and trained our model for 80 epochs overall. Predictably enough, vector composition resulted in the worst performance, differently from what happened with metaphors (Bizzoni et al., 2017). Nonetheless, the results are not completely random: with an F1 of 69%, the model seems able to learn idiomaticity to a lower, but not null, degree. These findings would be in line with the claim that the meaning of the subparts of several idioms, while less important than in metaphors, is not completely obliterated (McGlone et al., 1994).
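The difference between the two input representations amounts to a small change in how each training vector is assembled; a sketch follows (the lookup tables `expr_vectors` and `word_vectors` are hypothetical placeholders for the DSM rows built as in Section 3.2):

```python
import numpy as np

def whole_expression_input(expr, expr_vectors):
    """Holistic setting: one 30,000-dimensional vector for the expression as a single token."""
    return expr_vectors[expr]                       # e.g. "cambiare musica"

def concatenated_input(expr, word_vectors):
    """Compositional setting: concatenate the component word vectors
    (two 30,000-dimensional vectors give one 60,000-dimensional input)."""
    return np.concatenate([word_vectors[w] for w in expr.split()])
```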
6 Error Analysis

Two frequent false positives are tagliare il traguardo and abbassare la guardia. Although we labeled them as non-idioms in our dataset, since they are rather compositional, they can nonetheless very often be used figuratively, which is probably why our classifier identified them as idioms. A frequent false negative was vedere la luce, which probably occurs more often in its literal sense in the corpus we used.

7 Discussion and Conclusions

It seems that the distribution of idiomatic and compositional expressions in large corpora can suffice for a supervised classifier to learn the difference between the two kinds of expressions from small training sets and with a good level of accuracy. Unlike with metaphors (Bizzoni et al., 2017), feeding the classifier with a composition of the individual words' vectors of such expressions performs rather poorly and can be used to detect only some idioms. This takes us back to the core difference that while metaphors are more compositional and preserve a transparent source-domain-to-target-domain mapping, idioms are by and large non-compositional. Since our classifiers rely only on contextual features, their classification ability must stem from a difference in distribution between idioms and non-idioms. A possible explanation is that while the literal expressions we selected, like vedere un film or ascoltare un discorso, tend to be used with animate subjects and thus to appear in more concrete contexts, most of our idioms (e.g. cadere dal cielo or lasciare il segno) allow for varying degrees of animacy or concreteness of the subject, and thus their contexts can easily become more diverse. At the same time, the drop in performance we observe in the joint models seems to indicate that the different parts of speech composing our elements entail a significant contextual difference between the two groups, which introduces a considerable amount of uncertainty into our model. It is also possible that other contextual elements we did not consider have played a role in the learning process of our models. We intend to investigate this aspect further in future work.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Baroni, M., Bernardini, S., Ferraresi, A., and Zanchetta, E. (2009). The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 238–247.

Bizzoni, Y., Chatzikyriakidis, S., and Ghanimifard, M. (2017). "Deep" learning: Detecting metaphoricity in adjective-noun pairs. In Proceedings of the Workshop on Stylistic Variation, pages 43–52.

Blacoe, W. and Lapata, M. (2012). A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556. Association for Computational Linguistics.

Bohrn, I. C., Altmann, U., and Jacobs, A. M. (2012). Looking at the brains behind figurative language: A quantitative meta-analysis of neuroimaging studies on metaphor, idiom, and irony processing. Neuropsychologia, 50(11):2669–2683.

Cacciari, C. and Glucksberg, S. (1991). Understanding idiomatic expressions: The contribution of word meanings. Advances in Psychology, 77:217–240.

Church, K. W. and Hanks, P. (1991). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Cordeiro, S., Ramisch, C., Idiart, M., and Villavicencio, A. (2016). Predicting the compositionality of nominal compounds: Giving word embeddings a hard time. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1986–1997.

Cruse, D. A. (1986). Lexical Semantics. Cambridge University Press.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391.

Fazly, A., Cook, P., and Stevenson, S. (2009). Unsupervised type and token identification of idiomatic expressions. Computational Linguistics, 1(35):61–103.

Fazly, A. and Stevenson, S. (2008). A distributional account of the semantics of multiword expressions. Italian Journal of Linguistics, 1(20):157–179.

Gibbs, R. W. (1993). Why idioms are not dead metaphors. In Idioms: Processing, Structure, and Interpretation, pages 57–77.

He, X. and Liu, Y. (2017). Not enough data? Joint inferring multiple diffusion networks via network generation priors. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 465–474.

Klyueva, N., Doucet, A., and Straka, M. (2017). Neural networks for multi-word expression detection. In Proceedings of the 13th Workshop on Multiword Expressions, pages 60–65.

Krippendorff, K. (2012). Content Analysis: An Introduction to Its Methodology. Sage.

Krčmář, L., Ježek, K., and Pecina, P. (2013). Determining compositionality of expressions using various word space models and measures. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 64–73.

Lakoff, G. and Johnson, M. (2008). Metaphors We Live By. University of Chicago Press.

Legrand, J. and Collobert, R. (2016). Phrase representations for multiword expressions. In Proceedings of the 12th Workshop on Multiword Expressions, pages 67–71.

Lenci, A. (2008). Distributional semantics in linguistic and cognitive research. Italian Journal of Linguistics, 20(1):1–31.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Lin, D. (1999). Automatic identification of non-compositional phrases. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 317–324.

Liu, D. (2003). The most frequently used spoken American English idioms: A corpus analysis and its implications. TESOL Quarterly, 37(4):671–700.

McGlone, M. S., Glucksberg, S., and Cacciari, C. (1994). Semantic productivity and idiom comprehension. Discourse Processes, 17(2):167–190.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, pages 3111–3119.

Mitchell, J. and Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Nunberg, G., Sag, I., and Wasow, T. (1994). Idioms. Language, 70(3):491–538.

Quartu, M. B. (1993). Dizionario dei modi di dire della lingua italiana. RCS Libri.

Rimell, L., Maillard, J., Polajnar, T., and Clark, S. (2016). RELPRON: A relative clause evaluation data set for compositional distributional semantics. Computational Linguistics, 42(4):661–701.

Sag, I. A., Baldwin, T., Bond, F., Copestake, A., and Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics, pages 1–15.

Senaldi, M. S. G., Lebani, G. E., and Lenci, A. (2016a). Determining the compositionality of noun-adjective pairs with lexical variants and distributional semantics. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016), pages 268–273.

Senaldi, M. S. G., Lebani, G. E., and Lenci, A. (2016b). Lexical variability and compositionality: Investigating idiomaticity with distributional semantic models. In Proceedings of the 12th Workshop on Multiword Expressions, pages 21–31.

Tanguy, L., Sajous, F., Calderone, B., and Hathout, N. (2012). Authorship attribution: Using rich linguistic features when training data is scarce. In PAN Lab at CLEF.

Torre, E. (2014). The Emergent Patterns of Italian Idioms: A Dynamic-Systems Approach. PhD thesis, Lancaster University.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Wulff, S. (2010). Rethinking Idiomaticity: A Usage-based Approach. A&C Black.