=Paper=
{{Paper
|id=Vol-2006/paper043
|storemode=property
|title=Deep-learning the Ropes: Modeling Idiomaticity with Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-2006/paper043.pdf
|volume=Vol-2006
|authors=Yuri Bizzoni,Marco S.G. Senaldi,Alessandro Lenci
|dblpUrl=https://dblp.org/rec/conf/clic-it/BizzoniSL17
}}
==Deep-learning the Ropes: Modeling Idiomaticity with Neural Networks==
Yuri Bizzoni (University of Gothenburg, Sweden), Marco S. G. Senaldi (Scuola Normale Superiore, Italy), Alessandro Lenci (University of Pisa, Italy)
yuri.bizzoni@gu.se, marco.senaldi@sns.it, alessandro.lenci@unipi.it

Abstract

English. In this work we explore the possibility of training a neural network to classify and rank idiomatic expressions under constraints of data scarcity. We discuss our results, comparing them both to other unsupervised models designed to perform idiom detection and to similar supervised classifiers trained to detect metaphoric bigrams.

Italiano. In questo lavoro esploriamo la possibilità di addestrare una rete neurale per classificare ed ordinare espressioni idiomatiche in condizioni di scarsità di dati. I nostri risultati sono discussi in comparazione sia con altri algoritmi non supervisionati ideati per l'identificazione di espressioni idiomatiche sia con classificatori supervisionati dello stesso tipo addestrati per identificare bigrammi metaforici.

1 Introduction

Figurative expressions like idioms (e.g. to learn the ropes 'to learn how to do a job', to cut the mustard 'to perform up to expectations', etc.) and metaphors (e.g. clean performance, that lawyer is a shark, etc.) are pervasive in language use. Important differences have been stressed between the two types of expressions from a theoretical (Gibbs, 1993; Torre, 2014), neurocognitive (Bohrn et al., 2012) and corpus-linguistic (Liu, 2003) perspective. On the one hand, as stated by Lakoff and Johnson (2008), linguistic metaphors reflect an instantiation of conceptual metaphors, whereby abstract concepts in a target domain (e.g. the ruthlessness of a lawyer) are described by a rather transparent mapping to concrete examples taken from a source domain (e.g. the aggressiveness of a shark). On the other hand, although most idioms originate as metaphors (Cruse, 1986), they have undergone a crystallization process in diachrony, whereby they now appear as fixed and non-compositional word combinations that belong to the wider class of Multiword Expressions (MWEs) (Sag et al., 2002) and always exhibit lexical and morphosyntactic rigidity to some extent (Cacciari and Glucksberg, 1991; Nunberg et al., 1994). It is nevertheless crucial to underline that idiomaticity itself is a multidimensional and gradient phenomenon (Nunberg et al., 1994; Wulff, 2010), with different idioms showing varying degrees of semantic transparency, formal versatility, proverbiality and affective valence.

The aim of this work is to explore the fuzzy boundary between idiomatic and metaphorical expressions by applying a method designed to discriminate figurative vs. literal usages to the task of distinguishing idiomatic from compositional expressions. Our starting point is the work of Bizzoni et al. (2017). The authors managed to classify adjective-noun pairs where the same adjectives were used both in a metaphorical and a literal sense (e.g. clean performance vs. clean floor) using a neural classifier trained on a composition of the words' embeddings (Mikolov et al., 2013). In essence, the neural network was able to detect the abstract/concrete semantic shift of nouns when used with the same adjective in figurative and literal compositions respectively, basically treating the noun as the "context" to discriminate the metaphoricity of the adjective. In our attempt, we use a relatively similar approach to classify idiomatic expressions by training a three-layered neural network on a set of idiomatic and non-idiomatic expressions, and we compare the performance of the network when trained on different syntactic patterns (Adjective-Noun and Verb-Noun expressions, AN and VN henceforth).

Importantly, the abstract/concrete polarity the network was able to learn in Bizzoni et al. (2017) will not be available this time, since none of the idiom constituents will ever appear in its literal sense inside the expressions, whatever their concreteness may be. What we want to find out is whether the sole information captured by the distributional vector of a single expression is sufficient to learn its potential idiomaticity. Differently from Bizzoni et al. (2017), for each idiom we collect a count-based vector (Turney and Pantel, 2010) of the expression as a whole, taken as a single token. We compare this approach with a model trained on the composition of the individual words of an expression, showing that the latter is less effective for idioms than for metaphors. In both cases we operate on scarce training sets (26 AN and 90 VN constructions). Traditional ways to deal with data scarcity in computational linguistics resort to a wide number of different features to annotate the training set (see for example Tanguy et al. (2012)) or rely on artificial bootstrapping of the training set (He and Liu, 2017). In our case we test the performance of our classifier on scarce data without bootstrapping the dataset, relying only on the information provided by the distributional semantic space, and show that the distribution of an expression in large corpora can provide enough information to learn idiomaticity from few examples with a satisfactory degree of accuracy.

2 Related Work

Previous computational research has exploited different methods to perform idiom type detection (i.e., automatically telling apart potential idioms like to get the sack from only-literal combinations like to kill a man). For example, Lin (1999) and Fazly et al. (2009) label a given word combination as idiomatic if the Pointwise Mutual Information (PMI) (Church and Hanks, 1991) between its constituents is higher than the PMIs between the components of a set of lexical variants of this combination, obtained by replacing the component words of the original expression with semantically related words. Other studies have resorted to Distributional Semantics (Lenci, 2008; Turney and Pantel, 2010) by measuring the cosine between the vector of a given phrase and the single vectors of its components (Fazly and Stevenson, 2008), or between the phrase vector and the sum or product vector of its components (Mitchell and Lapata, 2010; Krčmář et al., 2013). Senaldi et al. (2016b) and Senaldi et al. (2016a) have combined insights from both these approaches by observing that the vectors of VN and AN idioms are less similar to the vectors of their lexical variants than the vectors of compositional constructions are to theirs. To the best of our knowledge, neural networks have previously been adopted to perform MWE detection in general (Legrand and Collobert, 2016; Klyueva et al., 2017), but not idiom identification specifically. In Bizzoni et al. (2017), pre-trained noun and adjective vector embeddings are fed to a single-layered neural network to disambiguate metaphorical and literal AN combinations. Several combination algorithms are experimented with to concatenate adjective and noun embeddings. All in all, the method is shown to outperform the state of the art, presumably leveraging the abstractness degree of the noun as a clue to metaphoricity.
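To make the two families of unsupervised measures mentioned above concrete, the following sketch illustrates them in Python. It is not the cited authors' code: the corpus statistics (`counts`, `n_tokens`) and the variant lists are hypothetical inputs, and the functions only show (i) the PMI comparison between an expression and its lexical variants in the spirit of Lin (1999) and Fazly et al. (2009), and (ii) the cosine between a phrase vector and the sum of its component vectors as in Mitchell and Lapata (2010).

```python
import numpy as np

def pmi(count_xy, count_x, count_y, n_tokens):
    """Pointwise Mutual Information of a word pair (Church and Hanks, 1991)."""
    p_xy = count_xy / n_tokens
    p_x, p_y = count_x / n_tokens, count_y / n_tokens
    return np.log2(p_xy / (p_x * p_y))

def pmi_based_idiom_score(target_pair, variant_pairs, counts, n_tokens):
    """Positive score: the pair's PMI exceeds the PMIs of all its lexical variants
    (obtained by replacing component words with semantically related ones),
    which is taken as evidence of idiomaticity.
    `counts` is a hypothetical dict of corpus frequencies for words and pairs."""
    def pair_pmi(pair):
        w1, w2 = pair
        return pmi(counts[(w1, w2)], counts[w1], counts[w2], n_tokens)
    return pair_pmi(target_pair) - max(pair_pmi(v) for v in variant_pairs)

def compositionality_cosine(phrase_vec, word_vecs):
    """Cosine between the phrase vector and the sum of its component word vectors:
    a low similarity suggests a non-compositional (idiomatic) expression."""
    summed = np.sum(word_vecs, axis=0)
    return float(phrase_vec @ summed /
                 (np.linalg.norm(phrase_vec) * np.linalg.norm(summed)))
```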
3 Dataset

3.1 Target expressions extraction

The two idiom datasets we employ in the current study come from Senaldi et al. (2016b) and Senaldi et al. (2016a). The first one is composed of 45 idiomatic and 45 non-idiomatic Italian V-NP and V-PP constructions (e.g. tagliare la corda 'to flee', lit. 'to cut the rope', and leggere un libro 'to read a book') that were selected from an Italian idiom dictionary (Quartu, 1993) and extracted from the itWaC corpus (Baroni et al., 2009), composed of about 1,909M tokens. Their frequency spanned from 364 (ingannare il tempo 'to while away the time') to 8,294 (andare in giro 'to get about'). The second comprises 13 idiomatic and 13 non-idiomatic AN constructions (e.g. punto debole 'weak point' and nuova legge 'new law') that were likewise extracted from itWaC and whose frequency varied from 21 (alte sfere 'high places', lit. 'high spheres') to 194 (punto debole).

3.2 Building target vectors

Count-based Distributional Semantic Models (DSMs) (Turney and Pantel, 2010) allow for representing words and expressions as high-dimensional vectors, where the vector dimensions register the co-occurrence of the target words or expressions with some contextual features, e.g. the content words that linearly precede and follow the target element within a fixed contextual window. We built two DSMs on itWaC, where our target AN and VN idioms and non-idioms were represented as target vectors and the co-occurrence statistics counted how many times each target construction occurred in the same sentence with each of the 30,000 top content words in the corpus. Differently from Bizzoni et al. (2017), we did not opt for prediction-based vector representations (Mikolov et al., 2013). Although some studies have shown that context-predicting models fare better than count-based ones on a variety of semantic tasks (Baroni et al., 2014), including compositionality modeling (Rimell et al., 2016), others (Blacoe and Lapata, 2012; Cordeiro et al., 2016) have found them to perform comparably. Moreover, Levy et al. (2015) highlight that much of the superiority in performance exhibited by word embeddings is actually due to hyperparameter optimizations which, if applied to traditional models as well, can lead to equivalent outcomes. Therefore, we felt confident in resorting to count-based vectors as an equally reliable representation for the task at hand.
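The construction of such target vectors can be sketched roughly as follows. This is an assumption-laden illustration rather than the pipeline actually run on itWaC: tokenization, lemmatization, content-word filtering and the matching of V-NP/V-PP and AN patterns are all simplified, and the `targets` matcher is a hypothetical stand-in.

```python
from collections import Counter
import numpy as np

def build_target_vectors(sentences, targets, vocab_size=30000):
    """Count, for each target expression, its sentence-level co-occurrence with the
    top `vocab_size` words of the corpus. `sentences` is a list of token lists;
    `targets` maps an expression label to a predicate over a sentence that decides
    whether the expression occurs in it (in practice one would match lemmatized
    verb-noun or adjective-noun patterns and restrict the vocabulary to content words)."""
    word_freq = Counter(w for sent in sentences for w in sent)
    vocab = [w for w, _ in word_freq.most_common(vocab_size)]
    idx = {w: i for i, w in enumerate(vocab)}
    vectors = {t: np.zeros(len(vocab)) for t in targets}
    for sent in sentences:
        present = [t for t, match in targets.items() if match(sent)]
        if not present:
            continue
        for w in sent:
            j = idx.get(w)
            if j is not None:
                for t in present:
                    vectors[t][j] += 1    # sentence-level co-occurrence count
    return vocab, vectors
```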
3.3 Gold standard idiomaticity judgments

In Senaldi et al. (2016b) and Senaldi et al. (2016a), we collected gold standard idiomaticity judgments for our target AN and VN constructions. Nine linguistics students were presented with a list of our 26 AN constructions and were asked to rate how idiomatic each expression was on a scale from 1 to 7, with 1 standing for 'totally compositional' and 7 standing for 'totally idiomatic'. Inter-coder agreement, measured with Krippendorff's α (Krippendorff, 2012), was equal to 0.76. The same procedure was repeated for our 90 VN constructions, but in this case the initial list was split into 3 sublists of 30 expressions, each one to be rated by 3 subjects. Krippendorff's α was 0.83 for the first sublist and 0.75 for the other two.

4 Classifier

We built a neural network composed of three "dense" or fully connected layers of dimensionality 12, 8 and 1 respectively, implemented in Keras, a library running on TensorFlow (Abadi et al., 2016). Our network takes in input a single vector at a time, which can be a word embedding, a count-based distributional vector or a composition of several word vectors. For the core part of our experiment we used as input single distributional vectors of two-word expressions. Due to the magnitude of our input, the most important reduction of data dimensionality is carried out by the first layer of our model. The last layer applies a sigmoid activation function to the output in order to produce a binary judgment. While binary scores are necessary to compute the model's classification accuracy and will be evaluated in terms of F1, our model's continuous scores can also be retrieved and will be used to perform an ordering task on the test set, which we will evaluate in terms of Interpolated Average Precision (IAP), computed, following Fazly et al. (2009), at recall levels of 20%, 50% and 80%, and against the human idiomaticity judgments with Spearman's ρ.
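A minimal Keras sketch of the architecture just described. The layer sizes (12, 8, 1) and the sigmoid output come from the paper; the hidden-layer activations, optimizer, loss and training settings are not specified there, so the choices below are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(input_dim=30000):
    """Three fully connected layers of dimensionality 12, 8 and 1; the final sigmoid
    can be read either as a binary label or as a continuous idiomaticity score
    for the ordering task."""
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dense(12, activation="relu"),   # assumption: hidden activation not stated
        layers.Dense(8, activation="relu"),    # assumption: hidden activation not stated
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",            # assumption
                  loss="binary_crossentropy",  # assumption
                  metrics=["accuracy"])
    return model

# Usage sketch: X is an (n_expressions, 30000) co-occurrence matrix, y holds 0/1 labels.
# model = build_classifier()
# model.fit(X_train, y_train, epochs=20)
# scores = model.predict(X_test).ravel()   # continuous scores used for ranking
```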
5 Evaluation

We trained our model on the 30,000-dimensional distributional vectors of VN and AN expressions, as well as on the composition of their individual words' vectors. We also experimented with different semantic spaces: when trained on PPMI- (Church and Hanks, 1991) and SVD-transformed (Deerwester et al., 1990) vectors of 150, 200, 250 and 300 dimensions, our models performed comparably or even worse, so results for these cases are not presented here. Details of both the classification and the ordering task are shown in Table 1.

  Vector        Training   Test            IAP    rho        F1
  VN            15+15      30+30           0.82   0.50***    0.80
  VN            20+20      15+15           0.82   0.76***    0.87
  Concat (VN)   15+15      14+14           0.70   0.47*      0.69
  AN            8+8        6+4             1.00   0.93***    0.90
  VN+AN         23+23      14+14 (VN)      0.90   0.76***    0.82
  VN+AN         23+23      18+20 (joint)   0.80   0.64***    0.76
  VN+AN         23+23      5+5 (AN)        0.57   -0.31      0.58

Table 1: Interpolated Average Precision, Spearman's correlation with the speaker judgments and F-measure for Verb-Noun training (VN), Adjective-Noun training (AN), joint training and training through vector concatenation (** = p < .01, *** = p < .001). Training and test sets are expressed as the sum of positive and negative examples.
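The ordering-task metrics reported in Table 1 can be computed with standard tools. The sketch below reflects our reading of the IAP definition used here (following Fazly et al. (2009), interpolated precision averaged over the 20%, 50% and 80% recall levels); the exact interpolation convention is not spelled out in the paper, so treat this as an approximation.

```python
import numpy as np
from scipy.stats import spearmanr

def interpolated_average_precision(scores, labels, recall_levels=(0.2, 0.5, 0.8)):
    """Rank items by descending score and average the interpolated precision
    (maximum precision at recall >= r) over the given recall levels."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    interp = [precision[recall >= r].max() for r in recall_levels]
    return float(np.mean(interp))

# Spearman's rho between the model's continuous scores and the 1-7 human
# idiomaticity judgments collected for the same test expressions:
# rho, p_value = spearmanr(model_scores, human_judgments)
```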
5.1 Verb-Noun

We ran our model on the VN dataset, composed of 90 elements: 45 idioms and 45 non-idiomatic expressions. This is the larger of the two datasets. We trained our model both on 30 and on 40 elements for 20 epochs and tested on the remaining 60 and 50 elements respectively, reaching a maximum IAP of 0.87 and a Spearman's ρ of 0.76. In general we found the model's performance, both in accuracy and in correlation, comparable to the results reported in Senaldi et al. (2016b), who reached a maximum IAP of 0.91 and a maximum Spearman's ρ of -0.67.

5.2 Adjective-Noun

We ran our model on the AN dataset, composed of 26 elements: 13 idioms and 13 non-idiomatic expressions. We empirically found that our model was able to perform some generalization on the data when the training set contained at least 14 elements, evenly balanced between positive and negative examples. We trained our model on 16 elements for 30 epochs and tested on the remaining 10 elements. While the exact accuracy value can undergo some fluctuations when a model is trained on very small sets, we always registered accuracies higher than 80%, with 4 out of 5 idioms correctly labeled in every trial. We reached an IAP of 1.0 and a ρ of 0.93, although it is important to keep in mind that such scores are computed on a very restricted test set. Senaldi et al. (2016b) reached a maximum IAP of 0.85 and a maximum ρ of -0.68. When the training size was under the critical threshold, accuracy dropped significantly. With training sets of 10 or 12 elements, our model naturally went into overfitting, quickly reaching 100% accuracy on the training set and failing to correctly classify unseen expressions. In these cases a partial learning was still visible in the ordering task, where most idioms, even if labeled incorrectly, received higher scores than non-idioms.

5.3 Joint training

We also tried to train our model on both datasets together, to check to what extent it would be able to recognize the same underlying semantic phenomenon across different syntactic constructions. We used two different approaches for this experiment. Training our model first on one dataset, e.g. the AN pairs, and then on the other required more epochs overall (more than 100) to stabilize and resulted in a poorer performance (66% F-measure on both test sets). Training our model on a mixed dataset containing the elements of both training sets, the model needed only 12 epochs to reach an F-measure of 76% on the mixed training set. However, we also noticed that VN expressions were learned better than AN expressions. In short, our model was able to generalize over the two datasets, but this involved a loss in accuracy.
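For concreteness, a single run of the protocol described in 5.1 and 5.2 might look like the sketch below. It reuses the hypothetical `build_classifier` and `interpolated_average_precision` helpers sketched earlier; the random, class-balanced train/test split and the 0.5 decision threshold are our assumptions about details the paper does not state.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def run_once(X, y, judgments, train_size, epochs, seed=0):
    """X: (n, 30000) expression vectors; y: 0/1 idiom labels (numpy arrays);
    judgments: mean 1-7 human idiomaticity ratings for the same expressions."""
    idx = np.arange(len(y))
    train_idx, test_idx = train_test_split(
        idx, train_size=train_size, stratify=y, random_state=seed)
    model = build_classifier(input_dim=X.shape[1])   # sketched in Section 4 above
    model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
    scores = model.predict(X[test_idx]).ravel()
    rho, _ = spearmanr(scores, judgments[test_idx])
    return {
        "F1": f1_score(y[test_idx], scores >= 0.5),
        "IAP": interpolated_average_precision(scores, y[test_idx]),
        "rho": rho,
    }

# e.g. run_once(X_vn, y_vn, judgments_vn, train_size=40, epochs=20)
```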
5.4 Vector composition

In addition to using the vector of an expression as a whole, we tried to feed our model with the concatenation of the vectors of the single words in an expression, as in Bizzoni et al. (2017). For example, instead of using the 30,000-dimensional vector of the expression cambiare musica, we used the 60,000-dimensional vector resulting from the concatenation of the vectors of cambiare and musica. We ran this experiment only on the VN dataset, as it is the largest and the one that yielded the best results in the previous settings. We used 30 elements in training and 26 in testing and trained our model for 80 epochs overall. Predictably enough, vector composition resulted in the worst performance, differently from what happened with metaphors (Bizzoni et al., 2017). Nonetheless, the results are not completely random: with an F1 of 69%, the model seems able to learn idiomaticity to a lower, but not null, degree. These findings would be in line with the claim that the meaning of the subparts of several idioms, while less important than in metaphors, is not completely obliterated (McGlone et al., 1994).
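The difference between the two input representations amounts to a small change in how each training vector is assembled; a sketch follows (the lookup tables `expr_vectors` and `word_vectors` are hypothetical placeholders for the DSM rows built as in Section 3.2):

```python
import numpy as np

def whole_expression_input(expr, expr_vectors):
    """Holistic setting: one 30,000-dimensional vector for the expression as a single token."""
    return expr_vectors[expr]                       # e.g. "cambiare musica"

def concatenated_input(expr, word_vectors):
    """Compositional setting: concatenate the component word vectors
    (two 30,000-dimensional vectors give one 60,000-dimensional input)."""
    return np.concatenate([word_vectors[w] for w in expr.split()])
```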
6 Error Analysis

Two frequent false positives are tagliare il traguardo and abbassare la guardia. Although we labeled them as non-idioms in our dataset, since they are rather compositional, they can nonetheless very often be used figuratively, which is probably why our classifier identified them as idioms. A frequent false negative was vedere la luce, which probably occurs more often in its literal sense in the corpus we used.

7 Discussion and Conclusions

It seems that the distribution of idiomatic and compositional expressions in large corpora can suffice for a supervised classifier to learn the difference between the two kinds of expressions from small training sets and with a good level of accuracy. Unlike with metaphors (Bizzoni et al., 2017), feeding the classifier with a composition of the individual words' vectors of such expressions performs rather poorly and can be used to detect only some idioms. This takes us back to the core difference that while metaphors are more compositional and preserve a transparent source-domain-to-target-domain mapping, idioms are by and large non-compositional. Since our classifiers rely only on contextual features, their classification ability must stem from a difference in distribution between idioms and non-idioms. A possible explanation is that while the literal expressions we selected, like vedere un film or ascoltare un discorso, tend to be used with animate subjects and thus to appear in more concrete contexts, most of our idioms (e.g. cadere dal cielo or lasciare il segno) allow for varying degrees of animacy or concreteness of the subject, and thus their contexts can easily become more diverse. At the same time, the drop in performance we observe in the joint models seems to indicate that the different parts of speech composing our elements entail a significant contextual difference between the two groups, which introduces a considerable amount of uncertainty into our model. It is also possible that other contextual elements we did not consider have played a role in the learning process of our models. We intend to investigate this aspect further in future work.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Baroni, M., Bernardini, S., Ferraresi, A., and Zanchetta, E. (2009). The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 238–247.

Bizzoni, Y., Chatzikyriakidis, S., and Ghanimifard, M. (2017). "Deep" learning: Detecting metaphoricity in adjective-noun pairs. In Proceedings of the Workshop on Stylistic Variation, pages 43–52.

Blacoe, W. and Lapata, M. (2012). A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556. Association for Computational Linguistics.

Bohrn, I. C., Altmann, U., and Jacobs, A. M. (2012). Looking at the brains behind figurative language: A quantitative meta-analysis of neuroimaging studies on metaphor, idiom, and irony processing. Neuropsychologia, 50(11):2669–2683.

Cacciari, C. and Glucksberg, S. (1991). Understanding idiomatic expressions: The contribution of word meanings. Advances in Psychology, 77:217–240.

Church, K. W. and Hanks, P. (1991). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Cordeiro, S., Ramisch, C., Idiart, M., and Villavicencio, A. (2016). Predicting the compositionality of nominal compounds: Giving word embeddings a hard time. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1986–1997.

Cruse, D. A. (1986). Lexical Semantics. Cambridge University Press.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391.

Fazly, A., Cook, P., and Stevenson, S. (2009). Unsupervised type and token identification of idiomatic expressions. Computational Linguistics, 1(35):61–103.

Fazly, A. and Stevenson, S. (2008). A distributional account of the semantics of multiword expressions. Italian Journal of Linguistics, 1(20):157–179.

Gibbs, R. W. (1993). Why idioms are not dead metaphors. In Idioms: Processing, Structure, and Interpretation, pages 57–77.

He, X. and Liu, Y. (2017). Not enough data? Joint inferring multiple diffusion networks via network generation priors. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 465–474.

Klyueva, N., Doucet, A., and Straka, M. (2017). Neural networks for multi-word expression detection. In Proceedings of the 13th Workshop on Multiword Expressions, pages 60–65.

Krippendorff, K. (2012). Content Analysis: An Introduction to Its Methodology. Sage.

Krčmář, L., Ježek, K., and Pecina, P. (2013). Determining compositionality of expressions using various word space models and measures. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 64–73.

Lakoff, G. and Johnson, M. (2008). Metaphors We Live By. University of Chicago Press.

Legrand, J. and Collobert, R. (2016). Phrase representations for multiword expressions. In Proceedings of the 12th Workshop on Multiword Expressions, pages 67–71.

Lenci, A. (2008). Distributional semantics in linguistic and cognitive research. Italian Journal of Linguistics, 20(1):1–31.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Lin, D. (1999). Automatic identification of non-compositional phrases. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 317–324.

Liu, D. (2003). The most frequently used spoken American English idioms: A corpus analysis and its implications. TESOL Quarterly, 37(4):671–700.

McGlone, M. S., Glucksberg, S., and Cacciari, C. (1994). Semantic productivity and idiom comprehension. Discourse Processes, 17(2):167–190.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, pages 3111–3119.

Mitchell, J. and Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Nunberg, G., Sag, I., and Wasow, T. (1994). Idioms. Language, 70(3):491–538.

Quartu, M. B. (1993). Dizionario dei modi di dire della lingua italiana. RCS Libri.

Rimell, L., Maillard, J., Polajnar, T., and Clark, S. (2016). RELPRON: A relative clause evaluation data set for compositional distributional semantics. Computational Linguistics, 42(4):661–701.

Sag, I. A., Baldwin, T., Bond, F., Copestake, A., and Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics, pages 1–15.

Senaldi, M. S. G., Lebani, G. E., and Lenci, A. (2016a). Determining the compositionality of noun-adjective pairs with lexical variants and distributional semantics. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016), pages 268–273.

Senaldi, M. S. G., Lebani, G. E., and Lenci, A. (2016b). Lexical variability and compositionality: Investigating idiomaticity with distributional semantic models. In Proceedings of the 12th Workshop on Multiword Expressions, pages 21–31.

Tanguy, L., Sajous, F., Calderone, B., and Hathout, N. (2012). Authorship attribution: Using rich linguistic features when training data is scarce. In PAN Lab at CLEF.

Torre, E. (2014). The Emergent Patterns of Italian Idioms: A Dynamic-Systems Approach. PhD thesis, Lancaster University.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Wulff, S. (2010). Rethinking Idiomaticity: A Usage-based Approach. A&C Black.