In God we trust. All others must bring data.
W. Edwards Deming

Using word embeddings to recognize idioms

Jing Peng, Anna Feldman
Department of Computer Science
Department of Linguistics
Montclair State University
USA
pengj@mail.montclair.edu, feldmana@mail.montclair.edu

Abstract

Expressions, such as add fuel to the fire, can be interpreted literally or idiomatically depending on the context they occur in. Many Natural Language Processing applications could improve their performance if idiom recognition were improved. Our approach is based on the idea that idioms violate cohesive ties in local contexts, while literal expressions do not. We propose two approaches: (1) Compute the inner product of context word vectors with the vector representing a target expression. Since literal vectors predict local contexts well, their inner product with contexts should be larger than that of idiomatic ones, thereby telling literals apart from idioms; and (2) Compute literal and idiomatic scatter (covariance) matrices from local contexts in word vector space. Since the scatter matrices represent context distributions, we can then measure the difference between the distributions using the Frobenius norm. For comparison, we implement Fazly et al. (2009)'s, Sporleder and Li (2009)'s, and Li and Sporleder (2010b)'s methods and apply them to our data. We provide experimental results validating the proposed techniques.

1 Introduction

Natural language is filled with emotion and implied intent, which are often not trivial to detect. One specific challenge is idioms. Figurative language draws on prior references and is unique to each culture, and sometimes what we don't say is even more important than what we do. This, naturally, presents a significant problem for many Natural Language Processing (NLP) applications as well as for big data analytics.

Idioms are conventionalized expressions whose figurative meanings cannot be derived from the literal meaning of the phrase. There is no single agreed-upon definition of idioms that covers all members of this class (Glucksberg, 1993; Cacciari, 1993; Nunberg et al., 1994; Sag et al., 2002; Villavicencio et al., 2004; Fellbaum et al., 2006). At the same time, idioms do not form a homogeneous class that can be easily defined. Some examples of idioms are I'll eat my hat (I'm confident), Cut it out (Stop talking/doing something), a blessing in disguise (some bad luck or misfortune results in something positive), kick the bucket (die), ring a bell (sound familiar), keep your chin up (remain cheerful), piece of cake (easy task), miss the boat (miss out on something), (to be) on the ball (be attentive/competent), put one's foot in one's mouth (say something one regrets), rake someone over the coals (to reprimand someone severely), under the weather (sick), a hot potato (controversial issue), an arm and a leg (expensive), at the drop of a hat (without any hesitation), barking up the wrong tree (looking in the wrong place), beat around the bush (avoiding main topic).

It turns out that expressions are often ambiguous between an idiomatic and a literal interpretation, as one can see in the examples below [1]:

(A) After the last page was sent to the printer, an editor would ring a bell, walk toward the door, and holler "Good night!" (Literal)
(B) His name never fails to ring a bell among local voters. Nearly 40 years ago, Carthan was elected mayor of Tchula... (Idiomatic)
(C) ... that caused the reactor to literally blow its top. About 50 tons of nuclear fuel evaporated in the explosion... (Literal)
(D) ... He didn't pound the table, he didn't blow his top. He always kept his composure. (Idiomatic)
(E) ... coming out of the fourth turn, slid down the track, hit the inside wall and then hit the attenuator at the start of pit road. (Literal)
(F) ... job training, research and more have hit a Republican wall. (Idiomatic)

[1] These examples are extracted from the Corpus of Contemporary American English (COCA) (http://corpus.byu.edu/coca/).

Fazly et al. (2009)'s analysis of 60 idioms from the British National Corpus (BNC) has shown that close to half of these also have a clear literal meaning; and of those with a literal meaning, on average around 40% of their usages are literal. Therefore, idioms present great challenges for many Natural Language Processing (NLP) applications. Most current translation systems rely on large repositories of idioms. Unfortunately, more frequently than not, MT systems are not able to translate idiomatic expressions correctly.

In this paper we describe an algorithm for automatic classification of idiomatic and literal expressions. Similarly to Peng et al. (2014), we treat idioms as semantic outliers. Our assumption is that the context word distribution for a literal expression will be different from the distribution for an idiomatic one. We capture the distribution in terms of a covariance matrix in vector space.

2 Previous Work

Previous approaches to idiom detection can be classified into two groups: 1) type-based extraction, i.e., detecting idioms at the type level; 2) token-based detection, i.e., detecting idioms in context. Type-based extraction is based on the idea that idiomatic expressions exhibit certain linguistic properties such as non-compositionality that can distinguish them from literal expressions [...] occur are more affective. Still, they treat idioms as semantic outliers.

3 Our Approach

We hypothesize that words in a given text segment that are representative of the local context are likely to associate strongly with a literal expression in the segment, in terms of projection (or inner product) of word vectors onto the vector representing the literal expression. We also hypothesize that the context word distribution for a literal expression in word vector space will be different from the distribution for an idiomatic one. This hypothesis also underlies the distributional approach to meaning (Firth, 1957; Katz and Giesbrecht, 2006).

3.1 Projection Based On Local Context Representation

The local context of a literal target verb-noun construction (VNC) must be different from that of an idiomatic one. We propose to exploit recent advances in vector space representation to capture the difference between local contexts (Mikolov et al., 2013a; Mikolov et al., 2013b).

A word can be represented by a vector of fixed dimensionality q that best predicts its surrounding words in a sentence or a document (Mikolov et al., 2013a; Mikolov et al., 2013b). Given such a vector representation, our first proposal is the following. Let v and n be the vectors corresponding to the verb and noun in a target verb-noun construction, as in blow whistle, where v ∈